Replace delimiter in csv that is not between square brackets - regex

I have a lot of csv files that I am having trouble reading since the delimiter is ',' and one of the fields is a list with comma separated values in square brackets. As an example:
first,last,list
John,Doe,['foo','234','&3bar']
Johnny,Does,['foofo','abc234','d%9lk','other']
I would like to change the delimiter to '|' (or whatever else) to get:
first|last|list
John|Doe|['foo','234','&3bar']
Johnny|Does|['foofo','abc234','d%9lk','other']
How can I do this? I'm trying to use sed right now, but anything that works is fine.

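I don't know whether this is possible with sed or awk, but it is easy in Perl. The pattern first matches each bracketed list and discards that match with (*SKIP)(*F), so only the commas outside brackets are replaced.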
$ perl -pe 's/\[.*?\](*SKIP)(*F)|,/|/g' file
first|last|list
John|Doe|['foo','234','&3bar']
Johnny|Does|['foofo','abc234','d%9lk','other']
To edit the file in place, add the -i switch:
perl -i -pe 's/\[.*?\](*SKIP)(*F)|,/|/g' file
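On the open question of whether awk can do the same: here is a minimal sketch, assuming the square brackets on each line are balanced. It walks the line character by character and tracks the bracket depth. The function name pipe_outside_brackets is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: replace commas that are outside [...] with a pipe.
pipe_outside_brackets() {
  awk '{
    out = ""; depth = 0
    for (i = 1; i <= length($0); i++) {
      c = substr($0, i, 1)
      if (c == "[") depth++
      else if (c == "]" && depth > 0) depth--
      # Only a comma seen at depth 0 is a field delimiter.
      out = out ((c == "," && depth == 0) ? "|" : c)
    }
    print out
  }' "$@"
}

printf '%s\n' "John,Doe,['foo','234','&3bar']" | pipe_outside_brackets
```

Unlike the regex, this does not depend on PCRE backtracking verbs, so it should run with any POSIX awk.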

If it's always 2 values before the list, you could make use of the limit argument to split in perl:
perl -pe '$_ = join "|", split /,/, $_, 3' list
This splits on commas into at most 3 fields, then joins them back together with a pipe. The -p switch means that each line of input is stored in $_, processed by the code, and then printed.
Output:
first|last|list
John|Doe|['foo','234','&3bar']
Johnny|Does|['foofo','abc234','d%9lk','other']
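The same "only the first two commas" idea can be sketched in plain sed, under the same assumption that exactly two fields precede the list. The function name first_two_commas is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: turn only the first two commas of each line into pipes.
first_two_commas() {
  # An s command without the g flag replaces only the first remaining comma,
  # so running it twice handles exactly two delimiters.
  sed -e 's/,/|/' -e 's/,/|/' "$@"
}

printf '%s\n' "Johnny,Does,['foofo','abc234','d%9lk','other']" | first_two_commas
```

If the number of leading fields ever changes, the number of s commands has to change with it.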

I am having trouble with a regexp to remove some \n

I'm trying to define a regexp to remove some stray newlines from a file to be loaded into a DB.
Here is the fragment
200;GBP;;"";"";"";"";;;;"";"";"";"";;;"";1122;"BP JET WASH IP2 9RP
";"";Hamilton;"";;0;0;0;1;1;"";
This is the regexp I used on https://regex101.com/
(;"[[:alnum:] ]+)[\n]+([[:alnum:] ]*)"
Which should get two groups, one before and one after some newline.
regex101 reports that the groups are captured correctly.
But the result is wrong, because it still introduces an invisible newline.
I also tried sed, but the result is exactly the same.
So, the question is: Where am I wrong?
sed is line based. It's possible to achieve what you want, but I'd rather use a more suitable tool. For example, Perl:
perl -pe 's/\n/<>/e if tr/"// % 2 == 1' file.csv
-p reads the input line by line, running the code for each line before outputting it;
The /e option interprets the replacement in a substitution as code, in this case replacing the final newline with the following line (<> reads the input)
tr/"// in numeric context returns the number of matches, i.e. the number of double quotes;
If the number is odd, we remove the newline (% is the modulo operator).
The corresponding sed invocation would be
sed '/^\([^"]*"[^"]*"\)*[^"]*"[^"]*$/{N;s/\n//}' file.csv
On lines containing an unpaired double quote, it reads the next line into the pattern space (N) and removes the newline.
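The quote-parity idea can also be sketched in awk. This is an illustration only, assuming every quoted field is eventually closed. Unlike the one-liners above, it keeps appending lines until the quote count is even, so a field broken across more than two lines is joined as well. The function name join_open_quotes is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: join physical lines while a double-quoted field is open.
join_open_quotes() {
  awk '{
    buf = buf $0
    tmp = buf
    n = gsub(/"/, "", tmp)    # count the double quotes collected so far
    if (n % 2 == 1) next      # a quoted field is still open, keep reading
    print buf
    buf = ""
  }
  END { if (buf != "") print buf }' "$@"
}

printf '%s\n' '1122;"BP JET WASH IP2 9RP' '";"";Hamilton;"";' | join_open_quotes
```

The END rule flushes whatever is left if the input ends inside an unclosed quote, so no data is silently dropped.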
Update:
perl -ne 'chomp $p if ! /^[0-9]+;/; print $p; $p = $_; END { print $p }' file.csv
This should remove the newlines if they're not followed by a number and a semicolon. It keeps the previous line in the variable $p; if the current line doesn't start with a number followed by a semicolon, the newline is chomped from the previous line. Then the previous line is printed and the current line is remembered. The last line needs to be printed separately in the END block, as no following line triggers its printing.
perl -MText::CSV_XS=csv -wE'csv(in=>csv(in=>shift,sep=>";",on_in=>sub{s/\n+$// for@{$_[1]}}))' file.csv
will remove trailing newlines from every field in the CSV (with sep ;) and emit correct CSV (with sep ,). If you want ; in the output too, use
perl -MText::CSV_XS=csv -wE'csv(in=>csv(in=>shift,sep=>";",on_in=>sub{s/\n+$// for@{$_[1]}}),sep=>";")' file.csv
It's usually best to use an existing parser rather than writing your own.
I'd use the following Perl program:
perl -MText::CSV_XS=csv -e'
csv
in => *ARGV,
sep => ";",
blank_is_undef => 1,
quote_empty => 1,
on_in => sub { s/\n//g for @{ $_[1] }; };
' old.csv >new.csv
Output:
200;GBP;;"";"";"";"";;;;"";"";"";"";;;"";1122;"BP JET WASH IP2 9RP";"";Hamilton;"";;0;0;0;1;1;"";
If for some reason you want to avoid XS, the slower Text::CSV is a drop-in replacement.

Concatenate urls based on result of two columns

I would like to first take out of the string in the first column parenthesis which I can do with:
awk -F"[()]" '{print $2}'
Then, concatenate it with the second column to create a URL with the following format:
"https://ftp.drupal.org/files/projects/"[firstcolumn stripped out of parenthesis]-[secondcolumn].tar.gz
With input like:
Admin Toolbar (admin_toolbar) 8.x-2.5
Entity Embed (entity_embed) 8.x-1.2
Views Reference Field (viewsreference) 8.x-2.0-beta2
Webform (webform) 8.x-5.28
Data from the first line would create this URL:
https://ftp.drupal.org/files/projects/admin_toolbar-8.x-2.5.tar.gz
Something like
sed 's!^[^(]*(\([^)]*\))[[:space:]]*\(.*\)!https://ftp.drupal.org/files/projects/\1-\2.tar.gz!' input.txt
If a file a has your input, you can try this:
$ awk -F'[()]' '
{
split($3,parts," *")
printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz\n", $2, parts[2]
}' a
https://ftp.drupal.org/files/projects/admin_toolbar-8.x-2.5.tar.gz
https://ftp.drupal.org/files/projects/entity_embed-8.x-1.2.tar.gz
https://ftp.drupal.org/files/projects/viewsreference-8.x-2.0-beta2.tar.gz
https://ftp.drupal.org/files/projects/webform-8.x-5.28.tar.gz
The trick is to split the third field ($3). Based on your field separator ( -F'[()]'), the third field contains everything after the right paren. So, split can be used to get rid of all the spaces. I probably should have searched for an awk "trim" equivalent.
In the example data, the second-to-last column holds the parenthesized name you are interested in, and the last column holds the version.
If that is always the case, you can remove the parentheses from the second-to-last column and join it to the last column with a hyphen.
awk '{
gsub(/[()]/, "", $(NF-1))
printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz%s", $(NF-1), $NF, ORS
}' file
Output
https://ftp.drupal.org/files/projects/admin_toolbar-8.x-2.5.tar.gz
https://ftp.drupal.org/files/projects/entity_embed-8.x-1.2.tar.gz
https://ftp.drupal.org/files/projects/viewsreference-8.x-2.0-beta2.tar.gz
https://ftp.drupal.org/files/projects/webform-8.x-5.28.tar.gz
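For comparison, the same extraction can be sketched with POSIX shell parameter expansion alone. This is only an illustration, assuming each line has exactly one parenthesized name and no trailing whitespace after the version. The function name build_urls is made up.

```shell
#!/bin/sh
# Hypothetical helper: build one download URL per input line.
build_urls() {
  while IFS= read -r line; do
    name=${line#*\(}      # drop everything up to and including the first "("
    name=${name%%\)*}     # drop everything from the first ")" onward
    version=${line##* }   # keep the text after the last space
    printf 'https://ftp.drupal.org/files/projects/%s-%s.tar.gz\n' "$name" "$version"
  done
}

printf '%s\n' 'Admin Toolbar (admin_toolbar) 8.x-2.5' 'Webform (webform) 8.x-5.28' | build_urls
```

A shell loop is slower than awk or sed on large inputs, so this suits short lists like the one in the question.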
Another option with a regex and GNU awk, using match and 2 capture groups to capture what is between the parentheses and the next field.
awk 'match($0, /^[^()]*\(([^()]+)\)\s+(\S+)/, ary) {
printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz%s", ary[1], ary[2], ORS
}' file
This might work for you (GNU sed):
sed 's#.*(#https://ftp.drupal.org/files/projects/#;s/)\s*/-/;s/\s*$/.tar.gz/' file
Pattern match, replacing the unwanted parts by the required strings.
N.B. # is used as the delimiter of the substitution command to avoid inserting backslashes into the literal replacement.
The above solution could be improved to:
sed -E 's#.*\((.*)\)\s*(\S*).*#https://ftp.drupal.org/files/projects/\1-\2.tar.gz#' file

Replace all commas between two quotes in a bash script

I need that all "," between two " are replaced with ";" within a bash script. I'm close, but hours on the internet and stackoverflow led me to this:
echo ',,Lung,,"Lobular, each.|lungs, right.",false,,,,"organ, left.",,,,,' | sed -r ':a;s/(".*?),(.*?")/\1;\2/;ta'
With the result:
,,Lung,,"Lobular; each.|lungs; right.";false;;;;"organ; left.",,,,,
Correct would be:
,,Lung,,"Lobular; each.|lungs; right.",false,,,,"organ; left.",,,,,
Not sure how you want to deal with lines that have an odd number of double quotes (eg, the double quoted string spans multiple lines), but perhaps:
awk '!(NR%2){gsub(",",";")} 1' RS=\" ORS=\"
This simply treats " as the record separator and does the replacement only on even-numbered records, i.e. the text between a pair of quotes. Seems to work as desired. (Or, rather, it works as you seem to desire!)
As oguz points out in a comment, this prints an extra " at the end. That can be fixed with:
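awk '!(NR%2){gsub(",",";")} {printf "%s%s", RFS, $0} {RFS="\""}' RS=\"
which is a bit uglier but more correct (or, rather, less incorrect!). RFS is just an ordinary variable that holds the separator to print before each record, and the explicit "%s%s" format keeps any % in the data from being interpreted by printf. If your input stream ends with a ", that quote will be truncated. If, however, your input is terminated by a newline rather than a ", this will do what you want.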
OTOH, you might just want to do:
perl -wpE 'BEGIN{$/=\1}; y/,/;/ if $in; $in = ! $in if $_ eq "\""'
This reads one character at a time and uses a simple state machine. ($_ is the current character, so $in = ! $in changes state when a double quote is seen, and the transliteration only happens when $in is true.)
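The same state machine can be sketched in POSIX awk, for systems without Perl. This is an illustration only. It assumes that quoted strings never span lines, because the state is reset at the start of each line. The function name semicolons_inside_quotes is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: replace commas with semicolons inside double quotes.
semicolons_inside_quotes() {
  awk '{
    out = ""; inq = 0
    for (i = 1; i <= length($0); i++) {
      c = substr($0, i, 1)
      if (c == "\"") inq = !inq             # toggle state on every quote
      else if (c == "," && inq) c = ";"     # rewrite only while inside quotes
      out = out c
    }
    print out
  }' "$@"
}

echo ',,Lung,,"Lobular, each.|lungs, right.",false,,,,"organ, left.",,,,,' | semicolons_inside_quotes
```

Because it works line by line, it neither adds nor drops a trailing quote, which was the weak spot of the record-separator approach.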
If you /really/ wanted to use sed, you could do a whole-line replace and include a clause like ^(([^"]*"[^"]*")*[^"]*) at the beginning of your existing expression. It consumes an even number of quotes, which ensures that the quote matched next is an "odd" one, i.e. an opening quote.

How to use 'sed' to add dynamic prefix to each number in integer list?

How can I use sed to add a dynamic prefix to each number in an integer list?
For example:
I have a string "A-1,2,3,4,5", I want to transform it to string "A-1,A-2,A-3,A-4,A-5" - which means I want to add prefix of first integer i.e. "A-" to each number of the list.
If I have string like "B-1,20,300" then I want to transform it to string "B-1,B-20,B-300".
I am not able to use RegEx Capturing Groups because for global match they do not retain their value in subsequent matches.
When it comes to looping constructs in sed, I like to use newlines as markers for the places I have yet to process. This makes matching much simpler, and I know they're not in the input because my input is a text line.
For example:
$ echo A-1,2,3,4,5 | sed 's/,/\n/g;:a s/^\([^0-9]*\)\([^\n]*\)\n/\1\2,\1/; ta'
A-1,A-2,A-3,A-4,A-5
This works as follows:
s/,/\n/g # replace all commas with newlines (insert markers)
:a # label for looping
s/^\([^0-9]*\)\([^\n]*\)\n/\1\2,\1/ # replace the next marker with a comma followed
# by the prefix
ta # loop unless there's nothing more to do.
The approach is similar to @potong's, but I find the regex much more readable -- \([^0-9]*\) captures the prefix, \([^\n]*\) captures everything up to the next marker (i.e. everything that's already been processed), and then it's just a matter of reassembling it in the substitution.
Don't use sed, just use the other standard UNIX text manipulation tool, awk:
$ echo 'A-1,2,3,4,5' | awk '{p=substr($0,1,2); gsub(/,/,"&"p)}1'
A-1,A-2,A-3,A-4,A-5
$ echo 'B-1,20,300' | awk '{p=substr($0,1,2); gsub(/,/,"&"p)}1'
B-1,B-20,B-300
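The substr call above assumes a prefix of exactly two characters. A minimal sketch that takes everything before the first digit as the prefix instead is shown below. It assumes the prefix contains no & or backslash, since those are special in a gsub replacement. The function name prefix_each is made up.

```shell
#!/bin/sh
# Hypothetical helper: repeat the leading non-digit prefix after every comma.
prefix_each() {
  awk '{
    match($0, /^[^0-9]*/)          # leading run of non-digits, e.g. "A-"
    p = substr($0, 1, RLENGTH)
    gsub(/,/, "," p)               # assumes the prefix has no "&" or "\"
    print
  }' "$@"
}

echo 'B-1,20,300' | prefix_each
echo 'AB-1,2,3' | prefix_each
```

With this variant a longer prefix such as AB- is handled the same way as A-.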
This might work for you (GNU sed):
sed -E ':a;s/^((([^-]+-)[^,]+,)+)([0-9])/\1\3\4/;ta' file
Uses pattern matching and a loop to replace a number following a comma by the first column prefix and that number.
Assuming this is for shell scripting, you can do so with 2 seds (shown here in csh syntax):
set string = "A1,2,3,4,5"
set prefix = `echo $string | sed 's/^\([A-Z]\).*/\1/'`
echo $string | sed 's/,\([0-9]\)/,'$prefix'-\1/g'
Output is
A1,A-2,A-3,A-4,A-5
With
set string = "B-1,20,300"
Output is
B-1,B-20,B-300
You could try the following, if awk is acceptable. Note that it hard-codes the prefix A-.
awk '
BEGIN{
FS=OFS=","
}
{
for(i=1;i<=NF;i++){
if($i !~ /^A/&&$i !~ /\"A/){
$i="A-"$i
}
}
}
1' Input_file
If your data is in a file named d, this was tested with GNU sed:
sed -E 'h;s/^(\w-).+/\1/;x;G;:s s/,([0-9]+)(.*\n(.+))/,\3\1\2/;ts; s/\n.+//' d

Swap Strings within a line in Bash

I'm parsing a document with a bash script and outputting different parts of it. At one point I need to find and reformat text of the form:
(foo)[X]
[Y]
(bar)[Z]
to something like:
X->foo
Y
Z->bar
Now, I'm able to grep the parts I want with RegEx, but I'm having trouble swapping the two elements in one line and handling the fact that the text in parentheses is optional. Is this even possible with a combination of sed and grep?
Thank You for your time.
You can use sed:
sed -e 's/(\([^)]*\))\[\([^]]*\)]/\2->\1/' -e 's/\[\([^]]*\)]/\1/' file
This works for your given input example:
X->foo
Y
Z->bar
You might need to make the patterns more strict if you have more kinds of input to handle.
You can use awk:
awk -F '[][()]+' '{print (NF>3 ? $3 "->" $2 : $2)}' file
X->foo
Y
Z->bar
You can even do it in bash itself, although it's not pretty.
# Three capture groups:
# 1. The optional parenthesized text
# 2. The contents of the parentheses
# 3. The contents of the square brackets
regex="(\((.*)\))?\[(.*)\]"
while IFS= read -r str; do
[[ "$str" =~ $regex ]]
# If the 2nd array element is not empty, print -> followed by the
# non-empty value.
echo "${BASH_REMATCH[3]}${BASH_REMATCH[2]:+->${BASH_REMATCH[2]}}"
done < file.txt