Selectively remove subfields from a CSV file in sed - regex

I have a CSV file my.csv, in which fields are separated by ;. Each field contains an arbitrary number (sometimes zero) of subfields separated by |, like this:
aa5|xb1;xc3;ba7|cx2|xx3|da2;ed1
xa2|bx9;ab5;;af2|xb5
xb7;xa6|fc5;fd6|xb5;xc3|ax9
df3;ab5|xc7|de2;da5;ax2|xd8|bb1
I would like to remove all subfields (with the corresponding |'s) that start with anything but x, i.e. get output like this:
xb1;xc3;xx3;
xa2;;;xb5
xb7;xa6;xb5;xc3
;xc7;;xd8
Now I am doing this in multiple steps with sed:
sed -i 's/^[^;x]*;/;/g' my.csv      # Empty the 1st field if it contains no x.
sed -i 's/;[^;x]*;/;;/g' my.csv     # Empty middle fields that contain no x.
sed -i 's/;[^;x]*$/;/g' my.csv      # Empty the last field if it contains no x.
sed -i 's/^[^;x][^;]*|x/x/g' my.csv # 1st field containing x: drop subfields before the x.
sed -i 's/;[^;x][^;]*|x/;x/g' my.csv # Other fields containing x: drop subfields before the x.
sed -i 's/|[^x][^;]*//g' my.csv     # Drop subfields after the x.
Is there a way to do it in one line, or at least more simply? I got stuck on the problem of how to match "line beginning OR ';'".
In my case it is guaranteed that there is no more than one subfield starting with x. In theory, however, it would also be useful to know how to solve the problem when that is not the case (e.g., converting the field ab1|xa2|bc3|xd4|ex5 to xa2|xd4).

Using sed
sed ':;s/\(^\||\|;\)[^x;|][^;|]*/\1/;t;s/|//g' file
It just loops, removing subfields that don't begin with x, and then removes the bars.
Output
xb1;xc3;xx3;
xa2;;;xb5
xb7;xa6;xb5;xc3
;xc7;;xd8
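The same script may be easier to follow with a named label (a sketch assuming GNU sed; the behaviour is identical to the one-liner above):
sed ':a;s/\(^\||\|;\)[^x;|][^;|]*/\1/;ta;s/|//g' file
Each pass deletes one subfield that does not start with x, keeping only the delimiter captured in \1 (line start, | or ;); ta repeats while a substitution succeeded, and the final s/|//g strips the leftover bars.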

You can use this awk:
awk 'BEGIN{FS=OFS=";"} {for (i=1; i<=NF; i++) {
gsub(/(^|\|)[^x][^|]*/, "", $i); sub(/^\|/, "", $i)}} 1' file
xb1;xc3;xx3;
xa2;;;xb5
xb7;xa6;xb5;xc3
;xc7;;xd8
This will also convert ab1|xa2|bc3|xd4|ex5 to xa2|xd4, i.e. it handles multiple subfields starting with x.
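For instance, feeding that single field through the same command:
$ echo 'ab1|xa2|bc3|xd4|ex5' | awk 'BEGIN{FS=OFS=";"} {for (i=1; i<=NF; i++) {gsub(/(^|\|)[^x][^|]*/, "", $i); sub(/^\|/, "", $i)}} 1'
xa2|xd4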

Consider using Perl:
perl -ple '$_ = join(";", map { join "|", grep /^x/, split /\|/ } split(/;/, $_, -1))'
This starts with split(/;/, $_, -1), splitting the line ($_ at this point) into an array of fields at semicolons. The negative limit parameter makes it so that trailing empty fields (if they exist) are not discarded.
The elements of that array are
transformed in the map expression, and
joined again with semicolons.
The transformation in the map expression is
splitting along |,
grepping for /^x/ (i.e., weeding out those that don't match the regex),
joining with | again.
I believe this structured approach to be more robust than regex wizardry.
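For example, on the first two sample lines (printf is used here only to feed the data):
$ printf 'aa5|xb1;xc3;ba7|cx2|xx3|da2;ed1\nxa2|bx9;ab5;;af2|xb5\n' | perl -ple '$_ = join(";", map { join "|", grep /^x/, split /\|/ } split(/;/, $_, -1))'
xb1;xc3;xx3;
xa2;;;xb5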
Old code that loses empty fields at the end of a line:
perl -F\; -aple '$_=join(";", map { join("|", grep(/^x/, split(/\|/, $_))) } @F)'
This used -a for auto-split, which looked nicer but didn't have the fine-grained control over field splitting that was required.

awk to the rescue!
awk -F";" -v OFS=";" '
{
  line = sep = ""
  for (i=1; i<=NF; i++) {
    c = split($i, s, "|")
    for (j=1; j<=c; j++)
      if (s[j] ~ /^x/) {
        line = line sep s[j]
        sep = OFS
      }
  }
  print line
}'
Each field is split further on | for the pattern check; matching subfields are appended to the output line, and the output separator is only set after the first subfield has been added on each line.

Related

Concatenate urls based on result of two columns

I would like to first extract the part of the first column that is inside the parentheses, which I can do with:
awk -F"[()]" '{print $2}'
Then, concatenate it with the second column to create a URL with the following format:
"https://ftp.drupal.org/files/projects/"[firstcolumn stripped out of parenthesis]-[secondcolumn].tar.gz
With input like:
Admin Toolbar (admin_toolbar) 8.x-2.5
Entity Embed (entity_embed) 8.x-1.2
Views Reference Field (viewsreference) 8.x-2.0-beta2
Webform (webform) 8.x-5.28
Data from the first line would create this URL:
https://ftp.drupal.org/files/projects/admin_toolbar-8.x-2.5.tar.gz
Something like
sed 's!^[^(]*(\([^)]*\))[[:space:]]*\(.*\)!https://ftp.drupal.org/files/projects/\1-\2.tar.gz!' input.txt
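Reading the pattern piece by piece (just an annotated restatement; the ! is only the s/// delimiter):
^[^(]*(          # everything up to and including the opening paren
\([^)]*\)        # capture 1: the machine name inside the parentheses
)[[:space:]]*    # the closing paren and any whitespace after it
\(.*\)           # capture 2: the version (the rest of the line)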
If a file a has your input, you can try this:
$ awk -F'[()]' '
{
split($3,parts," *")
printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz\n", $2, parts[2]
}' a
https://ftp.drupal.org/files/projects/admin_toolbar-8.x-2.5.tar.gz
https://ftp.drupal.org/files/projects/entity_embed-8.x-1.2.tar.gz
https://ftp.drupal.org/files/projects/viewsreference-8.x-2.0-beta2.tar.gz
https://ftp.drupal.org/files/projects/webform-8.x-5.28.tar.gz
The trick is to split the third field ($3). Based on your field separator (-F'[()]'), the third field contains everything after the right paren, so split can be used to get rid of all the spaces. I probably should have searched for an awk "trim" equivalent.
In the example data, the second-to-last column contains the parenthesized part that you are interested in, and the last column contains the value to append.
If that is always the case, you can remove the parentheses from the second-to-last column and concatenate it with a hyphen and the last column.
awk '{
gsub(/[()]/, "", $(NF-1))
printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz%s", $(NF-1), $NF, ORS
}' file
Output
https://ftp.drupal.org/files/projects/admin_toolbar-8.x-2.5.tar.gz
https://ftp.drupal.org/files/projects/entity_embed-8.x-1.2.tar.gz
https://ftp.drupal.org/files/projects/viewsreference-8.x-2.0-beta2.tar.gz
https://ftp.drupal.org/files/projects/webform-8.x-5.28.tar.gz
Another option is a regex with GNU awk, using match and two capture groups to capture what is between the parentheses and the next field.
awk 'match($0, /^[^()]*\(([^()]+)\)\s+(\S+)/, ary) {
printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz%s", ary[1], ary[2], ORS
}' file
This might work for you (GNU sed):
sed 's#.*(#https://ftp.drupal.org/files/projects/#;s/)\s*/-/;s/\s*$/.tar.gz/' file
Pattern match, replacing the unwanted parts with the required strings.
N.B. # is used as the delimiter for the substitution command so that the slashes in the literal replacement do not have to be escaped.
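For comparison, the first substitution written with the default / delimiter would need every slash in the URL escaped:
s/.*(/https:\/\/ftp.drupal.org\/files\/projects\//
whereas the # form above keeps the replacement literal.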
The above solution could be ameliorated into:
sed -E 's#.*\((.*)\)\s*(\S*).*#https://ftp.drupal.org/files/projects/\1-\2.tar.gz#' file

How to use 'sed' to add dynamic prefix to each number in integer list?

How can I use sed to add a dynamic prefix to each number in an integer list?
For example:
I have a string "A-1,2,3,4,5" and I want to transform it to the string "A-1,A-2,A-3,A-4,A-5" - which means I want to add the prefix attached to the first integer, i.e. "A-", to each number in the list.
If I have string like "B-1,20,300" then I want to transform it to string "B-1,B-20,B-300".
I am not able to use regex capturing groups because, with a global match, they do not retain their value across subsequent matches.
When it comes to looping constructs in sed, I like to use newlines as markers for the places I have yet to process. This makes matching much simpler, and I know they're not in the input because my input is a text line.
For example:
$ echo A-1,2,3,4,5 | sed 's/,/\n/g;:a s/^\([^0-9]*\)\([^\n]*\)\n/\1\2,\1/; ta'
A-1,A-2,A-3,A-4,A-5
This works as follows:
s/,/\n/g # replace all commas with newlines (insert markers)
:a # label for looping
s/^\([^0-9]*\)\([^\n]*\)\n/\1\2,\1/ # replace the next marker with a comma followed
# by the prefix
ta # loop unless there's nothing more to do.
The approach is similar to @potong's, but I find the regex much more readable -- \([^0-9]*\) captures the prefix, \([^\n]*\) captures everything up to the next marker (i.e. everything that's already been processed), and then it's just a matter of reassembling it in the substitution.
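With the second sample string the prefix is picked up in the same way (same command, different input):
$ echo B-1,20,300 | sed 's/,/\n/g;:a s/^\([^0-9]*\)\([^\n]*\)\n/\1\2,\1/; ta'
B-1,B-20,B-300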
Don't use sed, just use the other standard UNIX text manipulation tool, awk:
$ echo 'A-1,2,3,4,5' | awk '{p=substr($0,1,2); gsub(/,/,"&"p)}1'
A-1,A-2,A-3,A-4,A-5
$ echo 'B-1,20,300' | awk '{p=substr($0,1,2); gsub(/,/,"&"p)}1'
B-1,B-20,B-300
This might work for you (GNU sed):
sed -E ':a;s/^((([^-]+-)[^,]+,)+)([0-9])/\1\3\4/;ta' file
Uses pattern matching and a loop to replace each number that follows a comma with the first-column prefix followed by that number.
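For example:
$ echo 'A-1,2,3,4,5' | sed -E ':a;s/^((([^-]+-)[^,]+,)+)([0-9])/\1\3\4/;ta'
A-1,A-2,A-3,A-4,A-5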
Assuming this is for shell scripting, you can do so with 2 seds:
set string = "A1,2,3,4,5"
set prefix = `echo $string | sed 's/^\([A-Z]\).*/\1/'`
echo $string | sed 's/,\([0-9]\)/,'$prefix'-\1/g'
Output is
A1,A-2,A-3,A-4,A-5
With
set string = "B-1,20,300"
Output is
B-1,B-20,B-300
Could you please try the following (if you are OK with awk).
awk '
BEGIN{
  FS=OFS=","
}
{
  for(i=1;i<=NF;i++){
    if($i !~ /^A/ && $i !~ /\"A/){
      $i="A-"$i
    }
  }
}
1' Input_file
If your data is in file 'd', this was tried with GNU sed:
sed -E 'h;s/^(\w-).+/\1/;x;G;:s s/,([0-9]+)(.*\n(.+))/,\3\1\2/;ts; s/\n.+//' d
It extracts the prefix into the hold space, appends a copy of it to the line, then repeatedly inserts it after each comma that is followed by digits, and finally deletes the appended copy.

awk - parse text having same character in fields as delimiter

Consider this source:
field1;field2;"data;data field3";field4;"data;data field5";field6
field1;"data;data field2";field3;field4;field5;"data;data field6"
As you can see, the field delimiter is being used inside certain fields, enclosed between ". I cannot directly parse with awk because there is no way of avoiding unwanted splitting, at least I haven't found a way. Moreover, those special fields have a variable position within a line and they can occur once, twice, 4 times etc.
I thought of a solution involving a pre-parsing step, where I replace the ; in those fields with a placeholder of some sort. The problem is that sed/awk perform greedy regex matching, so in the above example I can only replace the ; within the last quoted field on each line.
How can I match each instance of quotes and replace the specific ; within them? I do not want to use perl or python etc.
Using GNU awk, you can use the special FPAT variable to define your fields with a regex.
You can use this command to replace all ; with | inside the double quotes:
awk -v OFS=';' -v FPAT='"[^"]*"|[^;]*' '{for (i=1; i<=NF; i++) gsub(/;/, "|", $i)} 1' file
field1;field2;"data|data field3";field4;"data|data field5";field6
field1;"data|data field2";field3;field4;field5;"data|data field6"
As an alternative to FPAT you can set the awk FS to be double quotes and then swap out your semicolon delimiter for every other field:
awk -F"\"" '{for(i=1;i<=NF;++i){ if(i%2==0) gsub(/;/, "|", $i)}} {print $0}' yourfile
Here awk is:
Splitting the record by double quote (-F"\"")
Looping through each field that it finds ({for(i=1;i<=NF;++i))
Testing whether the field number mod 2 is 0, i.e. an even-numbered field (if(i%2==0))
If it's even then it swaps out the semicolons with pipes (gsub(/;/, "|", $i))
Prints out the transformed record ({print $0})
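To see why the even-numbered fields are the quoted ones, it helps to print the fields that splitting on " produces (a quick check on a shortened version of the first sample line):
$ echo 'field1;field2;"data;data field3";field4' | awk -F'"' '{for(i=1;i<=NF;++i) printf "%d: %s\n", i, $i}'
1: field1;field2;
2: data;data field3
3: ;field4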

Swap Strings within a line in Bash

I'm parsing a document with a bash script and outputting different parts of it. At one point I need to find and reformat text in the form of:
(foo)[X]
[Y]
(bar)[Z]
to something like:
X->foo
Y
Z->bar
Now, I'm able to grep the parts I want with RegEx, but I'm having trouble swapping the two elements in one line and handling the fact that the text in parentheses is optional. Is this even possible with a combination of sed and grep?
Thank you for your time.
You can use sed:
sed -e 's/(\([^)]*\))\[\([^]]*\)]/\2->\1/' -e 's/\[\([^]]*\)]/\1/' file
This works for your given input example:
X->foo
Y
Z->bar
You might need to make the patterns more strict if you have more kinds of input to handle.
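Reading the two expressions in order (an annotated restatement of the command above):
s/(\([^)]*\))\[\([^]]*\)]/\2->\1/   # (foo)[X]  becomes  X->foo
s/\[\([^]]*\)]/\1/                  # a bare [Y] becomes  Y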
You can use awk:
awk -F '[][()]+' '{print (NF>3 ? $3 "->" $2 : $2)}' file
X->foo
Y
Z->bar
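The NF>3 test works because splitting on runs of brackets and parentheses leaves an empty leading field (and a trailing one when the line ends in ]). A quick check of the field counts, feeding the two line shapes via printf:
$ printf '(foo)[X]\n[Y]\n' | awk -F '[][()]+' '{print NF}'
4
3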
You can even do it in bash itself, although it's not pretty.
# Three capture groups:
# 1. The optional paranthesized text
# 2. The contents of the parentheses
# 3. The contents of the square brackets
regex="(\((.*)\))?\[(.*)\]"
while IFS= read -r str; do
[[ "$str" =~ $regex ]]
# Print the bracket text; if the 2nd capture (the parenthesized
# content) is non-empty, append -> followed by it.
echo "${BASH_REMATCH[3]}${BASH_REMATCH[2]:+->${BASH_REMATCH[2]}}"
done < file.txt

Replace delimiter in csv that is not between square brackets

I have a lot of csv files that I am having trouble reading since the delimiter is ',' and one of the fields is a list with comma separated values in square brackets. As an example:
first,last,list
John,Doe,['foo','234','&3bar']
Johnny,Does,['foofo','abc234','d%9lk','other']
I would like to change the delimiter to '|' (or whatever else) to get:
first|last|list
John|Doe|['foo','234','&3bar']
Johnny|Does|['foofo','abc234','d%9lk','other']
How can I do this? I'm trying to use sed right now, but anything that works is fine.
I don't know whether it is possible with sed or awk, but you can do this easily with perl. The \[.*?\](*SKIP)(*F) branch matches a bracketed list and then discards the attempt without retrying inside it, so only the commas outside the brackets reach the , alternative and get replaced with |.
$ perl -pe 's/\[.*?\](*SKIP)(*F)|,/|/g' file
first|last|list
John|Doe|['foo','234','&3bar']
Johnny|Does|['foofo','abc234','d%9lk','other']
Run the command below to save the changes to the file in place:
perl -i -pe 's/\[.*?\](*SKIP)(*F)|,/|/g' file
If it's always 2 values before the list, you could make use of the limit argument to split in perl:
perl -pe '$_ = join "|", split /,/, $_, 3' list
This splits on commas up to a maximum of 3 fields, then joins them back together with pipes. The -p switch means each line of input is stored in $_, the code is run, and then $_ is printed.
Output:
first|last|list
John|Doe|['foo','234','&3bar']
Johnny|Does|['foofo','abc234','d%9lk','other']