Replacing text between n-th and (n+1)-th delimiter using sed - regex

I wonder how I can change a single value at a particular position in a pipe-delimited data set.
For example, I have this data set:
01|456|AAAA|James Bond|AAAA|207085
02|AAAA|BBBB|Marco Polo|BBBB|937311723
03|321332|BBBB|Brad Pitt|AAAA|6296903
04|3213|AAAA|AAAA|BBBB|62969
I want to change every "AAAA" value to "XXXX", but only between 4th and 5th pipe character ( | ). So, expected output will look like:
01|456|AAAA|James Bond|XXXX|207085
02|AAAA|BBBB|Marco Polo|BBBB|937311723
03|321332|BBBB|Brad Pitt|XXXX|6296903
04|3213|AAAA|AAAA|BBBB|62969
Is this achievable using only sed, or is it necessary to use something like awk?

Set the input field separator (FS) and the output field separator (OFS), and if column 5 equals AAAA, replace it with XXXX:
awk 'BEGIN{FS=OFS="|"} $5=="AAAA" {$5="XXXX"}1' file
Output:
01|456|AAAA|James Bond|XXXX|207085
02|AAAA|BBBB|Marco Polo|BBBB|937311723
03|321332|BBBB|Brad Pitt|XXXX|6296903
04|3213|AAAA|AAAA|BBBB|62969

Better to use awk for this:
awk 'BEGIN{FS=OFS="|"} {gsub(/A/, "X", $5)} 1' file
01|456|AAAA|James Bond|XXXX|207085
02|AAAA|BBBB|Marco Polo|BBBB|937311723
03|321332|BBBB|Brad Pitt|XXXX|6296903
04|3213|AAAA|AAAA|BBBB|62969
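BEGIN{FS=OFS="|"} sets the pipe as both the input and the output field separator
gsub(/A/, "X", $5) replaces every A with X, in the 5th column only (so a value such as ABBA would become XBBX; test $5=="AAAA" instead if only the exact value should change)
1 is an always-true pattern, so the default action (print the line) runs for every line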

awk -v start=4 -v end=5 'BEGIN{FS=OFS="|"}{for(i=start;i<=end;i++) gsub(/AAAA/,"XXXX",$i)}1' inputfile
01|456|AAAA|James Bond|XXXX|207085
02|AAAA|BBBB|Marco Polo|BBBB|937311723
03|321332|BBBB|Brad Pitt|XXXX|6296903
04|3213|AAAA|XXXX|BBBB|62969
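Based on the values of the start and end variables, gsub performs the replacement in every column from start to end, inclusive. Note that with start=4 the fourth column is changed as well (see the last output line); use start=5 and end=5 to touch only the value between the 4th and 5th pipe.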
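If only an exact value in one column should change, the same idea can be wrapped in a small shell function. This is a minimal sketch, not a definitive implementation: the name replace_in_column and its argument order are made up for illustration.

```shell
# Hypothetical helper: replace an exact value in one pipe-delimited column.
# Usage: replace_in_column COLUMN OLD NEW [FILE...]
replace_in_column() {
    col=$1 old=$2 new=$3
    shift 3
    awk -v col="$col" -v old="$old" -v new="$new" '
        BEGIN { FS = OFS = "|" }
        # Appending "" forces a string comparison, so "01" does not equal "1".
        ($col "") == old { $col = new }
        { print }
    ' "$@"
}

# Reads standard input when no file is given.
printf '%s\n' '01|456|AAAA|James Bond|AAAA|207085' '04|3213|AAAA|AAAA|BBBB|62969' |
    replace_in_column 5 AAAA XXXX
```

Because the comparison is an equality test rather than a regex, values that merely contain AAAA are left alone.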

This might work for you (GNU sed):
sed -r ':a;s/^(([^|]*\|){4}X*)[^X|]/\1X/;ta' file
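Iterate, replacing each character after the fourth | that is neither an X nor a | with an X. Note that this turns every character of the fifth field into X regardless of its value, so BBBB would become XXXX as well.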
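For an exact match in sed, a narrower substitution could look like the following. It is only a sketch and assumes a sed that supports -E (GNU and BSD sed both do); replace_field5 is a hypothetical name.

```shell
# Sketch: change the 5th pipe-delimited field from AAAA to XXXX, exact match only.
replace_field5() {
    # ^(([^|]*\|){4})  captures the first four fields with their trailing pipes
    # AAAA(\||$)       matches only when the whole 5th field is AAAA
    sed -E 's/^(([^|]*\|){4})AAAA(\||$)/\1XXXX\3/' "$@"
}

printf '%s\n' '01|456|AAAA|James Bond|AAAA|207085' '04|3213|AAAA|AAAA|BBBB|62969' |
    replace_field5
```

Since [^|]* cannot cross a pipe, the four repetitions always line up with the first four fields, and no loop is needed.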

Related

Concatenate urls based on result of two columns

I would like to first extract the string inside the parentheses in the first column, which I can do with:
awk -F"[()]" '{print $2}'
Then, concatenate it with the second column to create a URL with the following format:
"https://ftp.drupal.org/files/projects/"[firstcolumn stripped out of parenthesis]-[secondcolumn].tar.gz
With input like:
Admin Toolbar (admin_toolbar) 8.x-2.5
Entity Embed (entity_embed) 8.x-1.2
Views Reference Field (viewsreference) 8.x-2.0-beta2
Webform (webform) 8.x-5.28
Data from the first line would create this URL:
https://ftp.drupal.org/files/projects/admin_toolbar-8.x-2.5.tar.gz
Something like
sed 's!^[^(]*(\([^)]*\))[[:space:]]*\(.*\)!https://ftp.drupal.org/files/projects/\1-\2.tar.gz!' input.txt
If a file a has your input, you can try this:
$ awk -F'[()]' '
{
split($3,parts," +")
printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz\n", $2, parts[2]
}' a
https://ftp.drupal.org/files/projects/admin_toolbar-8.x-2.5.tar.gz
https://ftp.drupal.org/files/projects/entity_embed-8.x-1.2.tar.gz
https://ftp.drupal.org/files/projects/viewsreference-8.x-2.0-beta2.tar.gz
https://ftp.drupal.org/files/projects/webform-8.x-5.28.tar.gz
The trick is to split the third field ($3). Based on your field separator ( -F'[()]'), the third field contains everything after the right paren. So, split can be used to get rid of all the spaces. I probably should have searched for an awk "trim" equivalent.
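awk has no built-in trim function, but gsub can do the job. The following is a rough sketch of that variant, assuming the same input layout as in the question; make_urls is a hypothetical name.

```shell
# Sketch: build the URL by trimming whitespace around the version with gsub().
make_urls() {
    awk -F'[()]' '
        NF >= 3 {                                   # skip lines without parentheses
            version = $3
            gsub(/^[ \t]+|[ \t]+$/, "", version)    # strip leading and trailing blanks
            printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz\n", $2, version
        }
    ' "$@"
}

printf '%s\n' 'Admin Toolbar (admin_toolbar) 8.x-2.5' | make_urls
```

Working on a copy of $3 keeps the original record untouched, which matters if the line is printed later.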
In the example data, the second-to-last column contains the part in parentheses that you are interested in, and the last column contains the version.
If that is always the case, you can remove the parentheses from the second-to-last column, then join it to the last column with a hyphen.
awk '{
gsub(/[()]/, "", $(NF-1))
printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz%s", $(NF-1), $NF, ORS
}' file
Output
https://ftp.drupal.org/files/projects/admin_toolbar-8.x-2.5.tar.gz
https://ftp.drupal.org/files/projects/entity_embed-8.x-1.2.tar.gz
https://ftp.drupal.org/files/projects/viewsreference-8.x-2.0-beta2.tar.gz
https://ftp.drupal.org/files/projects/webform-8.x-5.28.tar.gz
Another option uses a regex with GNU awk: match with 2 capture groups captures what is between the parentheses and the next field.
awk 'match($0, /^[^()]*\(([^()]+)\)\s+(\S+)/, ary) {
printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz%s", ary[1], ary[2], ORS
}' file
This might work for you (GNU sed):
sed 's#.*(#https://ftp.drupal.org/files/projects/#;s/)\s*/-/;s/\s*$/.tar.gz/' file
Pattern match, replacing the unwanted parts by the required strings.
N.B. The # is used as the delimiter for the substitution command to avoid having to backslash-escape the slashes in the literal replacement.
The above solution could be condensed into a single substitution:
sed -E 's#.*\((.*)\)\s*(\S*).*#https://ftp.drupal.org/files/projects/\1-\2.tar.gz#' file

Adding commas when necessary to a csv file using regex

I have a csv file like the following:
entity_name,data_field_name,type
Unit,id
Track,id,LONG
The second row is missing a comma. I wonder if there might be some regex or awk like tool in order to append commas to the end of line in case there are missing commas in these rows?
Update
I know the requirements are a little vague. There might be several alternative ways to narrow down the requirements such as:
The header row should define the number of columns (and commas) that is valid for the whole file. The script should read the header row first and find out the correct number of columns.
The number of columns might be passed as an argument to the script.
The number of columns can be hardcoded into the script.
I didn't narrow down the requirements at first because I was ok with any of them. Of course, the first alternative is the best but I wasn't sure if this was easy to implement or not.
Thanks for all the great answers and comments. Next time, I will state acceptable alternative requirements explicitly.
You can use this awk command to fill up all rows starting from 2nd row with the empty cell values based on # of columns in the header row, in order to avoid hard-coding # of columns:
awk 'BEGIN{FS=OFS=","} NR==1{nc=NF} NF{$nc=$nc} 1' file
entity_name,data_field_name,type
Unit,id,
Track,id,LONG
Earlier solution:
awk 'BEGIN{FS=OFS=","} NR==1{nc=NF} {printf "%s", $0;
for (i=NF+1; i<=nc; i++) printf "%s", OFS; print ""}' file
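The question also mentions passing the number of columns as an argument. A minimal sketch of that alternative, with pad_columns as a made-up helper name:

```shell
# Sketch: pad short rows with empty cells up to a column count given as an argument.
# Usage: pad_columns COUNT [FILE...]
pad_columns() {
    nc=$1
    shift
    awk -v nc="$nc" '
        BEGIN { FS = OFS = "," }
        NF && NF < nc { $nc = "" }   # assigning field nc creates the missing empty fields
        { print }
    ' "$@"
}

printf '%s\n' 'entity_name,data_field_name,type' 'Unit,id' 'Track,id,LONG' | pad_columns 3
```

Empty lines are passed through unchanged, and rows that already have enough columns are not modified.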
I would use sed,
sed 's/^[^,]*,[^,]*$/&,/' file
Example:
$ echo 'Unit,id' | sed 's/^[^,]*,[^,]*$/&,/'
Unit,id,
$ echo 'Unit,id,bar' | sed 's/^[^,]*,[^,]*$/&,/'
Unit,id,bar
Try this:
$ awk -F, -v OFS=, 'NF==2{$2=$2","}1' file
Output:
entity_name,data_field_name,type
Unit,id,
Track,id,LONG
With another awk:
awk -F, 'NF==2{$3=""}1' OFS=, yourfile.csv
To balance all the awk solutions, the following could be a vim-only solution:
:v/,.*,/norm A,
Rationale:
/,.*,/ searches for 2 commas in a line
:v applies a command to each line NOT matching the search
norm A, switches to normal mode and appends a , to the end of the line
This MIGHT be all you need, depending on the info you haven't shared with us in your question:
$ awk -F, '{print $0 (NF<3?FS:"")}' file
entity_name,data_field_name,type
Unit,id,
Track,id,LONG

Delete rows with extra delimiter from csv file in unix

I have a csv file with 3 columns separated by a ',' delimiter. Some values contain a , in the data, and I would like to remove the whole record in that case. Please suggest how I can do this using sed, awk, or grep.
Input file :
monitor,display,45
keyboard,input,20
loud,speaker,output,20
mount,input,20
Expected Output :
monitor,display,45
keyboard,input,20
mount,input,20
I used grep command to filter out rows with extra commas.
grep -v '.*,.*,.*,.*' input_file > output_file
The pattern matches any line that contains at least three commas; each .* stands for whatever text surrounds the commas.
-v inverts the match, so the records that match the pattern are excluded.
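The filter can also be written as a positive match that keeps only rows with exactly two commas. This is a sketch; keep_three_columns is a hypothetical name, and unlike the grep -v version it also drops rows with too few columns.

```shell
# Sketch: keep only lines that consist of exactly three comma-separated fields.
keep_three_columns() {
    # [^,]* matches one field; (,[^,]*){2} requires exactly two more fields.
    grep -E '^[^,]*(,[^,]*){2}$' "$@"
}

printf '%s\n' 'monitor,display,45' 'keyboard,input,20' 'loud,speaker,output,20' 'mount,input,20' |
    keep_three_columns
```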
Below is how you can do the same using awk; basically, you want the records that have exactly 3 fields:
$ awk -F, 'NF==3 {print $0}' data1.txt
monitor,display,45
keyboard,input,20
mount,input,20

get the last word in body of text

Given a body of text that can span a varying number of lines, I need to use a grep, sed or awk solution to search through many files for the same pattern and get the last word in the body.
A file can include formats such as these, where the word I want can be named anything (in the files, the blocks are separated by blank lines):
call function1(input1,
input2, #comment
input3) #comment
returning randomname1,
randomname2,
success3
call function1(input1,
input2,
input3)
returning randomname3,
randomname2,
randomname3
call function1(input1,
input2,
input3)
returning anothername3,
randomname2, anothername3
I need to print out results as
success3
randomname3
anothername3
I also need the filename and line number for each match.
I've tried
pcregrep -M 'function1.*(\s*.*){6}(\w+)$' filename.txt
which is too greedy and I still need to print out just the specific grouped value and not the whole pattern. The words function1 and returning in my sample code will always be named as this and can be hard coded within my expression.
Last word of code blocks
Split file in blocks using awk's record separator RS. A record will be defined as a block of text, records are separated by double newlines.
A record consists of fields, each two consecutive fields are separated by white space or a single newline.
Now all we have to do is print the last field for each record, resulting in following code:
awk 'BEGIN{ FS="[\n\t ]"; RS="\n\n"} { print $NF }' file
Explanation:
FS is the field separator and is set to either a newline, a tab or a space: [\n\t ].
RS is the record separator and is set to a double newline: \n\n (a multi-character RS is an extension supported by gawk and mawk, not by every awk).
print $NF prints the field $ with index NF, a variable that holds the number of fields. Hence this prints the last field.
Note: To capture all paragraphs, the file should end in a double newline. This can be achieved by preprocessing the file with: $ echo -e '\n\n' >> file.
Alternate solution based on comments
A more elegant and simple solution is as follows:
awk -v RS='' '{ print $NF }' file
How about the following awk solution:
awk 'NF == 0 {if(last) print last; last=""} NF > 0 {last=$NF} END {print last}' file
$NF is the value of the last "word" on a line, where NF stands for the number of fields. The variable last always holds the last word of the most recent non-empty line and is printed when an empty line is encountered, which marks the end of a paragraph.
New version that only prints blocks containing function1:
awk 'NF == 0 {if(last && hasF) print last; last=hasF=""}
NF > 0 {last=$NF; if(/function1/)hasF=1}
END {if(hasF) print last}' filename.txt
This will produce the output you show from the input file you posted:
$ awk -v RS= '{print $NF}' file
success3
randomname3
anothername3
If you want to print FILENAME and line number like you mention then this may be what you want:
$ cat tst.awk
NF { nr=NR; last=$NF; next }
{ prt() }
END { prt() }
function prt() { if (nr) print FILENAME, nr, last; nr=0 }
$ awk -f tst.awk file
file 6 success3
file 13 randomname3
file 20 anothername3
If that doesn't do what you want, edit your question to provide clearer, more truly representative and accurate sample input and expected output.
This is the perl version of Shellfish's awk solution (plus the keywords):
perl -00 -nE '/function1/ and /returning/ and say ((split)[-1])' file
or, with one regex:
perl -00 -nE '/^(?=.*function1)(?=.*returning).*?(\S+)\s*$/s and say $1' file
But the key is the -00 option which reads the file a paragraph at a time.
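The same paragraph-at-a-time idea with both keywords can be sketched in awk as well. This assumes blocks are separated by blank lines; last_words is a made-up name, and in paragraph mode only the filename is available, not the line number.

```shell
# Sketch: awk paragraph mode (RS set to the empty string), like perl -00.
# Prints "FILENAME: lastword" for every block that mentions both keywords.
last_words() {
    awk -v RS= '/function1/ && /returning/ { print FILENAME ": " $NF }' "$@"
}

# Example call on one or more files (file names here are placeholders):
# last_words file1.txt file2.txt
```

In paragraph mode a newline always acts as a field separator too, so $NF is the last word of the whole block.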

Replacing specific characters in first column of text

I have a text file and I'm trying to replace a specific character (.) in the first column to another character (-). Every field is delimited by comma. Some of the lines have the last 3 columns empty, so they have 3 commas at the end.
Example of text file:
abc.def.ghi,123.4561.789,ABC,DEF,GHI
abc.def.ghq,124.4562.789,ABC,DEF,GHI
abc.def.ghw,125.4563.789,ABC,DEF,GHI
abc.def.ghe,126.4564.789,,,
abc.def.ghr,127.4565.789,,,
What I tried was using awk to replace '.' in the first column with '-', then print out the contents.
ETA: Tried out sarnold's suggestion and got the output I want.
ETA2: I could have a longer first column. Is there a way to change ONLY the first 3 '.' in the first column to '-', so I get the output
abc-def-ghi-qqq.www,123.4561.789,ABC,DEF,GHI
abc-def-ghq-qqq.www,124.4562.789,ABC,DEF,GHI
abc-def-ghw-qqq.www,125.4563.789,ABC,DEF,GHI
abc-def-ghe-qqq.www,126.4564.789,,,
abc-def-ghr-qqq.www,127.4565.789,,,
. is regexp notation for "any character". Escape it with \ and it matches a literal dot:
$ awk -F, '{gsub(/\./,"-",$1); print}' textfile.csv
abc-def-ghi 123.4561.789 ABC DEF GHI
abc-def-ghq 124.4562.789 ABC DEF GHI
abc-def-ghw 125.4563.789 ABC DEF GHI
abc-def-ghe 126.4564.789
abc-def-ghr 127.4565.789
$
The output field separator is a space by default, and awk rebuilds the line with it as soon as a field is modified. Set OFS = "," to change that:
$ awk -F, 'BEGIN {OFS=","} {gsub(/\./,"-",$1); print}' textfile.csv
abc-def-ghi,123.4561.789,ABC,DEF,GHI
abc-def-ghq,124.4562.789,ABC,DEF,GHI
abc-def-ghw,125.4563.789,ABC,DEF,GHI
abc-def-ghe,126.4564.789,,,
abc-def-ghr,127.4565.789,,,
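The effect of OFS can be seen in isolation with a no-op assignment such as $1 = $1, which forces awk to rebuild the record. A small sketch (the two function names are only for illustration):

```shell
# Sketch: assigning to any field makes awk rejoin all fields with OFS.
rebuild_default() { awk -F, '{ $1 = $1; print }'; }
rebuild_comma()   { awk -F, 'BEGIN { OFS = "," } { $1 = $1; print }'; }

printf 'a.b,c,d\n' | rebuild_default   # fields rejoined with the default OFS, a space
printf 'a.b,c,d\n' | rebuild_comma     # fields rejoined with a comma
```

Without any field assignment the record is printed exactly as it was read, which is why the separator problem only shows up after gsub touches $1.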
This still allows changing multiple fields:
$ awk -F, 'BEGIN {OFS=","} {gsub(/\./,"-",$1); gsub("1", "#",$2); print}' textfile.csv
abc-def-ghi,#23.456#.789,ABC,DEF,GHI
abc-def-ghq,#24.4562.789,ABC,DEF,GHI
abc-def-ghw,#25.4563.789,ABC,DEF,GHI
abc-def-ghe,#26.4564.789,,,
abc-def-ghr,#27.4565.789,,,
I don't know what -OFS, does, but it isn't a supported command line option; using it to set the output field separator was a mistake on my part. Setting OFS within the awk program works well.
This might work for you:
awk -F, -vOFS=, '{for(n=1;n<=3;n++)sub(/\./,"-",$1)}1' file
abc-def-ghi-qqq.www,123.4561.789,ABC,DEF,GHI
abc-def-ghq-qqq.www,124.4562.789,ABC,DEF,GHI
abc-def-ghw-qqq.www,125.4563.789,ABC,DEF,GHI
abc-def-ghe-qqq.www,126.4564.789,,,
abc-def-ghr-qqq.www,127.4565.789,,,