Replace last occurrence of a character in a field with awk - regex

I'm trying to replace the last occurrence of a character in a field with awk. Given a file like this one:
John,Doe,Abc fgh 123,Abc
John,Doe,Ijk-nop 45D,Def
John,Doe,Qr s Uvw 6,Ghi
I want to replace the last space " " with a comma ",", basically splitting the field into two. The result is supposed to look like this:
John,Doe,Abc fgh,123,Abc
John,Doe,Ijk-nop,45D,Def
John,Doe,Qr s Uvw,6,Ghi
I've tried to create a variable with the number of occurrences of spaces in the field with
{var1=gsub(/ /,"",$3)}
and then integrate it in
{var2=gensub(/ /,",",var1,$4); print var2}
but the how-argument in gensub does not allow any characters besides numbers and G/g.
I've found a similar thread here but wasn't able to adapt the solution to my problem.
I'm fairly new to this so any help would be appreciated!

With GNU awk for gensub():
$ awk 'BEGIN{FS=OFS=","} {$3=gensub(/(.*) /,"\\1,",1,$3)}1' file
John,Doe,Abc fgh,123,Abc
John,Doe,Ijk-nop,45D,Def
John,Doe,Qr s Uvw,6,Ghi
Get the book Effective Awk Programming by Arnold Robbins.
Very well-written question btw!
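gensub() is a GNU extension. For other awks, here is a minimal sketch of the same idea using only match() and substr(), assuming (as in the question) that the space to replace is always in the third comma-separated field. The sample rows are inlined with a here-document, and the variable name result is only for illustration:

```shell
result=$(awk 'BEGIN { FS = OFS = "," }
{
    # A space followed only by non-spaces up to the end of the field
    # is, by construction, the last space in field 3.
    if (match($3, / [^ ]*$/))
        $3 = substr($3, 1, RSTART - 1) "," substr($3, RSTART + 1)
    print
}' <<'EOF'
John,Doe,Abc fgh 123,Abc
John,Doe,Ijk-nop 45D,Def
John,Doe,Qr s Uvw 6,Ghi
EOF
)
printf '%s\n' "$result"
```

Lines whose third field has no space are printed unchanged, because the assignment only happens when match() succeeds.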

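Here is a short awk. It prefixes the last whitespace-separated field with RS (a newline, which cannot occur inside a record), which rebuilds the line with that marker right after the last space, and then replaces the space-plus-marker pair with a comma: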
awk '{$NF=RS$NF;sub(" "RS,",")}1' file
John,Doe,Abc fgh,123,Abc
John,Doe,Ijk-nop,45D,Def
John,Doe,Qr s Uvw,6,Ghi
Updated due to Ed's comment.
Or you can use the rev tool:
rev file | sed 's/ /,/' | rev
John,Doe,Abc fgh,123,Abc
John,Doe,Ijk-nop,45D,Def
John,Doe,Qr s Uvw,6,Ghi
Reverse the line, replace the first space with a comma, then reverse it again.
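The double reversal can probably be avoided by letting a greedy .* do the work in sed alone. This is a sketch, not a drop-in for every file: like the rev pipeline it operates on the whole line rather than on one field, so it assumes no later field contains a space. The variable name result is just for illustration:

```shell
# .* is greedy, so the single space matched after it is the last space on the line.
result=$(printf '%s\n' 'John,Doe,Abc fgh 123,Abc' 'John,Doe,Qr s Uvw 6,Ghi' |
    sed 's/\(.*\) /\1,/')
printf '%s\n' "$result"
```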

Related

compare with one of the 5 options

I am trying to find words where second or third character is one of "aeiou".
# cat t.txt
test
platform
axis
welcome
option
I tried this but the words "platform" and "axis" are missing from the output.
# awk 'substr($0,2,1) == "e" {print $0}' t.txt
test
welcome
You may use this awk solution that matches 1 or 2 of any char followed by one of the vowels:
awk '/^.{1,2}[aeiou]/' file
test
platform
axis
welcome
Or else use the substr function to extract the 2nd and 3rd characters and test whether that substring contains a vowel:
awk 'substr($0,2,2) ~ /[aeiou]/ ' file
test
platform
axis
welcome
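To stay closer to the substr() comparison from the question, the single "e" test can be generalized with index(). This is a sketch for any POSIX awk; the function name is_vowel_at is made up for this example and only lowercase vowels are considered:

```shell
result=$(printf '%s\n' test platform axis welcome option |
awk '
# Hypothetical helper: true when the character at position pos of s is a lowercase vowel.
function is_vowel_at(s, pos,    c) {
    c = substr(s, pos, 1)
    # Guard against an empty substring (line shorter than pos).
    return c != "" && index("aeiou", c) > 0
}
# A pattern with no action prints the matching lines.
is_vowel_at($0, 2) || is_vowel_at($0, 3)
')
printf '%s\n' "$result"
```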
As per the comment below, the OP wants the strings with any vowel in the 2nd or 3rd position removed. Here is a solution for that:
awk '{
s=substr($0,2,2)
gsub(/[aeiou]+/, "", s)
print substr($0,1,1) s substr($0, 4)
}' file
tst
pltform
axs
wlcome
option
PS: This sed would be shorter for replacement:
sed -E 's/^(.[^aeiou]{0,1})[aeiou]{1,2}/\1/' file
With your shown samples only, please try the following awk code. It was written and tested in GNU awk; POSIX leaves the behavior of an empty FS unspecified, so other awks may differ. Explanation: setting the field separator to the empty string makes every character its own field, so the line is printed if the 2nd or 3rd field (that is, the 2nd or 3rd character of the line) is one of a, e, i, o, u.
awk -v FS="" '$2~/^[aeiou]$/ || $3~/^[aeiou]$/' Input_file
This might work for you (GNU sed):
sed '/\<..\?[aeiou]/I!d' file
If the second or third character from the start of a word boundary is either a,e,i,o or u (of any case) don't delete the line.
I would use GNU AWK for this task in the following way. Let the content of file.txt be
test
platform
axis
welcome
option
then
awk 'BEGIN{FPAT="."}index("aeiou", $2)||index("aeiou", $3)' file.txt
gives output
test
platform
axis
welcome
Explanation: I tell GNU AWK through FPAT that a field is any single character (.), then filter lines using the index function. If the 2nd field (the 2nd character) occurs anywhere inside aeiou, index returns a value greater than zero, which is treated as true in a boolean context. The same test is applied to the 3rd field (the 3rd character), and the two results are combined with logical OR (||).
(tested in gawk 4.2.1)

Concatenate urls based on result of two columns

I would like to first extract the text inside the parentheses in the first column, which I can do with:
awk -F"[()]" '{print $2}'
Then, concatenate it with the second column to create a URL with the following format:
"https://ftp.drupal.org/files/projects/"[firstcolumn stripped out of parenthesis]-[secondcolumn].tar.gz
With input like:
Admin Toolbar (admin_toolbar) 8.x-2.5
Entity Embed (entity_embed) 8.x-1.2
Views Reference Field (viewsreference) 8.x-2.0-beta2
Webform (webform) 8.x-5.28
Data from the first line would create this URL:
https://ftp.drupal.org/files/projects/admin_toolbar-8.x-2.5.tar.gz
Something like
sed 's!^[^(]*(\([^)]*\))[[:space:]]*\(.*\)!https://ftp.drupal.org/files/projects/\1-\2.tar.gz!' input.txt
If a file a has your input, you can try this:
$ awk -F'[()]' '
{
split($3,parts," +")
printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz\n", $2, parts[2]
}' a
https://ftp.drupal.org/files/projects/admin_toolbar-8.x-2.5.tar.gz
https://ftp.drupal.org/files/projects/entity_embed-8.x-1.2.tar.gz
https://ftp.drupal.org/files/projects/viewsreference-8.x-2.0-beta2.tar.gz
https://ftp.drupal.org/files/projects/webform-8.x-5.28.tar.gz
The trick is to split the third field ($3). Based on your field separator (-F'[()]'), the third field contains everything after the right paren, including the leading space. Splitting it on spaces leaves an empty first element and the version string in parts[2]. I probably should have searched for an awk "trim" equivalent.
In the example data, the second-to-last column contains the parenthesized part that you are interested in, and the last column contains the version.
If that is always the case, you can remove the parentheses from the second-to-last column and join it to the last column with a hyphen.
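awk has no built-in trim function, but a gsub() that removes leading and trailing blanks is a common stand-in. A sketch under the same -F'[()]' assumption, with two of the sample rows inlined; the variable names version and result are illustrative only:

```shell
result=$(awk -F'[()]' '
{
    version = $3
    # Trim leading and trailing blanks from the text after ")".
    gsub(/^[ \t]+|[ \t]+$/, "", version)
    printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz\n", $2, version
}' <<'EOF'
Admin Toolbar (admin_toolbar) 8.x-2.5
Views Reference Field (viewsreference) 8.x-2.0-beta2
EOF
)
printf '%s\n' "$result"
```

Working on a copy (version) rather than on $3 itself leaves the original record untouched.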
awk '{
gsub(/[()]/, "", $(NF-1))
printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz%s", $(NF-1), $NF, ORS
}' file
Output
https://ftp.drupal.org/files/projects/admin_toolbar-8.x-2.5.tar.gz
https://ftp.drupal.org/files/projects/entity_embed-8.x-1.2.tar.gz
https://ftp.drupal.org/files/projects/viewsreference-8.x-2.0-beta2.tar.gz
https://ftp.drupal.org/files/projects/webform-8.x-5.28.tar.gz
Another option with a regex and GNU awk, using match with 2 capture groups to capture what is between the parentheses and the next field.
awk 'match($0, /^[^()]*\(([^()]+)\)\s+(\S+)/, ary) {
printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz%s", ary[1], ary[2], ORS
}' file
This might work for you (GNU sed):
sed 's#.*(#https://ftp.drupal.org/files/projects/#;s/)\s*/-/;s/\s*$/.tar.gz/' file
Pattern match, replacing the unwanted parts by the required strings.
N.B. The # is used as the delimiter of the substitution command to avoid having to backslash-escape the slashes in the literal replacement.
The above solution could be condensed into:
sed -E 's#.*\((.*)\)\s*(\S*).*#https://ftp.drupal.org/files/projects/\1-\2.tar.gz#' file

awk Regular Expression (REGEX) get phone number from file

The following is what I have written that would allow me to display only the phone numbers
in the file. I have posted the sample data below as well.
As I understand (read from left to right):
Using awk command delimited by "," if the first char is an Int and then an int preceded by [-,:] and then an int preceded by [-,:]. Show the 3rd column.
I used "www.regexpal.com" to validate my expression. I want to learn more and an explanation would be great not just the answer.
GNU bash, version 4.4.12(1)-release (x86_64-pc-linux-gnu)
awk -F "," '/^(\d)+([-,:*]\d+)+([-,:*]\d+)*$/ {print $3}' bashuser.csv
bashuser.csv
Jordon,New York,630-150,7234
Jaremy,New York,630-250-7768
Jordon,New York,630*150*7745
Jaremy,New York,630-150-7432
Jordon,New York,630-230,7790
Expected Output:
6301507234
6302507768
....
You could just remove all non-digit characters:
awk '{gsub(/[^[:digit:]]/, "")}1' file.csv
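gsub replaces every match in the line.
[^[:digit:]] matches any character that is not a digit; the leading ^ inside the brackets negates the class. Without the ^, the digits themselves would be removed.
"" is the replacement, an empty string, so each match is deleted.
1 is an always-true pattern with no action, so awk runs the default action and prints each (modified) line.
In sed: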
sed 's/[^[:digit:]]*//g' file.csv
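Stripping non-digits from the whole line works on the sample only because the name and city contain no digits. A slightly more defensive sketch, assuming the phone number is everything after the second comma, drops the first two fields before deleting non-digits (phone and result are illustrative names):

```shell
result=$(awk '
{
    phone = $0
    sub(/^[^,]*,[^,]*,/, "", phone)   # drop the name and city fields
    gsub(/[^0-9]/, "", phone)         # keep digits only
    print phone
}' <<'EOF'
Jordon,New York,630-150,7234
Jaremy,New York,630-250-7768
Jordon,New York,630*150*7745
EOF
)
printf '%s\n' "$result"
```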
Since your desired output always appears to start at field #3, you can simplify your regex considerably using the following:
awk -F '[*,-]' '{print $3$4$5}'
Proof of concept
$ awk -F '[*,-]' '{print $3$4$5}' < ./bashuser.csv
6301507234
6302507768
6301507745
6301507432
6302307790
Explanation
-F '[*,-]': Use a character class to set the field separators to * OR , OR -.
print $3$4$5: Concatenate the 3rd through 5th fields.
awk is not ideal here because the comma is not only the field separator but also appears inside some phone numbers; sed gives better results:
sed 's/[^,]\+,[^,]\+,//;s/[^0-9]//g;' bashuser.csv
The first part, s/[^,]\+,[^,]\+,//, removes the first two fields.
The second part, s/[^0-9]//g, removes all remaining non-numeric characters.

Awk 3 Spaces + 1 space or hyphen

I have a rather large chart to parse. Each column is separated by either 4 spaces or by 3 spaces and a hyphen (since the numbers in the chart can be negative).
cat DATA.txt | awk "{ print match($0,/\s\s/) }"
does nothing but print a slew of 0's. I'm trying to understand AWK and when to escape, etc, but I'm not getting the hang of it. Help is appreciated.
One line:
1979 1 -0.176 -0.185 -0.412 0.069 -0.129 0.297 -2.132 -0.334 -0.019
1979 1 -0.176 0.185 -0.412 0.069 -0.129 0.297 -2.132 -0.334 -0.019
I would like to get just, say, the second column. I copied the line, but I'd like to see -0.185 and 0.185.
You need to start by thinking about bash quoting, since it is bash which interprets the argument to awk which will be the awk program. Inside double-quoted strings, bash expands $0 to the name of the bash executable (or current script); that's almost certainly not what you want, since it will not be a quoted string. In fact, you almost never want to use double quotes around the awk program argument, so you should get into the habit of writing awk '...'.
Also, awk regular expressions don't understand \s (although GNU awk handles it as an extension). And match returns the position of the match, which I don't think you care about either.
Since by default, awk considers any sequence of whitespace a field separator, you don't really need to play any games to get the fourth column. Just use awk '{print $4}'
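A small sketch of both points, using the first sample line as it appears above. The variable names are illustrative, and [ \t] is used as a portable stand-in for \s:

```shell
line='1979 1 -0.176 -0.185 -0.412 0.069 -0.129 0.297 -2.132 -0.334 -0.019'

# Single quotes keep the shell from expanding $4; awk sees it as "field 4".
fourth=$(printf '%s\n' "$line" | awk '{ print $4 }')

# match() returns the 1-based position of the first run of two blanks, or 0 if there is none.
pos=$(printf 'a  b\n' | awk '{ print match($0, /[ \t][ \t]/) }')

printf '%s %s\n' "$fourth" "$pos"
```

With double quotes around the awk program, the shell would substitute its own $0 and $4 before awk ever ran, which is why the original command printed only zeros.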
Why not just use this simple awk:
awk '$0=$4' DATA.txt
-0.185
0.185
It sets $0 to the value of $4. The assignment doubles as the pattern, so the default action (print) runs whenever the result is truthy.
PS: do not use cat with a program that can read the file itself, like awk.
If field 4 can be 0 or empty, the assignment evaluates as false and the line is not printed, so a more robust version is:
awk '{$0=$4}1' DATA.txt
If you split the input on runs of 3 or 4 spaces, the value you want ends up in column 3:
$ awk -v FS=" {3,4}" '{print $3}' file
-0.185
0.185
FS=" {3,4}" passes a regex as the FS value, so the field separator becomes three or four consecutive spaces. In a regex, {min,max} is a range quantifier that repeats the previous token between min and max times.

Simplify points in KML using regex

I am trying to cut down the file size of a kml file I have.
The coordinates for the polygons are this accurate:
-113.52106535153605,53.912817815321503,0.
I am not very good with regex, but I think it would be possible to write one that selects the eight characters before the commas. I'd run a search and replace so the result would be
-113.521065,53.9128178,0.
Any regex experts out there think this is possible?
Try this
\d{8}(?=,)
and replace with an empty string
Here is something that might work. It replaces the 8 characters before each comma, plus the comma itself, with a comma. The g flag is needed so that every coordinate on the line is shortened, not just the first:
echo "-113.52106535153605,53.912817815321503,0." | sed 's/.\{8\},/,/g'
So you can run the file through sed like this:
sed 's/.\{8\},/,/g' file.kml > newfile.kml
I just had to do the same thing. This is perl instead of sed, but it looks for a run of eight uninterrupted digits and then replaces any further uninterrupted digits after that with nothing. It worked great.
cat originalfile.kml | perl -pe 's/(?<=\d{8})\d*//g' > shortenedfile.kml
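A similar truncation can be sketched with sed -E, anchored on the decimal point so that only fractional digits are shortened. The six-decimal cutoff is an arbitrary choice for this example. The command truncates rather than rounds, and it assumes every decimal number in the file may safely be shortened:

```shell
# Keep the integer part, the dot and six decimals; drop any further digits.
result=$(printf '%s\n' '-113.52106535153605,53.912817815321503,0.' |
    sed -E 's/([0-9]+\.[0-9]{6})[0-9]+/\1/g')
printf '%s\n' "$result"
```

Unlike removing a fixed number of characters before each comma, this gives every coordinate the same precision regardless of how many digits it started with.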