Split string on a backslash ("\") delimiter in awk? - regex

I am trying to split the string in a file based on some delimiter.But I am not able to achieve it correctly... Here is my code below.
awk 'var=split($2,arr,'\'); {print $var}' file1.dat
Here is my sample data guys.
Col1 Col2
abc 123\abc
abcd 123\abcd
Desire output:
Col1 Col2
abc abc
abcd abcd

You don't need to call split. Just use \\ as field separator:
echo 'a\b\c\d' | awk -F\\ '{printf("%s,%s,%s,%s\n", $1, $2, $3, $4)}'
OUTPUT:
a,b,c,d

Sample data and output is my best guess at your requirement
echo '1:2\\a\\b:3' | awk -F: '{
n=split($2,arr,"\\")
# print "#dbg:n=" n
var=arr[3]
print var
}'
output
b
Recall that split returns the number of fields that it found to split. You can uncomment the debug line and you'll see the value 3 returned.
Note also that for my test, I had to use 2 '\' chars for 1 to be processed. I don't think you'll need that in a file, but if this doesn't work with a file, then try adding extra '\' as needed to your data. I tried several variations on how to use '\', and this seems the most straightforward. Others are welcome to comment!
I hope this helps.

As some of the comments mentioned, you have nested single quotes. Switching one set to use double quotes should fix it.
awk 'var=split($2,arr,"\"); {print $var}' file1.dat
I'd prefer piping to another awk command to using split.I don't know that one is better than the other, it is just a preference.
awk '{print $2}' file1.dat | awk -F'\' '{...}'

You need to escape the backslash you're trying to split on. You can do this in you split using double-quotes like this: "\\"
Also, you can take an array slice to make your code more readable (and avoid defining another var). This should work for you:
awk 'NR==1 { print } NR>=2 { split($0,array,"\\"); print $1,array[2] }' file1.dat
HTH

awk '{sub(/123\\/,"")}1' file
Col1 Col2
abc abc
abcd abcd

Related

Concatenate urls based on result of two columns

I would like to first take out of the string in the first column parenthesis which I can do with:
awk -F"[()]" '{print $2}'
Then, concatenate it with the second column to create a URL with the following format:
"https://ftp.drupal.org/files/projects/"[firstcolumn stripped out of parenthesis]-[secondcolumn].tar.gz
With input like:
Admin Toolbar (admin_toolbar) 8.x-2.5
Entity Embed (entity_embed) 8.x-1.2
Views Reference Field (viewsreference) 8.x-2.0-beta2
Webform (webform) 8.x-5.28
Data from the first line would create this URL:
https://ftp.drupal.org/files/projects/admin_toolbar-8.x-2.5.tar.gz
Something like
sed 's!^[^(]*(\([^)]*\))[[:space:]]*\(.*\)!https://ftp.drupal.org/files/projects/\1-\2.tar.gz!' input.txt
If a file a has your input, you can try this:
$ awk -F'[()]' '
{
split($3,parts," *")
printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz\n", $2, parts[2]
}' a
https://ftp.drupal.org/files/projects/admin_toolbar-8.x-2.5.tar.gz
https://ftp.drupal.org/files/projects/entity_embed-8.x-1.2.tar.gz
https://ftp.drupal.org/files/projects/viewsreference-8.x-2.0-beta2.tar.gz
https://ftp.drupal.org/files/projects/webform-8.x-5.28.tar.gz
The trick is to split the third field ($3). Based on your field separator ( -F'[()]'), the third field contains everything after the right paren. So, split can be used to get rid of all the spaces. I probably should have searched for an awk "trim" equivalent.
In the example data, the second last column seems to contain the part with the parenthesis that you are interested in, and the value of the last column.
If that is always the case, you can remove the parenthesis from the second last column, and concat the hyphen and the last column.
awk '{
gsub(/[()]/, "", $(NF-1))
printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz%s", $(NF-1), $NF, ORS
}' file
Output
https://ftp.drupal.org/files/projects/admin_toolbar-8.x-2.5.tar.gz
https://ftp.drupal.org/files/projects/entity_embed-8.x-1.2.tar.gz
https://ftp.drupal.org/files/projects/viewsreference-8.x-2.0-beta2.tar.gz
https://ftp.drupal.org/files/projects/webform-8.x-5.28.tar.gz
Another option with a regex and gnu awk, using match and 2 capture groups to capture what is between the parenthesis and the next field.
awk 'match($0, /^[^()]*\(([^()]+)\)\s+(\S+)/, ary) {
printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz%s", ary[1], ary[2], ORS
}' file
This might work for you (GNU sed):
sed 's#.*(#https://ftp.drupal.org/files/projects/#;s/)\s*/-/;s/\s*$/.tar.gz/' file
Pattern match, replacing the unwanted parts by the required strings.
N.B. The use of the # as a delimiter for the substitution command to avoid inserting back slashes into the literal replacement.
The above solution could be ameliorated into:
sed -E 's#.*\((.*)\)\s*(\S*).*#https://ftp.drupal.org/files/projects/\1-\2.tar.gz#' file

Regex replacement for SQL using sed

I have a file containing many SQL statements and need to add escape characters, using SED, for single quotes withing the SQL statements. Consider the following:
INSERT INTO MYTABLE VALUES (1,'some text','Drink at O'Briens');
In the above we need to escape the single quote in O'Briens. Using regex I can find the string using [a-zA-Z ]'[a-zA-Z ].
So this will find the 3 characters of interest, however if I do the following sed command:
sed -i "s/[a-zA-Z ]'[a-zA-Z ]/''/g" file.sql
This, however, removes the O and the B so I end up with:
INSERT INTO MYTABLE VALUES (1,'some text','Drink at ''riens');
How do I isolate/reference the O and the B so the string becomes:
INSERT INTO MYTABLE VALUES (1,'some text','Drink at O''Briens');
Use capture groups to copy parts of the input to the result.
sed -r -i "s/([a-zA-Z ])'([a-zA-Z ])/\1''\2/g" file.sql
You could do this in awk. Simple explanation would be, perform substitution on last field of line, where substitute ' with 2 instances of ' and print the line then.
awk '{sub(/\047/,"&&",$NF)} 1' Input_file
Above code will only print the lines in output, in case you want to perform inplace save then try following.
awk '{sub(/\047/,"&&",$NF)} 1' Input_file > temp && mv temp Input_file

awk Regular Expression (REGEX) get phone number from file

The following is what I have written that would allow me to display only the phone numbers
in the file. I have posted the sample data below as well.
As I understand (read from left to right):
Using awk command delimited by "," if the first char is an Int and then an int preceded by [-,:] and then an int preceded by [-,:]. Show the 3rd column.
I used "www.regexpal.com" to validate my expression. I want to learn more and an explanation would be great not just the answer.
GNU bash, version 4.4.12(1)-release (x86_64-pc-linux-gnu)
awk -F "," '/^(\d)+([-,:*]\d+)+([-,:*]\d+)*$/ {print $3}' bashuser.csv
bashuser.csv
Jordon,New York,630-150,7234
Jaremy,New York,630-250-7768
Jordon,New York,630*150*7745
Jaremy,New York,630-150-7432
Jordon,New York,630-230,7790
Expected Output:
6301507234
6302507768
....
You could just remove all non int
awk '{gsub(/[^[:digit:]]/, "")}1' file.csv
gsub remove all match
[^[:digit:]] the ^ everything but what is next to it, which is an int [[:digit:]], if you remove the ^ the reverse will happen.
"" means remove or delete in awk inside the gsub statement.
1 means print all, a shortcut for print
In sed
sed 's/[^[:digit:]]*//g' file.csv
Since your desired output always appears to start on field #3, you can simplify your regrex considerably using the following:
awk -F '[*,-]' '{print $3$4$5}'
Proof of concept
$ awk -F '[*,-]' '{print $3$4$5}' < ./bashuser.csv
6301507234
6302507768
6301507745
6301507432
6302307790
Explanation
-F '[*,-]': Use a character class to set the field separators to * OR , OR -.
print $3$4$5: Concatenate the 3rd through 5th fields.
awk is not very suitable because the comma occurs not only as a separator of records, better results will give sed:
sed 's/[^,]\+,[^,]\+,//;s/[^0-9]//g;' bashuser.csv
first part s/[^,]\+,[^,]\+,// removes first two records
second part //;s/[^0-9]//g removes all remaining non-numeric characters

Parse a file without common delimiter in shell

I would like to ask you for help with parsing a file in shell.
Here is my data:
ID:1 g-t="Demo one" rfid="af7e 25" t-link="http://demo.site.com/api2",User af73 25 http://example.com/useraf73
ID:2 g-t="Demo one" rfid="77 63" t-link="http://demo.site.com/api",User 77 http://example.com/user77
There is no common delimiter, basically I need these fields:
ID=1 | g-t="Demo one" | rfid="af7e 25" | t-link="http://demo.site.com/api2" | User af73 25 | http://example.com/useraf73
Here is where I am stuck:
awk '{match($0,"g-t=([^\" ]+)",a)}END{print a[1]}'
I am trying to match double quote with space but I have no idea why it is not printing the result. All the chars work fine except double quotes.
What I am doing wrong? Awk is not a must here, I am open to suggestions.
Thanks.
It has been quite a while since I regularly used awk but if I remember correctly match() takes only 2 args and END{} happens only once, not for every line like I think you want. Something like:
awk '{match($0,/g-t="([^\"]+")/); print substr($0, RSTART, RLENGTH)}' dataFile
may be closer to what you had in mind?
A brute force Perl one-liner could look something like this:
perl -lne 'if (m/ID:(\S+) g-t="([^"]+)" rfid="([^"]+)" t-link="([^"]+)",User (.*) (http:.*)/){print "$1|$2|$3|$4|$5|$6"}' dataFile
and demonstrates getting all of the fields data separated by OR bars. You can move the () groups around to get more or less of the text you want for each resultant $1, $2 etc... See perldoc perl for more information.

Awk 3 Spaces + 1 space or hyphen

I have a rather large chart to parse. Each column is separated by either 4 spaces or by 3 spaces and a hyphen (since the numbers in the chart can be negative).
cat DATA.txt | awk "{ print match($0,/\s\s/) }"
does nothing but print a slew of 0's. I'm trying to understand AWK and when to escape, etc, but I'm not getting the hang of it. Help is appreciated.
One line:
1979 1 -0.176 -0.185 -0.412 0.069 -0.129 0.297 -2.132 -0.334 -0.019
1979 1 -0.176 0.185 -0.412 0.069 -0.129 0.297 -2.132 -0.334 -0.019
I would like to get just, say, the second column. I copied the line, but I'd like to see -0.185 and 0.185.
You need to start by thinking about bash quoting, since it is bash which interprets the argument to awk which will be the awk program. Inside double-quoted strings, bash expands $0 to the name of the bash executable (or current script); that's almost certainly not what you want, since it will not be a quoted string. In fact, you almost never want to use double quotes around the awk program argument, so you should get into the habit of writing awk '...'.
Also, awk regular expressions don't understand \s (although Gnu awk will handle that as an extension). And match returns the position of the match, which I don't think you care about either.
Since by default, awk considers any sequence of whitespace a field separator, you don't really need to play any games to get the fourth column. Just use awk '{print $4}'
Why not just use this simple awk
awk '$0=$4' Data.txt
-0.185
0.185
It sets $0 to value in $4 and does the default action, print.
PS do not use cat with program that can read data itself, like awk
In case of filed 4 containing 0, you can make it more robust like:
awk '{$0=$4}1' Data.txt
If you're trying to split the input according to 3 or 4 spaces then you will get the expected output only from column 3.
$ awk -v FS=" {3,4}" '{print $3}' file
-0.185
0.185
FS=" {3,4}" here we pass a regex as FS value. This regex get parsed and set the Field Separator value to three or four spaces. In regex {min,max} called range quantifier which repeats the previous token from min to max times.