awk - parse text having same character in fields as delimiter - regex

Consider this source:
field1;field2;"data;data field3";field4;"data;data field5";field6
field1;"data;data field2";field3;field4;field5;"data;data field6"
As you can see, the field delimiter is being used inside certain fields, enclosed between ". I cannot directly parse with awk because there is no way of avoiding unwanted splitting, at least I haven't found a way. Moreover, those special fields have a variable position within a line and they can occur once, twice, 4 times etc.
I thought of a solution involving a pre-parsing step, where I replace the ; in those fields with a code of some sort. The problem is that sed / awk perform greedy REGEX match. So in the above example, I can only replace ; within the last field enclosed in quotes in each line.
How can I match each instance of quotes and replace the specific ; within them? I do not want to use perl or python etc.

Using gnu awk you can use special FPAT variable to have a regex for your fields.
You can use this command to replace all ; by | inside the double quotes:
awk -v OFS=';' -v FPAT='"[^"]*"|[^;]*' '{for (i=1; i<=NF; i++) gsub(/;/, "|", $i)} 1' file
field1;field2;"data|data field3";field4;"data|data field5";field6
field1;"data|data field2";field3;field4;field5;"data|data field6"

As an alternative to FPAT you can set the awk FS to be double quotes and then swap out your semicolon delimiter for every other field:
awk -F"\"" '{for(i=1;i<=NF;++i){ if(i%2==0) gsub(/;/, "|", $i)}} {print $0}' yourfile
Here awk is:
Splitting the record by double quote (-F"\"")
Looping through each field that it finds ({for(i=1;i<=NF;++i))
Testing the field ordinal's mod 2 if it's 0 (if(i%2==0))
If it's even then it swaps out the semicolons with pipes (gsub(/;/, "|", $i))
Prints out the transformed record ({print $0})

Related

Concatenate urls based on result of two columns

I would like to first take out of the string in the first column parenthesis which I can do with:
awk -F"[()]" '{print $2}'
Then, concatenate it with the second column to create a URL with the following format:
"https://ftp.drupal.org/files/projects/"[firstcolumn stripped out of parenthesis]-[secondcolumn].tar.gz
With input like:
Admin Toolbar (admin_toolbar) 8.x-2.5
Entity Embed (entity_embed) 8.x-1.2
Views Reference Field (viewsreference) 8.x-2.0-beta2
Webform (webform) 8.x-5.28
Data from the first line would create this URL:
https://ftp.drupal.org/files/projects/admin_toolbar-8.x-2.5.tar.gz
Something like
sed 's!^[^(]*(\([^)]*\))[[:space:]]*\(.*\)!https://ftp.drupal.org/files/projects/\1-\2.tar.gz!' input.txt
If a file a has your input, you can try this:
$ awk -F'[()]' '
{
split($3,parts," *")
printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz\n", $2, parts[2]
}' a
https://ftp.drupal.org/files/projects/admin_toolbar-8.x-2.5.tar.gz
https://ftp.drupal.org/files/projects/entity_embed-8.x-1.2.tar.gz
https://ftp.drupal.org/files/projects/viewsreference-8.x-2.0-beta2.tar.gz
https://ftp.drupal.org/files/projects/webform-8.x-5.28.tar.gz
The trick is to split the third field ($3). Based on your field separator ( -F'[()]'), the third field contains everything after the right paren. So, split can be used to get rid of all the spaces. I probably should have searched for an awk "trim" equivalent.
In the example data, the second last column seems to contain the part with the parenthesis that you are interested in, and the value of the last column.
If that is always the case, you can remove the parenthesis from the second last column, and concat the hyphen and the last column.
awk '{
gsub(/[()]/, "", $(NF-1))
printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz%s", $(NF-1), $NF, ORS
}' file
Output
https://ftp.drupal.org/files/projects/admin_toolbar-8.x-2.5.tar.gz
https://ftp.drupal.org/files/projects/entity_embed-8.x-1.2.tar.gz
https://ftp.drupal.org/files/projects/viewsreference-8.x-2.0-beta2.tar.gz
https://ftp.drupal.org/files/projects/webform-8.x-5.28.tar.gz
Another option with a regex and gnu awk, using match and 2 capture groups to capture what is between the parenthesis and the next field.
awk 'match($0, /^[^()]*\(([^()]+)\)\s+(\S+)/, ary) {
printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz%s", ary[1], ary[2], ORS
}' file
This might work for you (GNU sed):
sed 's#.*(#https://ftp.drupal.org/files/projects/#;s/)\s*/-/;s/\s*$/.tar.gz/' file
Pattern match, replacing the unwanted parts by the required strings.
N.B. The use of the # as a delimiter for the substitution command to avoid inserting back slashes into the literal replacement.
The above solution could be ameliorated into:
sed -E 's#.*\((.*)\)\s*(\S*).*#https://ftp.drupal.org/files/projects/\1-\2.tar.gz#' file

Regex replacement for SQL using sed

I have a file containing many SQL statements and need to add escape characters, using SED, for single quotes withing the SQL statements. Consider the following:
INSERT INTO MYTABLE VALUES (1,'some text','Drink at O'Briens');
In the above we need to escape the single quote in O'Briens. Using regex I can find the string using [a-zA-Z ]'[a-zA-Z ].
So this will find the 3 characters of interest, however if I do the following sed command:
sed -i "s/[a-zA-Z ]'[a-zA-Z ]/''/g" file.sql
This, however, removes the O and the B so I end up with:
INSERT INTO MYTABLE VALUES (1,'some text','Drink at ''riens');
How do I isolate/reference the O and the B so the string becomes:
INSERT INTO MYTABLE VALUES (1,'some text','Drink at O''Briens');
Use capture groups to copy parts of the input to the result.
sed -r -i "s/([a-zA-Z ])'([a-zA-Z ])/\1''\2/g" file.sql
You could do this in awk. Simple explanation would be, perform substitution on last field of line, where substitute ' with 2 instances of ' and print the line then.
awk '{sub(/\047/,"&&",$NF)} 1' Input_file
Above code will only print the lines in output, in case you want to perform inplace save then try following.
awk '{sub(/\047/,"&&",$NF)} 1' Input_file > temp && mv temp Input_file

Remove a specific part of string split by multiple-charater delimiter in bash

I'm trying to process my string and steps as below:
mytext='\\[[123 one (/)\n\\[[456 two (/)\n\\[[789 three (/)'
myvar=456
And I want to remove a part of string (and delimiter) that match myvar after split by delimiter \n, the result should be
\\[[123 one (/)\n\\[[789 three (/)
But I still not find out the solution.
If the delimiter is a single character like :, I can be done with sed command:
mytext2='\\[[123 one (/):\\[[456 two (/):\\[[789 three (/)'
myvar2=456
echo $mytext2 | sed -E "s/[^:]*${myvar}[^:]*(:|$)//g"
Result as expected: \\[[123 one (/):\\[[789 three (/)
How can be done if delimiter is a multiple-character in this case?
Thanks.
Even if you use sed -E, you still lack some support (?<=, ?!, ?=, ? etc). I suggest you use perl (Perl Compatible Regular Expressions).
mytext='\\[[123 one (/)\n\\[[456 two (/)\n\\[[789 three (/)';
myvar=456;
echo $mytext | perl -pe "s/(?<=\\\n).*${myvar}.*?(\\\n|$)//g";
Details:
(?<=\\\n): string starts after \n. \\ escape character \
.*${myvar}.*?(\\\n|$): get string which contains value of variable myvar and ends with \n or end of line.
Result.
\\[[123 one (/)\n\\[[789 three (/)
If you can find another delimiter to be used for example, :, you can first replace it with :,
echo $mytext | tr '\n' ':' | sed -E "s/[^:]*${myvar}[^:]*(:|$)//g"
The full script looks like this
mytext='\\[[123 one (/):\\[[456 two (/):\\[[789 three (/)'
myvar=456
v=$(echo $mytext | tr '\n' ':' | sed -E "s/[^:]*${myvar}[^:]*(:|$)//g")
echo $v
If the multiple character delimiter is something like abcd, you can try to use sed to replace it first instead of using tr.
I sugges awk in this case since you may specify the literal multichar delimiter pattern as the input/output field separator, iterate over the fields and discard all those not matching your value:
mytext='\\[[123 one (/)\n\\[[456 two (/)\n\\[[789 three (/)'
myvar=456
awk -v myvar=$myvar 'BEGIN{FS=OFS="\\\\n"} {s="";
for (i=1; i<=NF; i++) {
if ($i !~ myvar) {s = s (i==1 ? "" : OFS) $i;}
}
} END{print s}' <<< "$mytext"
# => \\[[123 one (/)\\n\\[[789 three (/)
See the online awk demo.
NOTES:
BEGIN{FS=OFS="\\\\n"} - sets the input/output field separator to \n
-v myvar=$myvar passes the myvar to awk
s="" - assigns s to an empty string
for (i=1; i<=NF; i++) {...} - iterates over all fields
if ($i !~ myvar) {...} - if the current field value matches myvar...
s = s (i==1 ? "" : OFS) $i;} - append either the current field value to s (if is the first field) or output separator and the current field value (if it is not the first)
END{print s} - prints s after the field checks.

awk Regular Expression (REGEX) get phone number from file

The following is what I have written that would allow me to display only the phone numbers
in the file. I have posted the sample data below as well.
As I understand (read from left to right):
Using awk command delimited by "," if the first char is an Int and then an int preceded by [-,:] and then an int preceded by [-,:]. Show the 3rd column.
I used "www.regexpal.com" to validate my expression. I want to learn more and an explanation would be great not just the answer.
GNU bash, version 4.4.12(1)-release (x86_64-pc-linux-gnu)
awk -F "," '/^(\d)+([-,:*]\d+)+([-,:*]\d+)*$/ {print $3}' bashuser.csv
bashuser.csv
Jordon,New York,630-150,7234
Jaremy,New York,630-250-7768
Jordon,New York,630*150*7745
Jaremy,New York,630-150-7432
Jordon,New York,630-230,7790
Expected Output:
6301507234
6302507768
....
You could just remove all non int
awk '{gsub(/[^[:digit:]]/, "")}1' file.csv
gsub remove all match
[^[:digit:]] the ^ everything but what is next to it, which is an int [[:digit:]], if you remove the ^ the reverse will happen.
"" means remove or delete in awk inside the gsub statement.
1 means print all, a shortcut for print
In sed
sed 's/[^[:digit:]]*//g' file.csv
Since your desired output always appears to start on field #3, you can simplify your regrex considerably using the following:
awk -F '[*,-]' '{print $3$4$5}'
Proof of concept
$ awk -F '[*,-]' '{print $3$4$5}' < ./bashuser.csv
6301507234
6302507768
6301507745
6301507432
6302307790
Explanation
-F '[*,-]': Use a character class to set the field separators to * OR , OR -.
print $3$4$5: Concatenate the 3rd through 5th fields.
awk is not very suitable because the comma occurs not only as a separator of records, better results will give sed:
sed 's/[^,]\+,[^,]\+,//;s/[^0-9]//g;' bashuser.csv
first part s/[^,]\+,[^,]\+,// removes first two records
second part //;s/[^0-9]//g removes all remaining non-numeric characters

how to define a space in a regular expression (in awk)?

I want to print the texts inside of " ". for example I have the following strings:
gfdg "jkfgh" "jkfd fdgj fd-" ghjhgj
gfggf "kfdjfdgfhbg" "fhfghg" jhgj
jhfjhg "dfgdf" fgf
fgfdg "dfj jfdg jhfgjd" "hfgdh jfdhgd jkfghfd" hgjghj
And I want to print only the following:
"jkfgh" "jkfd fdgj fd-"
"kfdjfdgfhbg" "fhfghg"
"dfgdf"
"dfj jfdg jhfgjd" "hfgdh jfdhgd jkfghfd"
I have tried awk with the following regular expression:
awk '{for(i = 1; i <= NF; i++) if($i ~ /^\"[A-Za-z.$]*([A-Za-z.$][[:space:]]*[A-Za-z.$])*\"$/) print $i}' sample.txt
but it prints everything before space and actually does not recognize the spaces I have defined in my regular expression. My current output is:
"jkfgh"
"kfdjfdgfhbg" "fhfghg"
"dfgdf"
"dfj
as you can see, only the ones without any space are printed correctly.
I have also tried [[:blank:]], \t and also ' ' but did not work.
I appreciate if someone can tell me how to change this regular expression and include space.
The question's title is misleading and based on a fundamental misconception about awk.
The naïve answer is that a space can simply be represented as itself (a literal) in regular expressions in awk.
More generally, you can use [[:space:]] to match a space, a tab or a newline (GNU Awk also supports \s), and [[:blank:]] to match a space or a tab.
However, the crux of the problem is that Awk, by default, splits each input line into fields by whitespace, so that, by definition, no input field itself contains whitespace, so any attempt to match a space in a field value will invariably fail.
The input at hand has fields that are a mix of unquoted and quoted strings, but POSIX Awk has no support for recognizing quoted strings as fields.
#fedorqui has made a valiant attempt to work around the problem by splitting input into fields by double quotes, but it's no substitute for proper recognition of quoted strings, because it doesn't preserve the true field boundaries.
If you have GNU Awk, you can approximate recognition of quoted strings using the special FPAT variable, which, rather than defining a separator to split lines by, allows defining a regex that describes fields (and ignores tokens not recognized as such):
re='[[:alpha:]][[:alpha:] ]*[[:alpha:]]' # aux. shell variable
gawk -v FPAT="\"$re\"|'$re'" '{
for(i=1;i<=NF;++i) printf "%s%s", $i, (i==NF ? "\n" : " ")
}' sample.txt
This will work with single- and double-quoted strings.
Explanation:
FPAT="\"$re\"|'$re'" defines fields to be either double- or single-quoted strings consisting only of letters and spaces, with at least one letter on either end (as in the OP's code).
Note that this automatically excludes the UNquoted tokens on each input line - they will not be reflected in $1, ... and NF.
Therefore, the loop for(i=1;i<=NF;++i) is already limited to enumerating only the matching fields.
Note that, generally, the restrictions placed on the contents of the quoted strings in this case luckily bypass limitations inherent in this approach, namely the inability to deal with escaped nested quotes (of the same type).
If this limitation is acceptable, you can use the following idiom to tokenize input that is a mix of barewords (unquoted tokens) and quoted strings:
gawk -v "FPAT=[^[:blank:]]+|\"[^\"]*\"|'[^']*'" ...
You are just getting those without any space because you loop through fields and they are space separated. Thus, you need to change the approach to something handling the spaces differently. Assuming there are no nested quotes, you can use for example:
awk -F'"' '{for (i=2;i<NF;i+=2) printf "\"%s\"", $i; print ""}' file
That is, use " as field separator and print the even fields.
This is equivalent to using FS more elegantly:
awk -F'"' '{for (i=2;i<NF;i+=2) printf "%s%s%s", FS, $i, FS; print ""}' file
Note in the previous approaches the output has no space in between fields. If you need it, you can use:
awk -F'"' '{for (i=2;i<NF;i+=2) printf "%s%s%s%s", FS, $i, FS, (i>NF-2?"\n":" ")}' file
The trick (i>NF-2?"\n":" ") is a matter of printing the whole field together with a separator. If we are in the last field, we set it as new line; otherwise, as a space. More idiomatically, you can also say (i>NF-2?RS:OFS) using the default values of RS (record separator, new line) and OFS (output field separator, space).
Test
$ awk -F'"' '{for (i=2;i<NF;i+=2) printf "%s%s%s%s", FS, $i, FS, (i>NF-2?"\n":" ")}' file
"jkfgh" "jkfd fdgj fd-"
"kfdjfdgfhbg" "fhfghg"
"dfgdf"
"dfj jfdg jhfgjd" "hfgdh jfdhgd jkfghfd"