How can I set the delimiter as an argument in the match() function? - regex

awk '{while(match($0,/("[^"]+",|[^,]*,|([^,]+$))/,a)){
$0=substr($0,RSTART+RLENGTH);b[++x]=a[0]}
print b[1] b[4];x=0}' file
I want to understand the match() call and make it dynamic, so that it takes the delimiter as an argument instead of hard-coding it as a comma.
I tried this, but it didn't work, as I don't have the background for this function:
awk -v dl '{while(match($0,/("[^"]+"dl|[^,]*dl|([^,]+$))/,a)){
$0=substr($0,RSTART+RLENGTH);b[++x]=a[0]}
print b[1] b[4];x=0}' file
Input File data:
a,b,c,"d,e,f",
"a,b",c,d,"e,f",
p,q,r,"s,u",
Desired output (for example, when the 4th field is requested):
d,e,f
e,f
s,u
Desired output (when the 5th field is requested, it should generate 3 rows with blank values):
Here the delimiter can be anything (comma, pipe, ...) and the desired field number is also dynamic, which is why I want to pass both the field number and the delimiter as arguments.
The field number argument works fine, but the delimiter argument does not.
As suggested by Anubhava, I used FPAT, which works really well, but it does not output any rows when fetching the 5th column from the input file.

Using GNU awk, you can set the FPAT variable, a regular expression that describes what a field looks like rather than what separates fields.
awk -v FPAT='"[^"]*"|[^,]*' '{gsub(/"/, "", $4); print $4}' file
d,e,f
e,f
s,u
Running it from a shell script that takes delimiter as argument:
dl="${1?}"
awk -v FPAT='"[^"]*"|[^'"$dl"']*' '{gsub(/"/, "", $4); print $4}' "${2?}"
Then run it as:
bash p.sh ',' 'file'

That regex was strange, so I rewrote it. The regex:
/("[^"]+",|[^,]*,|([^,]+$))/
the "[^"]+" is parsed like - first " and last " are quotation marks and [^"]+ matches everything except quotes. So it's the same as:
"([^\"]+,|[^,]*,|([^,]+$))"
I guess that you want to match an unquoted field [^,]+ or a quoted field \"[^\"]+\", followed by a delimiter or the end of the line (,|$). So match that. Then use capture groups to match the inside of the field, \"([^\"]+)\" for a quoted field or ([^,]+) for an unquoted one, and use those groups:
awk -v dl=, '{
x = 0;
while (match($0, "^(\"([^\"]+)\"|([^" dl "]+))(" dl "|$)", a)) {
$0 = substr($0, RSTART + RLENGTH);
b[++x] = a[2] a[3]; # funny, one of them will be empty
}
print b[4];
}' <<EOF
a,b,c,"d,e,f"
"a,b",c,d,"e,f"
p,q,r,"s,u"
EOF
d,e,f
e,f
s,u
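The match() answer above splices the delimiter into the regex outside a bracket expression, so a delimiter such as | would act as alternation, and the loop stops at the first empty field because [^,]+ needs at least one character. As a minimal sketch that sidesteps both issues, assuming a single-character delimiter and no escaped quotes inside quoted fields, the line can be scanned character by character in plain POSIX awk. The function name get_field and its argument order are made up for illustration:

```shell
# Hypothetical helper: get_field DELIM N
# Prints field N of every input line, splitting on the single character DELIM
# and treating double-quoted text as part of one field (quotes are dropped).
get_field() {
  awk -v dl="$1" -v want="$2" '
  {
    n = 1; field = ""; inq = 0
    for (i = 1; i <= length($0); i++) {
      c = substr($0, i, 1)
      if (c == "\"")            inq = !inq                        # enter/leave quoted text
      else if (c == dl && !inq) { out[n++] = field; field = "" }  # unquoted delimiter ends a field
      else                      field = field c
    }
    out[n] = field                                                # last field of the line
    print (want <= n ? out[want] : "")                            # blank when the field is missing
  }'
}
```

With the question's input, get_field , 4 < file should print the three quoted values, and get_field , 5 < file should print three blank lines, since the trailing comma leaves an empty 5th field.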

Related

Concatenate urls based on result of two columns

I would like to first extract the text inside the parentheses in the first column, which I can do with:
awk -F"[()]" '{print $2}'
Then, concatenate it with the second column to create a URL with the following format:
"https://ftp.drupal.org/files/projects/"[firstcolumn stripped out of parenthesis]-[secondcolumn].tar.gz
With input like:
Admin Toolbar (admin_toolbar) 8.x-2.5
Entity Embed (entity_embed) 8.x-1.2
Views Reference Field (viewsreference) 8.x-2.0-beta2
Webform (webform) 8.x-5.28
Data from the first line would create this URL:
https://ftp.drupal.org/files/projects/admin_toolbar-8.x-2.5.tar.gz
Something like
sed 's!^[^(]*(\([^)]*\))[[:space:]]*\(.*\)!https://ftp.drupal.org/files/projects/\1-\2.tar.gz!' input.txt
If a file a has your input, you can try this:
$ awk -F'[()]' '
{
split($3, parts, / +/)  # $3 starts with spaces, so parts[1] is empty and parts[2] is the version
printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz\n", $2, parts[2]
}' a
https://ftp.drupal.org/files/projects/admin_toolbar-8.x-2.5.tar.gz
https://ftp.drupal.org/files/projects/entity_embed-8.x-1.2.tar.gz
https://ftp.drupal.org/files/projects/viewsreference-8.x-2.0-beta2.tar.gz
https://ftp.drupal.org/files/projects/webform-8.x-5.28.tar.gz
The trick is to split the third field ($3). With the field separator -F'[()]', the third field contains everything after the right parenthesis, so split() can be used to get rid of the leading spaces. I probably should have searched for an awk "trim" equivalent.
In the example data, the second-to-last column contains the parenthesized part that you are interested in, and the last column contains the version.
If that is always the case, you can remove the parentheses from the second-to-last column and join it to the last column with a hyphen.
awk '{
gsub(/[()]/, "", $(NF-1))
printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz%s", $(NF-1), $NF, ORS
}' file
Output
https://ftp.drupal.org/files/projects/admin_toolbar-8.x-2.5.tar.gz
https://ftp.drupal.org/files/projects/entity_embed-8.x-1.2.tar.gz
https://ftp.drupal.org/files/projects/viewsreference-8.x-2.0-beta2.tar.gz
https://ftp.drupal.org/files/projects/webform-8.x-5.28.tar.gz
Another option uses a regex with GNU awk: match() with two capture groups captures what is between the parentheses and the field that follows.
awk 'match($0, /^[^()]*\(([^()]+)\)\s+(\S+)/, ary) {
printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz%s", ary[1], ary[2], ORS
}' file
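The three-argument match() is a gawk extension. A rough equivalent for other awks, assuming each line contains one parenthesized name and the version is the last whitespace-separated field, could rely on RSTART and RLENGTH instead. The function name drupal_urls is only for illustration:

```shell
# Hypothetical helper: reads "Title (machine_name) version" lines on stdin
# and prints one download URL per line. Lines without parentheses are skipped.
drupal_urls() {
  awk '
  match($0, /\([^()]+\)/) {
    # RSTART/RLENGTH cover "(name)"; strip the two parentheses
    name = substr($0, RSTART + 1, RLENGTH - 2)
    printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz\n", name, $NF
  }'
}
```

It would be called as drupal_urls < file.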
This might work for you (GNU sed):
sed 's#.*(#https://ftp.drupal.org/files/projects/#;s/)\s*/-/;s/\s*$/.tar.gz/' file
This matches the unwanted parts and replaces them with the required strings.
N.B. Using # as the delimiter of the substitution command avoids having to backslash-escape the slashes in the literal replacement.
The above solution could be condensed into:
sed -E 's#.*\((.*)\)\s*(\S*).*#https://ftp.drupal.org/files/projects/\1-\2.tar.gz#' file

awk Regular Expression (REGEX) get phone number from file

The following is what I have written to display only the phone numbers
in the file. I have posted the sample data below as well.
As I understand it (reading from left to right):
Using awk with "," as the delimiter: if the line starts with digits, followed by a group of digits preceded by one of [-,:*], optionally followed by more such groups, print the 3rd column.
I used "www.regexpal.com" to validate my expression. I want to learn more, so an explanation would be great, not just the answer.
GNU bash, version 4.4.12(1)-release (x86_64-pc-linux-gnu)
awk -F "," '/^(\d)+([-,:*]\d+)+([-,:*]\d+)*$/ {print $3}' bashuser.csv
bashuser.csv
Jordon,New York,630-150,7234
Jaremy,New York,630-250-7768
Jordon,New York,630*150*7745
Jaremy,New York,630-150-7432
Jordon,New York,630-230,7790
Expected Output:
6301507234
6302507768
....
You could just remove all non-digit characters:
awk '{gsub(/[^[:digit:]]/, "")}1' file.csv
gsub() replaces every match in the line.
[^[:digit:]] matches any character that is not a digit: the leading ^ negates the class [[:digit:]]. Without the ^, the digits would be removed instead.
"" is the replacement; an empty string means the matches are deleted.
1 is an always-true pattern, a shortcut for {print}.
In sed
sed 's/[^[:digit:]]*//g' file.csv
Since your desired output always starts at field 3, you can simplify your regex considerably by using the following:
awk -F '[*,-]' '{print $3$4$5}'
Proof of concept
$ awk -F '[*,-]' '{print $3$4$5}' < ./bashuser.csv
6301507234
6302507768
6301507745
6301507432
6302307790
Explanation
-F '[*,-]': Use a character class to set the field separators to * OR , OR -.
print $3$4$5: Concatenate the 3rd through 5th fields.
awk is not very suitable here because the comma occurs not only as a field separator but also inside some phone numbers; sed gives better results:
sed 's/[^,]\+,[^,]\+,//;s/[^0-9]//g;' bashuser.csv
The first part, s/[^,]\+,[^,]\+,//, removes the first two fields.
The second part, s/[^0-9]//g, removes all remaining non-numeric characters.
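For completeness, the same two steps can be written in awk too. POSIX awk regular expressions have no \d shorthand, which is probably why the pattern in the question matched nothing; [0-9] or [[:digit:]] is needed instead. A minimal sketch, assuming the name and city fields never contain a comma (the function name phone_digits is invented here):

```shell
# Hypothetical helper: reads "name,city,phone" lines on stdin and prints
# only the digits of the phone part, whatever separators it uses.
phone_digits() {
  awk '{
    sub(/^[^,]*,[^,]*,/, "")   # drop the first two comma-separated fields
    gsub(/[^0-9]/, "")         # delete everything that is not a digit
    print
  }'
}
```

It would be called as phone_digits < bashuser.csv.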

awk: Use gensub to substitute multiple lines from a paragraph record

I have an input file with multiple paragraphs separated by at least two newlines (\n\n), and I want to extract fields from lines within certain paragraphs. I think the processing will be simplest if I can get gensub to work the way I'm hoping. Consider the following input file:
[Record R1]
Var1=0
Var2=20
Var3=5

[Record R2]
Var1=10
Var3=9
Var4=/var/tmp/
Var2=12

[Record R3]
Var1=2
Var3=5
Var5=19
I want to print only the value of Var2 from records R1 and R3 (Var2 doesn't actually exist in R3). I can easily group all of the variables into their corresponding record by setting RS="\n\n", so that they are all contained within $0. But since I don't know ahead of time where Var2 will appear in the list, I want to use something like gensub to extract it. This is what I have so far:
awk '
BEGIN {
RS="\n\n"
}
/Record R1/ || /Record R3/ {
print gensub(/[\n.]*Var2=(.*)[\n.]*/, "\\1", "g", $0)
}
' /tmp/input.txt
But instead of only printing 20 (the value of Var2 from R1), it prints the following:
[Record R1]
Var1=0
20
Var3=5
[Record R3]
Var1=2
Var3=5
Var5=19
The intent is that the regex in the gensub command would capture all characters (newlines: \n; and non-newlines: .) before and after Var2=XX and replace everything with XX. But instead, it's only capturing the characters on the same line as Var2=XX. Can awk's gensub do this kind of multi-line substitution?
I know an alternative would be to loop over all the fields in the record, then split the field that matches Var2= on the = sign, but that feels less efficient as I scale this out to multiple variables.
I don't understand what it is you're trying to do with gensub(), but here is how to do what you seem to want in any awk:
awk -F'[][[:space:]=]+' '{f[$2]=$3} !NF{if (f["Record"]~/^R[12]$/) print f["Var2"]; delete f}' file
20
12
awk -F'[][[:space:]=]+' '{f[$2]=$3} !NF{if (f["Record"]~/^R[13]$/) print f["Var2"]; delete f}' file
20
By the way, gensub() doesn't care whether the string it's operating on is one line or many lines: \n is just one more character, no different from any other character.
Oh, hang on, now I see what you're thinking with that gensub(). Your problems are:
[\n.]* means zero or more newlines or periods. You don't have any periods in your input, so it's the same as \n*, but you don't have any newlines immediately before Var2 either.
Var2 doesn't exist in the second of your target records, so the regexp can't match there.
The (.*) will match everything to the end of the record (leftmost-longest matching).
The "g" is misleading since you only expect one match.
So using gensub() on multi-line text isn't the issue; your regexp is just wrong.
Another awk solution:
$ awk -v RS= '/\[Record R[13]\]/{for(i=2;i<=NF;i++)
{v=sub(/ *Var2=/,"",$i);
if(v) print $i}}' file
20
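Since the question mentions scaling this out to several variables, here is a minimal sketch that avoids gensub() altogether: read each paragraph as one record, treat every line as a field, and split each line at its first = sign. It assumes records are separated by blank lines and start with a "[Record NAME]" line; the function name record_var and its two arguments are hypothetical:

```shell
# Hypothetical helper: record_var NAMES KEY
# NAMES is an ERE alternation of record names (e.g. 'R1|R3'), KEY a variable name.
# Prints the value of KEY for every selected record that defines it.
record_var() {
  awk -v RS= -v FS='\n' -v names="$1" -v key="$2" '
  {
    rec = $1
    gsub(/^\[Record |\]$/, "", rec)           # "[Record R1]" -> "R1"
    if (rec !~ ("^(" names ")$")) next        # not one of the requested records
    for (i = 2; i <= NF; i++) {               # every remaining line is NAME=VALUE
      eq = index($i, "=")
      if (eq && substr($i, 1, eq - 1) == key) print substr($i, eq + 1)
    }
  }'
}
```

For the sample input, record_var 'R1|R3' Var2 < /tmp/input.txt should print only 20, because R3 has no Var2.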

Replace non-alphanumeric characters in substring

I am trying to replace any non-alphanumeric characters present in the first part (before the = sign) of a bunch of key value pairs, by a _:
Input
aa:cc:dd=foo-bar|17657V70YPQOV
ee-ff/gg=barFOO
Desired Output
aa_cc_dd=foo-bar|17657V70YPQOV
ee_ff_gg=barFOO
I have tried patterns such as: s/\([^a-zA-Z]*\)=\(.*\)/\1=\2/g without much success. Any basic GNU/Linux tools can probably be used.
With awk
$ awk -F= -v OFS='=' '{gsub("[^a-zA-Z]", "_", $1)} 1' ip.txt
aa_cc_dd=foo-bar|17657V70YPQOV
ee_ff_gg=barFOO
The input and output field separators are both set to =, and gsub("[^a-zA-Z]", "_", $1) then replaces every non-alphabetic character with _ in the first field only. Use [^[:alnum:]] instead if keys may contain digits that should be kept.
With perl
$ perl -pe 's/^[^=]+/$&=~s|[^a-z]|_|gir/e' ip.txt
aa_cc_dd=foo-bar|17657V70YPQOV
ee_ff_gg=barFOO
^[^=]+ non = characters from start of line
$&=~s|[^a-z]|_|gir replace non-alphabet characters with _ only for the matched portion
Use perl -i -pe for inplace editing
Assuming your input is in a file called infile, you could do this:
while IFS== read -r key value; do
printf '%s=%s\n' "${key//[![:alnum:]]/_}" "${value}"
done < infile
with the output
aa_cc_dd=foo-bar|17657V70YPQOV
ee_ff_gg=barFOO
This sets the IFS variable to = and reads your key/value pairs line by line into a key and a value variable.
The printf command prints them and adds the = back in; "${key//[![:alnum:]]/_}" substitutes all non-alphanumeric characters in key by an underscore.
With any POSIX-compliant awk:
$ cat f
aa:cc:dd=foo-bar|17657V70YPQOV
ee-ff/gg=barFOO
$ awk 'BEGIN{FS=OFS="="}gsub(/[^[:alnum:]]/,"_",$1)+1' f
aa_cc_dd=foo-bar|17657V70YPQOV
ee_ff_gg=barFOO
Explanation
BEGIN{FS=OFS="="} sets the input and output field separators to =.
/[^[:alnum:]]/ matches a single character not present in the list; [:alnum:] matches an alphanumeric character, [a-zA-Z0-9].
gsub(REGEXP, REPLACEMENT, TARGET) is similar to the sub function, except that gsub replaces all of the longest, leftmost, nonoverlapping matching substrings it can find. The g in gsub stands for global, which means replace everywhere. The gsub function returns the number of substitutions made.
+1 makes the pattern true, so the default action {print $0} still runs when gsub returns 0.
Thought I would throw a little ruby in:
ruby -pe '$_.sub!(/[^=]+/){|m| m.gsub(/[^[:alnum:]]/,"_")}'

get the last word in body of text

Given a body of text that can span a varying number of lines, I need to use a grep, sed or awk solution to search through many files for the same pattern and get the last word in the body.
A file can include formats such as these where the word I want can be named anything
call function1(input1,
input2, #comment
input3) #comment
returning randomname1,
randomname2,
success3

call function1(input1,
input2,
input3)
returning randomname3,
randomname2,
randomname3

call function1(input1,
input2,
input3)
returning anothername3,
randomname2, anothername3
I need to print out results as
success3
randomname3
anothername3
I also need the filename and line number of each match.
I've tried
pcregrep -M 'function1.*(\s*.*){6}(\w+)$' filename.txt
which is too greedy and I still need to print out just the specific grouped value and not the whole pattern. The words function1 and returning in my sample code will always be named as this and can be hard coded within my expression.
Last word of code blocks
Split the file into blocks using awk's record separator RS. A record is a block of text; records are separated by double newlines.
A record consists of fields; consecutive fields are separated by whitespace or a single newline.
Now all we have to do is print the last field of each record, resulting in the following code:
awk 'BEGIN{ FS="[\n\t ]"; RS="\n\n"} { print $NF }' file
Explanation:
FS is the field separator, set to a newline, a tab or a space: [\n\t ].
RS is the record separator, set to a double newline: \n\n.
print $NF prints the field whose index is NF, the variable holding the number of fields; hence it prints the last field.
Note: To capture all paragraphs, the file should end in a double newline. This can be achieved by preprocessing the file using: $ echo -e '\n\n' >> file.
Alternate solution based on comments
A more elegant and simple solution is as follows:
awk -v RS='' '{ print $NF }' file
How about the following awk solution:
awk 'NF == 0 {if(last) print last; last=""} NF > 0 {last=$NF} END {print last}' file
$NF is the last "word" of the line, where NF stands for the number of fields. The last variable always stores the last word of the most recent non-empty line, and it is printed when an empty line, marking the end of a paragraph, is encountered.
A new version that also requires the block to match function1:
awk 'NF == 0 {if(last && hasF) print last; last=hasF=""}
NF > 0 {last=$NF; if(/function1/)hasF=1}
END {if(hasF) print last}' filename.txt
This will produce the output you show from the input file you posted:
$ awk -v RS= '{print $NF}' file
success3
randomname3
anothername3
If you want to print FILENAME and line number like you mention then this may be what you want:
$ cat tst.awk
NF { nr=NR; last=$NF; next }
{ prt() }
END { prt() }
function prt() { if (nr) print FILENAME, nr, last; nr=0 }
$ awk -f tst.awk file
file 6 success3
file 13 randomname3
file 20 anothername3
If that doesn't do what you want, edit your question to provide clearer, more truly representative and accurate sample input and expected output.
This is the perl version of Shellfish's awk solution (plus the keywords):
perl -00 -nE '/function1/ and /returning/ and say ((split)[-1])' file
or, with one regex:
perl -00 -nE '/^(?=.*function1)(?=.*returning).*?(\S+)\s*$/s and say $1' file
But the key is the -00 option which reads the file a paragraph at a time.
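The question also asks for filename and line information across many files, restricted to blocks that call function1. As a rough sketch combining the ideas above, assuming blocks are separated by blank lines, the following uses FNR so that line numbers restart in every file and closes any open block at a file boundary. The function name last_words and the "file:line: word" output format are choices made for this example only:

```shell
# Hypothetical helper: last_words FILE...
# For every blank-line-separated block that mentions function1, prints
# "file:line: word", where word is the last word of the block's last line.
last_words() {
  awk '
  function emit() {
    if (line && hasF) print file ":" line ": " last
    line = 0; hasF = 0
  }
  FNR == 1 { emit() }                       # a new file ends any open block
  NF {                                      # non-blank line: remember its last word
    file = FILENAME; line = FNR; last = $NF
    if (/function1/) hasF = 1
    next
  }
  { emit() }                                # blank line ends the block
  END { emit() }
  ' "$@"
}
```

It could be run as last_words *.txt; the grep-like output format keeps the filename and line number easy to pick out.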