Concatenate URLs based on result of two columns - regex

I would like to first extract the string inside the parentheses in the first column, which I can do with:
awk -F"[()]" '{print $2}'
Then, concatenate it with the second column to create a URL with the following format:
"https://ftp.drupal.org/files/projects/"[firstcolumn stripped out of parenthesis]-[secondcolumn].tar.gz
With input like:
Admin Toolbar (admin_toolbar) 8.x-2.5
Entity Embed (entity_embed) 8.x-1.2
Views Reference Field (viewsreference) 8.x-2.0-beta2
Webform (webform) 8.x-5.28
Data from the first line would create this URL:
https://ftp.drupal.org/files/projects/admin_toolbar-8.x-2.5.tar.gz

Something like
sed 's!^[^(]*(\([^)]*\))[[:space:]]*\(.*\)!https://ftp.drupal.org/files/projects/\1-\2.tar.gz!' input.txt
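For the sample input above, that should produce:
https://ftp.drupal.org/files/projects/admin_toolbar-8.x-2.5.tar.gz
https://ftp.drupal.org/files/projects/entity_embed-8.x-1.2.tar.gz
https://ftp.drupal.org/files/projects/viewsreference-8.x-2.0-beta2.tar.gz
https://ftp.drupal.org/files/projects/webform-8.x-5.28.tar.gz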

If a file a has your input, you can try this:
$ awk -F'[()]' '
{
split($3,parts," *")
printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz\n", $2, parts[2]
}' a
https://ftp.drupal.org/files/projects/admin_toolbar-8.x-2.5.tar.gz
https://ftp.drupal.org/files/projects/entity_embed-8.x-1.2.tar.gz
https://ftp.drupal.org/files/projects/viewsreference-8.x-2.0-beta2.tar.gz
https://ftp.drupal.org/files/projects/webform-8.x-5.28.tar.gz
The trick is to split the third field ($3). Based on your field separator (-F'[()]'), the third field contains everything after the closing parenthesis. So, split can be used to get rid of all the spaces. I probably should have searched for an awk "trim" equivalent.
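For reference, a trim-style variant of the same idea (a sketch using gsub to strip the surrounding whitespace from $3, not part of the original answer) could look like:
awk -F'[()]' '{
gsub(/^[[:space:]]+|[[:space:]]+$/, "", $3)   # trim leading/trailing whitespace from the version field
printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz\n", $2, $3
}' a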

In the example data, the second-to-last field contains the parenthesized part that you are interested in, and the last field contains the version.
If that is always the case, you can remove the parentheses from the second-to-last field and concatenate it with a hyphen and the last field.
awk '{
gsub(/[()]/, "", $(NF-1))
printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz%s", $(NF-1), $NF, ORS
}' file
Output
https://ftp.drupal.org/files/projects/admin_toolbar-8.x-2.5.tar.gz
https://ftp.drupal.org/files/projects/entity_embed-8.x-1.2.tar.gz
https://ftp.drupal.org/files/projects/viewsreference-8.x-2.0-beta2.tar.gz
https://ftp.drupal.org/files/projects/webform-8.x-5.28.tar.gz
Another option is a regex with gnu awk, using match and 2 capture groups to capture what is between the parentheses and the next field.
awk 'match($0, /^[^()]*\(([^()]+)\)\s+(\S+)/, ary) {
printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz%s", ary[1], ary[2], ORS
}' file

This might work for you (GNU sed):
sed 's#.*(#https://ftp.drupal.org/files/projects/#;s/)\s*/-/;s/\s*$/.tar.gz/' file
Pattern match, replacing the unwanted parts with the required strings.
N.B. The # is used as the delimiter for the substitution command to avoid having to escape the forward slashes in the replacement with backslashes.
The above solution could be improved to:
sed -E 's#.*\((.*)\)\s*(\S*).*#https://ftp.drupal.org/files/projects/\1-\2.tar.gz#' file

How can I set the delimiter as an argument in the match() function?

awk '{while(match($0,/("[^"]+",|[^,]*,|([^,]+$))/,a)){
$0=substr($0,RSTART+RLENGTH);b[++x]=a[0]}
print b[1] b[4];x=0}' file
I want to understand the match clause and want to know how I can make it dynamic so that it can take the delimiter as an argument instead of hard-coding it to a comma.
I tried this, but it didn't work, as I don't have the background for this function.
awk -v dl '{while(match($0,/("[^"]+"dl|[^,]*dl|([^,]+$))/,a)){
$0=substr($0,RSTART+RLENGTH);b[++x]=a[0]}
print b[1] b[4];x=0}' file
Input File data:
a,b,c,"d,e,f",
"a,b",c,d,"e,f",
p,q,r,"s,u",
Desired output (for example, the 4th field):
d,e,f
e,f
s,u
Desired output (for example, the 5th field; it should then generate the 3 rows with a blank value):
Here the delimiter can be anything (comma, pipe, ...) and the desired field number is also dynamic; that's why I wanted to pass the field number and the delimiter as arguments.
The field number argument works fine, but the delimiter argument does not.
As suggested by Anubhava, I used FPAT, which works really well, but it does not return any rows when fetching the 5th column from the input file.
Using gnu awk, you can define an FPAT variable, which is a regular expression for matching fields.
awk -v FPAT='"[^"]*"|[^,]*' '{gsub(/"/, "", $4); print $4}' file
d,e,f
e,f
s,u
Running it from a shell script that takes the delimiter as an argument:
dl="${1?}"
awk -v FPAT='"[^"]*"|[^'"$dl"']*' '{gsub(/"/, "", $4); print $4}' "${2?}"
Then run it as:
bash p.sh ',' 'file'
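If the field number should also be passed as an argument (as asked above), the same approach could be extended, for example (a sketch; the fld variable and the extra positional parameter are assumptions, not from the original script):
dl="${1?}"
fld="${2?}"
awk -v FPAT='"[^"]*"|[^'"$dl"']*' -v f="$fld" '{gsub(/"/, "", $f); print $f}' "${3?}"
Then run it as:
bash p.sh ',' 4 file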
That regex was strange, so I rewrote it. The regex:
/("[^"]+",|[^,]*,|([^,]+$))/
the "[^"]+" is parsed like - first " and last " are quotation marks and [^"]+ matches everything except quotes. So it's the same as:
"([^\"]+,|[^,]*,|([^,]+$))"
I guess that you want to match a field [^,]+ or a quoted field \"[^\"]+\", followed by a delimiter or end of line (,|$). So match that, and in the capture groups match the insides of the fields: \"([^\"]+)\" for the quoted field or ([^,]+) for the unquoted field, and then use those capture groups:
awk -v dl=, '{
x = 0;
while (match($0, "^(\"([^\"]+)\"|([^" dl "]+))(" dl "|$)", a)) {
$0 = substr($0, RSTART + RLENGTH);
b[++x] = a[2] a[3]; # funny, one of them will be empty
}
print b[4];
}' <<EOF
a,b,c,"d,e,f"
"a,b",c,d,"e,f"
p,q,r,"s,u"
EOF
d,e,f
e,f
s,u

awk Regular Expression (REGEX) get phone number from file

The following is what I have written to display only the phone numbers in the file. I have posted the sample data below as well.
As I understand it (reading from left to right):
Using the awk command delimited by ",": if the first characters are digits, then digits preceded by [-,:*], and then more digits preceded by [-,:*], show the 3rd column.
I used "www.regexpal.com" to validate my expression. I want to learn more, so an explanation would be great, not just the answer.
GNU bash, version 4.4.12(1)-release (x86_64-pc-linux-gnu)
awk -F "," '/^(\d)+([-,:*]\d+)+([-,:*]\d+)*$/ {print $3}' bashuser.csv
bashuser.csv
Jordon,New York,630-150,7234
Jaremy,New York,630-250-7768
Jordon,New York,630*150*7745
Jaremy,New York,630-150-7432
Jordon,New York,630-230,7790
Expected Output:
6301507234
6302507768
....
You could just remove everything that is not a digit:
awk '{gsub(/[^[:digit:]]/, "")}1' file.csv
gsub removes all matches.
[^[:digit:]]: the ^ negates what follows it, so this matches everything but a digit [[:digit:]]; if you remove the ^, the reverse happens.
"" (the empty replacement string) means delete the match inside the gsub statement.
1 is a true condition with no action, which makes awk print the record; it is a shortcut for print.
In sed
sed 's/[^[:digit:]]*//g' file.csv
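Either command should print just the digits from each line, e.g. for the sample file:
6301507234
6302507768
6301507745
6301507432
6302307790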
Since your desired output always appears to start on field #3, you can simplify your regex considerably using the following:
awk -F '[*,-]' '{print $3$4$5}'
Proof of concept
$ awk -F '[*,-]' '{print $3$4$5}' < ./bashuser.csv
6301507234
6302507768
6301507745
6301507432
6302307790
Explanation
-F '[*,-]': Use a character class to set the field separators to * OR , OR -.
print $3$4$5: Concatenate the 3rd through 5th fields.
awk is not very suitable here because the comma occurs not only as a field separator but also inside the data; sed will give better results:
sed 's/[^,]\+,[^,]\+,//;s/[^0-9]//g;' bashuser.csv
the first part, s/[^,]\+,[^,]\+,//, removes the first two fields
the second part, s/[^0-9]//g, removes all remaining non-numeric characters

awk - parse text having same character in fields as delimiter

Consider this source:
field1;field2;"data;data field3";field4;"data;data field5";field6
field1;"data;data field2";field3;field4;field5;"data;data field6"
As you can see, the field delimiter is being used inside certain fields, enclosed between ". I cannot directly parse with awk because there is no way of avoiding unwanted splitting, at least I haven't found a way. Moreover, those special fields have a variable position within a line and they can occur once, twice, 4 times etc.
I thought of a solution involving a pre-parsing step, where I replace the ; in those fields with a code of some sort. The problem is that sed / awk perform greedy REGEX match. So in the above example, I can only replace ; within the last field enclosed in quotes in each line.
How can I match each instance of quotes and replace the specific ; within them? I do not want to use perl or python etc.
Using gnu awk, you can use the special FPAT variable to define a regex for your fields.
You can use this command to replace all ; by | inside the double quotes:
awk -v OFS=';' -v FPAT='"[^"]*"|[^;]*' '{for (i=1; i<=NF; i++) gsub(/;/, "|", $i)} 1' file
field1;field2;"data|data field3";field4;"data|data field5";field6
field1;"data|data field2";field3;field4;field5;"data|data field6"
As an alternative to FPAT, you can set the awk FS to the double quote and then swap out the semicolon delimiter in every other field:
awk -F"\"" '{for(i=1;i<=NF;++i){ if(i%2==0) gsub(/;/, "|", $i)}} {print $0}' yourfile
Here awk is:
Splitting the record by double quote (-F"\"")
Looping through each field that it finds ({for(i=1;i<=NF;++i))
Testing whether the field number is even (if(i%2==0)); even-numbered fields are the ones inside the quotes
If it is even, swapping out the semicolons for pipes (gsub(/;/, "|", $i))
Printing out the transformed record ({print $0})

regular expression that uses the previous pattern with awk

This is the content of my log file:
INFO consume_end_processor: user:bbbb callee_num:+23455539764806 sid:I374uribbbbb151101030212130 duration:0 result:ok provider:sipouthh.ym.ms
INFO consume_processor: user:bbbb callee_num:+23455539764806 sid:<<"A28udestaniephillips52x151031185754827">> duration:0 result:ok provider:sipouthh.ym.ms
and I need to extract the content from:
sid:<<"A28udestaniephillips52x151031185754827">>
sid:A28udestaniephillips52x151031185754827
i.e. just A28udestaniephillips52x151031185754827
My attempt is awk '/(?<=sid)^[A-Z]+\/{print $8 }', but this is wrong and I am not sure how to fix it.
How can I write my regular expression in awk in order to extract just this part of the information?
Thank you for any help.
$ awk '{ sub(/^sid:(<<")?/,"",$5); sub(/">>$/, "", $5); print $5}' log.txt
I374uribbbbb151101030212130
A28udestaniephillips52x151031185754827
Here we are simply using sub to remove (by replacing with an empty string) the parts of the 5th field that we don't want.
The first sub removes the leading sid:, which may optionally be followed by <<".
The second sub removes a trailing ">>. Note that if there is no trailing ">>, then the sub does nothing and is harmless.
$ awk '{gsub(/sid:(<<")?|">>/,"",$5); print $5}' file
I374uribbbbb151101030212130
A28udestaniephillips52x151031185754827
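Another GNU awk sketch, using match() with a capture group (this assumes the sid value is always alphanumeric):
awk 'match($5, /^sid:(<<")?([[:alnum:]]+)/, a){print a[2]}' log.txt
I374uribbbbb151101030212130
A28udestaniephillips52x151031185754827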

How to fetch the matched items using awk and regexp?

I am trying to parse "/boot/grub/grubenv" but I am really not very good at regexps.
Suppose the content of /boot/grub/grubenv is:
saved_entry=1
I want to output just the number "1". I am currently using "awk" (see below), but I am open to other tools.
$ awk '/^(saved_entry=)([0-9]+)/ {print $2}' /boot/grub/grubenv
But obviously this is not working; thanks for the help.
Specify a field separator with the -F option:
awk -F= '/^saved_entry=/ {print $2}' /boot/grub/grubenv
$1, $2, .. here represent fields (separated by =), not backreferences to capture groups.
If you want to match things, it is probably best to use match! (Note that the three-argument form of match used below is a GNU awk extension.)
This will work even if there are more fields after it, and it does not require changing the field separator (in case you are doing any other stuff with the data).
The only drawback of this method is that it will only match the left-most match in the record, so if the data appears twice in the same record (line), it will only match the first one it finds.
awk 'match($0,/^(saved_entry=)([0-9]+)/,a){print a[2]}' file
Example
input
saved_entry=1 blah blah more stuff
output
1
Explanation
It matches the regex against $0 (the record) and then stores anything in the parenthesized capture groups as separate array elements.
From the example, the array would contain these values:
a[0] is saved_entry=1
a[1] is saved_entry=
a[2] is 1
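If GNU awk's three-argument match() is not available, a portable sketch using RSTART/RLENGTH would be:
awk 'match($0, /^saved_entry=[0-9]+/) {print substr($0, RSTART + 12, RLENGTH - 12)}' /boot/grub/grubenv
Here 12 is the length of "saved_entry=", so the substr() starts right after the = sign and keeps only the digits.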