extract url part of the line - regex

I have an HTML page with many lines, and one of the lines is:
var premium_download_link = 'http://www.someurl.com/';
How can I find that line inside the HTML page and extract http://www.someurl.com from it?

echo "var premium_download_link = 'http://www.someurl.com/'" | awk '{print substr ($4,2,23)}'

Using sed:
sed -n -e "s/.*var premium_download_link = '\([^']*\)';.*/\1/p"
The -n flag suppresses printing unless we explicitly print using p. Thus only matched (then substituted) lines are printed.
EDIT (based on OP comment):
To get this in a shell variable you might want something like:
url=$(wget -qO - "http://originalurl.com/" | sed -n -e "s/.*var premium_download_link = '\([^']*\)';.*/\1/p")
This fetches the page and runs it through sed. The output should be the url, which gets stored in a variable named url.
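To try the sed expression without touching the network, here is a minimal sketch that feeds it a made-up page from a here-document. The page content and the variable names page and url_no_slash are assumptions for illustration only. Since the question asks for the URL without the trailing slash, the last step strips it with a parameter expansion.

```shell
# Hypothetical sample page standing in for the real download
page=$(cat <<'EOF'
<html>
<script>
var other_link = 'http://www.example.org/';
var premium_download_link = 'http://www.someurl.com/';
</script>
</html>
EOF
)

# Same sed expression as above, reading the sample instead of wget output
url=$(printf '%s\n' "$page" |
  sed -n -e "s/.*var premium_download_link = '\([^']*\)';.*/\1/p")

# Drop one trailing slash, if present
url_no_slash=${url%/}

echo "$url"           # http://www.someurl.com/
echo "$url_no_slash"  # http://www.someurl.com
```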

With awk :
awk -F "'" '{ for (f=1; f<=(NF-1)/2; f++) print $(f*2) }' $1
-F "'" defines the single quote ' as the field separator for the input, so every even-numbered field is a quoted string.

With awk you can extract specific field values by defining the field separator variable.
For instance, the following should work -
$ echo "var premium_download_link = 'http://www.someurl.com/';" |
awk -F"'" '{ print $2 }'
http://www.someurl.com/
However, your html file may have other content. So you can add a regex in front of the script to ensure that it runs only when the specific line is encountered.
For example -
awk -F"'" '/premium_download_link/{ print $2 }'
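As a rough illustration of why the guard matters, the sketch below runs the command on a made-up page that contains another single-quoted string. The sample content is an assumption, and the exit is an addition so that only the first matching line is used.

```shell
# Hypothetical page: the href line also contains single quotes
url=$(awk -F"'" '/var premium_download_link = /{ print $2; exit }' <<'EOF'
<a href='http://www.example.org/'>other</a>
var premium_download_link = 'http://www.someurl.com/';
<p>footer</p>
EOF
)

echo "$url"   # http://www.someurl.com/
```

Without the /.../ pattern, awk would also print http://www.example.org/ and an empty line for the footer.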

grep -Po "(?<=premium_download_link = ')[^']+"

Related

Grep password from .my.cnf

I'm trying to grep the (write access) password from .my.cnf, but I haven't managed it yet.
The .my.cnf looks like this:
# longer
# comment text
[clientreadonly]
password=pass1 # comment
port=3306
user=test_ro
socket=/var/lib/mysql/mysql.sock
[client]
password=pass2 # comment
port=3306
user=test
socket=/var/lib/mysql/mysql.sock
and I want to grep pass2. The code shouldn't be too verbose, of course. I ended up with
grep 'password=' ~/.my.cnf | sed -e 's/password=//'
but that leaves the # comment after pass2, and I don't want to match the whole comment text literally (it's long in the original, and hard-coding it would be silly). So I need a regex that captures only pass2.
The main goal is to grep the password so I can easily use it on a shell command line.
$ awk -F'[= ]' '/^password=/ && p !~ /clientreadonly/{print $2} {p=$0}' ~/.my.cnf
pass2
-F'[= ]' use space or = as field separator
/^password=/ && p !~ /clientreadonly/ if line starts with password= and previous line doesn't contain clientreadonly
print $2 print the second field
p=$0 save the previous line in p variable
You can modify your command slightly, like the following (note that both variants print the password of every section, not only [client]) -
grep 'password=' ~/.my.cnf | sed -e 's/password=//' -e 's/ # comment//'
or other way -
grep 'password=' ~/.my.cnf | cut -d' ' -f1 | cut -d'=' -f2
Using perl:
perl -00 -ane '/\[client\].password=(\S+)/s && print $1' < ~/.my.cnf
Output:
pass2
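If the sections can appear in any order, tracking the current section header may be more robust than looking at the previous line. The following is only a sketch under some assumptions: the sample file content is abbreviated from the question, and the password contains neither = nor #.

```shell
# Abbreviated copy of the .my.cnf from the question
cnf=$(cat <<'EOF'
# longer
# comment text
[clientreadonly]
password=pass1 # comment
port=3306
[client]
password=pass2 # comment
port=3306
EOF
)

pass=$(printf '%s\n' "$cnf" | awk -F= '
  /^\[/ { section = $0 }            # remember the current [section] header
  section == "[client]" && $1 == "password" {
    sub(/[ \t]*#.*$/, "", $2)       # drop a trailing comment
    print $2
    exit
  }')

echo "$pass"   # pass2
```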

Capture strings from several sets of quotes

I've been looking for a straight answer to this but haven't found anything on SO or in wider searching that answers this simple question:
I have a string of quoted values, ip addresses in this case, that I want to extract individually to use as values elsewhere. I am intending to do this with sed and regex. The string format is like this:
"10.10.10.101","10.10.10.102","10.10.10.103"
I can capture the values between all quotes using regex such as:
"([^"]*)"
Question is how do I select each group separately so I can use them?
i.e.:
value1 = 10.10.10.101
value2 = 10.10.10.102
value3 = 10.10.10.103
I assume that I need three expressions, but I cannot find how to select a specific occurrence.
Apologies if it's obvious, but I have spent a while searching and testing with no luck...
You can try this bash:
$ str='"10.10.10.101","10.10.10.102","10.10.10.103"'
$ IFS=, read -ra arr <<< "${str//\"/}"
$ echo "${arr[1]}"
10.10.10.102
If you have GNU awk, you can use FPAT to set the pattern for each field:
awk -v FPAT='[0-9.]+' '{ print $1 }' <<<'"10.10.10.101","10.10.10.102","10.10.10.103"'
Replace $1 with $2 or $3 to print whichever value you want.
Since your fields don't contain spaces, you could use a similar method to read the values into an array:
read -ra ips < <(awk -v FPAT='[0-9.]+' '{ $1 = $1 }1' <<<'"10.10.10.101","10.10.10.102","10.10.10.103"')
Here, $1 = $1 makes awk reformat each line, so that the fields are printed with spaces in between.
Using grep -P you can use match reset:
s='"10.10.10.101","10.10.10.102","10.10.10.103"'
arr=($(grep -oP '(^|,)"\K[^"]*' <<< "$s"))
# check array content
declare -p arr
declare -a arr='([0]="10.10.10.101" [1]="10.10.10.102" [2]="10.10.10.103")'
If your grep doesn't support -P (PCRE) flag then use:
arr=($(grep -Eo '[.[:digit:]]+' <<< "$s"))
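A variation on the same idea, sketched here on the assumption that bash 4 or later is available: mapfile reads one match per line into the array, which avoids relying on word splitting of the command substitution. The value1/value2/value3 labels follow the question.

```shell
s='"10.10.10.101","10.10.10.102","10.10.10.103"'

# One grep match per line, one line per array element
mapfile -t arr < <(grep -Eo '[.[:digit:]]+' <<< "$s")

# Print each element with a 1-based label, as in the question
for i in "${!arr[@]}"; do
  printf 'value%d = %s\n' "$((i + 1))" "${arr[i]}"
done
```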
Here is an awk command that should work for BSD awk as well:
awk -F '"(,")?' '{for (i=2; i<NF; i++) print $i}' <<< "$s"

How to remove/strip double or single quote from a string?

I have a file with some lines like these:
ENVIRONMENT="myenv"
ENV_DOMAIN='mydomain.net'
LOGIN_KEY=mykey.pem
I want to extract the parts after the = but without the surrounding quotes. I tried with gsub like this:
awk -F= '!/^(#|$)/ && /^ENVIRONMENT=/ {gsub(/"|'/, "", $2); print $2}'
Which ends up with -bash: syntax error near unexpected token ')' error. It works just fine for single matching: /"/ or /'/ but doesn't work when I try match either one. What am I doing wrong?
If you are just trying to remove the punctuation then you can do it as below....
# remove all punctuation
awk -F= '{print $2}' n.dat | tr -d '[:punct:]'
# only remove single and double quotes
awk -F= '{print $2}' n.dat | tr -d \''"'
explanation:
tr -d \''"' deletes any single and double quotes.
tr -d '[:punct:]' deletes every character in the punctuation class (quoted so the shell does not treat it as a glob).
Sample output as below from 2nd command above (without quotes):
myenv
mydomain.net
mykey.pem
The problem is not with awk, but with bash. The single quote inside the gsub is closing the open quote so that bash is trying to parse the command awk with arguments !/^...gsub(/"|/,, ,, $2 and then an unmatched close paren. Try replacing the single quote with '"'"' (so that bash will properly terminate the string, then apply a single quote, then reopen another string.)
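Applied to the original command, the '"'"' replacement could look roughly like this. The two input lines are taken from the question and fed in with printf so the sketch runs without a file.

```shell
# The awk program is three adjacent shell strings:
#   '{ gsub(/"|'   +   "'"   +   '/, "", $2); print $2 }'
out=$(printf '%s\n' 'ENVIRONMENT="myenv"' "ENV_DOMAIN='mydomain.net'" |
  awk -F= '{ gsub(/"|'"'"'/, "", $2); print $2 }')

echo "$out"
# myenv
# mydomain.net
```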
Is awk really a requirement? If not, why don't you use a simple sed command:
sed -rn -e "s/^[^#]+='(.*)'$/\1/p" \
-e "s/^[^#]+=\"(.*)\"$/\1/p" \
-e "s/^[^#]+=(.*)/\1/p" data
This might seem over-engineered, but it works properly with embedded quotes:
sh$ cat data
ENVIRONMENT="myenv"
ENV_DOMAIN='mydomain.net'
LOGIN_KEY=mykey.pem
PASSWD="good ol'passwd"
sh$ sed -rn -e "s/^[^#]+='(.*)'/\1/p" -e "s/^[^#]+=\"(.*)\"/\1/p" -e "s/^[^#]+=(.*)/\1/p" data
myenv
mydomain.net
mykey.pem
good ol'passwd
You can use awk like this:
awk -F "=['\"]?|['\"]" '{print $2}' file
myenv
mydomain.net
mykey.pem
This will work with your awk
awk -F= '!/^(#|$)/ && /^ENVIRONMENT=/ {gsub(/"/,"",$2);gsub(q,"",$2); print $2}' q=\' file
It is the single quote in the expression that creates the problem. Put it in a variable and it will work.
I did the following:
awk -F"=\"|='|'|\"|=" '{print $2}' file
myenv
mydomain.net
mykey.pem
This tells awk to use any of =", =', ', " or = as the field separator.
This is because the awk program must be enclosed in single quotes when run as a command line program. The program can be tripped up if a single quote is contained inside the script. Special tricks can be made to use single quotes as strings inside the program. See Shell-Quoting Issues in the GNU Awk Manual.
One trick is to save the match string as a variable:
awk -F\= -v s="'|\"" '{gsub(s, "", $2); print $2}' file
Output:
myenv
mydomain.net
mykey.pem
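If an external tool is not required at all, plain parameter expansion can sidestep the awk quoting problem. This is only a sketch: strip_quotes is a hypothetical helper name, and it removes at most one leading and one trailing quote, so embedded quotes survive.

```shell
# Hypothetical helper: print the value of a KEY=value line without outer quotes
strip_quotes() {
  val=${1#*=}          # everything after the first =
  val=${val#[\"\']}    # drop one leading quote, if any
  val=${val%[\"\']}    # drop one trailing quote, if any
  printf '%s\n' "$val"
}

strip_quotes 'ENVIRONMENT="myenv"'          # myenv
strip_quotes "ENV_DOMAIN='mydomain.net'"    # mydomain.net
strip_quotes 'LOGIN_KEY=mykey.pem'          # mykey.pem
strip_quotes "PASSWD=\"good ol'passwd\""    # good ol'passwd
```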

Remove everything after 2nd occurrence in a string in unix

I would like to remove everything after the 2nd occurrence of a particular
pattern in a string. What is the best way to do it in Unix? What is most elegant and simple method to achieve this; sed, awk or just unix commands like cut?
My input would be
After-u-math-how-however
Output should be
After-u
Everything after the 2nd - should be stripped out. The regex should also match
zero occurrences of the pattern, so zero or one occurrence should be ignored and
from the 2nd occurrence everything should be removed.
So if the input is as follows
After
Output should be
After
Something like this would do it.
echo "After-u-math-how-however" | cut -f1,2 -d'-'
This will split up (cut) the string into fields, using a dash (-) as the delimiter. Once the string has been split into fields, cut will print the 1st and 2nd fields.
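The question also asks about inputs with fewer than two dashes. As a quick check, the sketch below runs the same cut command on three inputs. A line without any delimiter is printed unchanged, because cut only suppresses such lines when -s is given.

```shell
for input in 'After-u-math-how-however' 'After-u' 'After'; do
  printf '%s\n' "$input" | cut -d'-' -f1,2
done
# After-u
# After-u
# After

# Keep the zero-occurrence result in a variable for later use
none=$(printf '%s\n' 'After' | cut -d'-' -f1,2)
```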
This might work for you (GNU sed):
sed 's/-[^-]*//2g' file
You could use the following regex to select what you want:
^[^-]*-\?[^-]*
For example:
echo "After-u-math-how-however" | grep -o "^[^-]*-\?[^-]*"
Results:
After-u
@EvanPurkisher's cut -f1,2 -d'-' solution is IMHO the best one, but since you asked about sed and awk:
With GNU sed for -r
$ echo "After-u-math-how-however" | sed -r 's/([^-]+-[^-]*).*/\1/'
After-u
With GNU awk for gensub():
$ echo "After-u-math-how-however" | awk '{$0=gensub(/([^-]+-[^-]*).*/,"\\1",1)}1'
After-u
Can be done with non-GNU sed using \( and *, and with non-GNU awk using match() and substr() if necessary.
awk -F - '{print $1 (NF>1? FS $2 : "")}' <<<'After-u-math-how-however'
Split the line into fields based on field separator - (option spec. -F -) - accessible as special variable FS inside the awk program.
Always print the 1st field (print $1), followed by:
If there's more than 1 field (NF>1), append FS (i.e., -) and the 2nd field ($2)
Otherwise: append "", i.e.: effectively only print the 1st field (which in itself may be empty, if the input is empty).
This can be done in pure bash (which means no fork, no external process). Read into an array split on '-', then slice the array:
$ IFS=-
$ read -ra val <<< After-u-math-how-however
$ echo "${val[*]}"
After-u-math-how-however
$ echo "${val[*]:0:2}"
After-u
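One caveat with the interactive session above is that IFS stays set to - for the rest of the shell. A possible way around that, sketched here with a hypothetical function name first_two, is to make IFS local to a function:

```shell
# Hypothetical helper: keep everything before the 2nd dash
first_two() {
  local IFS=-              # confine the IFS change to this function
  local -a parts
  read -ra parts <<< "$1"  # split the argument on -
  echo "${parts[*]:0:2}"   # re-join the first two fields with -
}

first_two After-u-math-how-however   # After-u
first_two After                      # After
```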
awk '$0 = $2 ? $1 FS $2 : $1' FS=-
Result
After-u
After
This will do it in awk:
echo "After" | awk -F "-" '{printf "%s",$1; for (i=2; i<=2 && i<=NF; i++) printf "-%s",$i; print ""}'

SED replace expression "within" a regular expression

I have to change a CSV file column (the date) which is written in the following format:
YYYY-MM-DD
and I would like it to be
YYYY.MM.DD
I can write a succession of 2 sed rules piped one to the other like :
sed 's/-/./' file.csv | sed 's/-/./'
but this is not clean. My question is: is there a way to assign variables in sed and tell it that YYYY-MM-DD should be parsed as year=YYYY ; month=MM ; day=DD and then tell it to
write $year.$month.$day
or something similar? Maybe with awk?
You could use groups and access the year, month, and day directly via backreferences:
sed 's#\([0-9][0-9][0-9][0-9]\)-\([0-9][0-9]\)-\([0-9][0-9]\)#\1.\2.\3#g'
Here's an alternative solution with awk:
awk 'BEGIN { FS=OFS="," } { gsub("-", ".", $1); print }' file.csv
BEGIN { FS=OFS="," } tells awk to break the input lines into fields by , (variable FS, the [input] Field Separator), as well as to also use , when outputting modified input lines (variable OFS, the Output Field Separator).
gsub("-", ".", $1) replaces all - instances with . in field 1
The assumption is that the data is in the 1st field, $1; if the field index is a different one, replace the 1 in $1 accordingly.
print simply outputs the modified input line, terminated with a newline.
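A small run on invented CSV rows shows the effect of restricting gsub to $1: dashes in the other columns are left alone. The sample rows are assumptions, not data from the question.

```shell
out=$(printf '%s\n' '2014-05-15,foo-bar,42' '2014-12-01,x-y,7' |
  awk 'BEGIN { FS = OFS = "," } { gsub("-", ".", $1); print }')

echo "$out"
# 2014.05.15,foo-bar,42
# 2014.12.01,x-y,7
```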
What you are doing is equivalent to supplying the "global" replacement flag:
sed 's/-/./g' file.csv
sed has no variables, but it does have numbered groups:
sed -r 's/([0-9]{4})-([0-9]{2})-([0-9]{2})/\1.\2.\3/g' file.csv
or, if your sed has no -r:
sed 's/\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\)/\1.\2.\3/g' file.csv
You may try this sed command also,
sed 's/\([0-9]\{4\}\)\-\([0-9]\{2\}\)\-\([0-9]\{2\}\)/\1.\2.\3/g' file
Example:
$ (echo '2056-05-15'; echo '2086-12-15'; echo 'foo-bar-go') | sed 's/\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\)/\1.\2.\3/g'
2056.05.15
2086.12.15
foo-bar-go
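To see where the plain g flag and the grouped pattern differ, here is a short comparison on an invented line that has dashes outside the date. It uses sed -E, which both GNU and BSD sed accept; the variable names are only for illustration.

```shell
line='2056-05-15,foo-bar-go'

# Replaces every dash on the line
global=$(printf '%s\n' "$line" | sed 's/-/./g')

# Replaces only dashes inside a YYYY-MM-DD date
grouped=$(printf '%s\n' "$line" |
  sed -E 's/([0-9]{4})-([0-9]{2})-([0-9]{2})/\1.\2.\3/g')

echo "$global"    # 2056.05.15,foo.bar.go
echo "$grouped"   # 2056.05.15,foo-bar-go
```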