Bash/PHP extract URL from HTML via regex

Is there any easy way to extract this URL in bash/or PHP?
http://shop.image-site.com/images/2/format2013/fullies/kju_product.png
From this HTML code?
<a href="javascript: open_window_zoom('http://shop.image-site.com/image.php?image=http://shop.image-site.com/images/2/format2013/fullies/kju_product.png&pID=31777&download=kju.png&name=13011 KELLYS Kju: 490mm (19.5")',550,366);">

With perl you could do a match and a capture
perl -n -e 'print "$1\n" if (m/image=(.*?)\&/);'
This captures everything between image= and the next & and prints it via $1.
For more on regular expressions, see perlre or http://www.regular-expressions.info/
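For comparison, the same non-greedy capture can be sketched in Python. This is only an illustration and not part of the original answer: the variable html holds the anchor tag from the question, and extract_image_url is a hypothetical helper name.

```python
import re

# The anchor tag from the question, kept verbatim in a raw triple-quoted string.
html = r"""<a href="javascript: open_window_zoom('http://shop.image-site.com/image.php?image=http://shop.image-site.com/images/2/format2013/fullies/kju_product.png&pID=31777&download=kju.png&name=13011 KELLYS Kju: 490mm (19.5")',550,366);">"""


def extract_image_url(text):
    """Return whatever sits between 'image=' and the next '&', or None."""
    match = re.search(r"image=(.*?)&", text)
    return match.group(1) if match else None


print(extract_image_url(html))
```

The non-greedy .*? matters here: a greedy .* would run on to the last & in the line.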

In bash, you can try the following:
sed 's/.*image=\(http:\/\/[^&]*\).*/\1/g'
Update:
The solution above performs substitution rather than extraction. The line containing the pattern (required url) is replaced by the pattern itself. However, the substitution isn't in-place.

Whichever way you decide to dress it up, you could simply split with the delimiter equal to ?image= and then split the second token you receive (i.e. result[1]) with a simple & delimiter. The first result from that split is your answer.
However, a pure regex match would look something like: m#image=([a-z0-9_:/.\-]+)&#i. You can use that regex wherever you like, and the result is stored in $1. Despite what a lot of people think, you do not have to match the beginning and the end of a line to match a result.
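The split-based approach described above might look like this in Python. It is a rough sketch under the assumption that the text contains ?image= at most once; url_by_splitting and the sample string are made up for illustration.

```python
def url_by_splitting(text):
    """Split on '?image=' first, then cut the remainder at the first '&'."""
    parts = text.split("?image=", 1)
    if len(parts) < 2:
        return None  # the '?image=' delimiter is not present
    return parts[1].split("&", 1)[0]


sample = (
    "open_window_zoom('http://shop.image-site.com/image.php"
    "?image=http://shop.image-site.com/images/2/format2013/fullies/kju_product.png"
    "&pID=31777&download=kju.png')"
)
print(url_by_splitting(sample))
```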

Try doing this :
xmllint --html --xpath '//a/@href' file://file.html |
grep -oP 'image=\Khttp://.*?\.png'
You can use a URL instead of a local file:
http://domain.tld/path
Or if you had already extracted the line to parse in the $string variable :
grep -oP 'image=\Khttp://.*?\.png' <<< "$string"

Related

How to find and replace a pattern string using sed/perl/awk?

I have a file foo.properties with contents like
foo=bar
# another property
test=true
allNames=alpha:.02,beta:0.25,ph:0.03,delta:1.0,gamma:.5
In my script, I need to replace whatever value is set for ph (the current value is unknown to the bash script) and change it to 0.5. So the file should look like
foo=bar
# another property
test=true
allNames=alpha:.02,beta:0.25,ph:0.5,delta:1.0,gamma:.5
I know it can be easily done if the current value is known by using
sed "s/\,ph\:0.03\,/\,ph\:0.5\,/" foo.properties
But in my case, I have to actually read the contents against allNames and search for the value and then replace within a for loop. Rest all is taken care of but I can't figure out the sed/perl command for this.
I tried using sed "s/\,ph\:.*\,/\,ph\:0.5\,/" foo.properties and some variations but it didn't work.

A simpler sed solution:
sed -E 's/([=,]ph:)[0-9.]+/\10.5/g' file
foo=bar
# another property
test=true
allNames=alpha:.02,beta:0.25,ph:0.5,delta:1.0,gamma:.5
Here we match ([=,]ph:) (i.e. , or = followed by ph:) and capture it in group #1. This must be followed by one or more [0-9.] characters to match any number. In the replacement we put \1 back, followed by 0.5.
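The same capture-and-put-back idea can be tried out in Python. This is just a sketch for experimenting with the pattern, not a replacement for the sed command; the variable names are illustrative.

```python
import re

line = "allNames=alpha:.02,beta:0.25,ph:0.03,delta:1.0,gamma:.5"

# \g<1> is used instead of \1 because Python would read "\10" as group 10.
updated = re.sub(r"([=,]ph:)[0-9.]+", r"\g<1>0.5", line)
print(updated)
```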

With your shown samples, please try following awk code.
awk -v new_val="0.5" '
match($0,/,ph:[0-9]+(\.[0-9]+)?/){
val=substr($0,RSTART+1,RLENGTH-1)
sub(/:.*/,":",val)
print substr($0,1,RSTART) val new_val substr($0,RSTART+RLENGTH)
next
}
1
' Input_file
Detailed explanation: The awk variable new_val holds the new value to put in. In the main program, the match function looks for the regex ,ph:[0-9]+(\.[0-9]+)? in each line. If a match is found, the matched text without its leading comma is stored in the variable val, and everything from : to the end of val is then replaced with a bare :. The line is printed as the text up to and including the comma, then val, then the new value, then the rest of the line. next skips further processing of that line, and the trailing 1 prints all other lines, which do not contain a match.
2nd solution: Using sub function of awk.
awk -v newVal="0.5" '/^allNames=/{sub(/,ph:[^,]*/,",ph:"newVal)} 1' Input_file

Would you please try a perl solution:
perl -pe '
s/(?<=\bph:)[\d.]+(?=,|$)/0.5/;
' foo.properties
The -pe option makes perl read the input line by line, perform the operation, then print it, as sed does.
The regex (?<=\bph:) is a zero-length lookbehind which matches the string ph: preceded by a word boundary.
The regex [\d.]+ will match a decimal number.
The regex (?=,|$) is a zero-length lookahead which matches a comma or the end of the string.
As the lookbehind and the lookahead have zero length, they are not substituted by the s/../../ operator.
[Edit]
As Dave Cross comments, the lookahead (?=,|$) is unnecessary as long as the input file is correctly formatted.

Works with decimal place or not, or no value, anywhere in the line.
sed -E 's/(^|[^-_[:alnum:]])ph:[0-9]*(\.[0-9]+)?/\1ph:0.5/g'
Or possibly:
sed -E 's/(^|[=,[:space:]])ph:[0-9]+(\.[0-9]+)?/\1ph:0.5/g'
The top one uses "not other naming characters" to describe the character immediately before a name, the bottom one uses delimiter characters (you could add more characters to either). The purpose is to avoid clashing with other_ph or autograph.
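To see why the character before ph: matters, here is a small Python sketch of the delimiter-based variant. It puts the captured delimiter back in front of ph:, and the names PATTERN and set_ph are hypothetical.

```python
import re

# ph: must be at the start of the line or follow '=', ',' or whitespace.
PATTERN = re.compile(r"(^|[=,\s])ph:[0-9]*(?:\.[0-9]+)?")


def set_ph(line, value="0.5"):
    """Replace the value of ph while leaving names such as other_ph alone."""
    return PATTERN.sub(lambda m: f"{m.group(1)}ph:{value}", line)


print(set_ph("allNames=alpha:.02,ph:0.03,other_ph:7,autograph:2"))
```

Only the standalone ph entry changes; other_ph and autograph are skipped because the character before ph: is not one of the listed delimiters.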

Here you go
#!/usr/bin/perl
use strict;
use warnings;
print "\nPerl Starting ... \n\n";
while (my $recordLine = <DATA>)
{
    chomp($recordLine);
    if (index($recordLine, "ph:") != -1)
    {
        $recordLine =~ s/ph:.*?,/ph:0.5,/g;
        print "recordLine: $recordLine ...\n";
    }
}
print "\nPerl End ... \n\n";
__DATA__
foo=bar
# another property
test=true
allNames=alpha:.02,beta:0.25,ph:0.03,delta:1.0,gamma:.5
output:
Perl Starting ...
recordLine: allNames=alpha:.02,beta:0.25,ph:0.5,delta:1.0,gamma:.5 ...
Perl End ...

Using any sed in any shell on every Unix box (the other sed solutions posted that use sed -E require GNU or BSD seds):
a) if ph: is never the first tag in the allNames list (as shown in your sample input):
$ sed 's/\(,ph:\)[^,]*/\10.5/' foo.properties
foo=bar
# another property
test=true
allNames=alpha:.02,beta:0.25,ph:0.5,delta:1.0,gamma:.5
b) or if it can be first:
$ sed 's/\([,=]ph:\)[^,]*/\10.5/' foo.properties
foo=bar
# another property
test=true
allNames=alpha:.02,beta:0.25,ph:0.5,delta:1.0,gamma:.5

Use "sed" to Remove Capture Group 1 From All Lines In a File

I currently have a file with lines like the below:
ABCD123RTY,steve_tyler#gmail.com,10.20.30.142,2021-08-20T14:49:51.035Z
ABCD123QWE,thisguy#hotmail.com,10.20.30.245,2021-08-20T14:10:22.254Z
ABCD123DFG,calvin_hobbes2#netnet,10.20.30.l6,2021-08-20T15:30:34.480Z
My goal is to remove everything from the "#" to the next comma, such that it instead looks like the below:
ABCD123RTY,steve_tyler,10.20.30.142,2021-08-20T14:49:51.035Z
ABCD123QWE,thisguy,10.20.30.245,2021-08-20T14:10:22.254Z
ABCD123DFG,calvin_hobbes2,10.20.30.l6,2021-08-20T15:30:34.480Z
I'm not that experienced with utilizing sed and RegEx expressions. In playing around on a testing website, I came up with the below RegEx string, in which capture group 1 is perfectly matching to what I want to remove:
regex101.com Test
How would I go about putting this in a "sed" command against a given input file, and writing the results to a new output file. I had tried the below most recently:
sed 's/(#.+?),//' input.csv > input_Corrected.csv
Just as another note, I'm doing this in a bash script in which I have an API call generating the "input.csv" file, and then want to run this sed command to clean up the data format to match my needs.

You can use
sed 's/#[^,]*,/,/' input.csv > input_Corrected.csv
sed 's/#[^,]*//' input.csv > input_Corrected.csv
The POSIX BRE pattern #[^,]*, matches a #, then zero or more characters other than a comma, then a comma. The first command replaces that match with a single comma, so use it when a comma MUST follow the match. The second command drops the trailing comma from the pattern and uses an empty replacement.
See the online demo:
s='ABCD123RTY,steve_tyler#gmail.com,10.20.30.142,2021-08-20T14:49:51.035Z
ABCD123QWE,thisguy#hotmail.com,10.20.30.245,2021-08-20T14:10:22.254Z
ABCD123DFG,calvin_hobbes2#netnet,10.20.30.l6,2021-08-20T15:30:34.480Z'
sed 's/#[^,]*,/,/' <<< "$s"
Output:
ABCD123RTY,steve_tyler,10.20.30.142,2021-08-20T14:49:51.035Z
ABCD123QWE,thisguy,10.20.30.245,2021-08-20T14:10:22.254Z
ABCD123DFG,calvin_hobbes2,10.20.30.l6,2021-08-20T15:30:34.480Z
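If the cleanup ever moves out of the shell script, the second pattern carries over to Python unchanged. A minimal sketch, assuming the rows are already in memory; strip_after_hash is a hypothetical name.

```python
import re

rows = [
    "ABCD123RTY,steve_tyler#gmail.com,10.20.30.142,2021-08-20T14:49:51.035Z",
    "ABCD123QWE,thisguy#hotmail.com,10.20.30.245,2021-08-20T14:10:22.254Z",
    "ABCD123DFG,calvin_hobbes2#netnet,10.20.30.l6,2021-08-20T15:30:34.480Z",
]


def strip_after_hash(row):
    """Drop the '#' and everything after it up to (not including) the next comma."""
    return re.sub(r"#[^,]*", "", row, count=1)


for row in rows:
    print(strip_after_hash(row))
```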

You can use the regular expression below to remove the domain part of valid email addresses only.
sed -E 's/#[a-zA-Z0-9_.-]+\.[a-zA-Z]{2,5}//g' input.csv > input_Corrected.csv
To meet your stated requirement, use the command below instead. It strips everything after # in every entry, including "calvin_hobbes2#netnet", which is not a valid email address and is left untouched by the first command.
sed "s/#[^,]*//g" input.csv > input_Corrected.csv

pattern match and add line at the end or start of a line in a text file using sed

I have a text file which contains:
First link https://cdn.shopify.com/s/files/1/0151/0741/products/2c60070615ceaa44c934ca876fe4ccc0_f304e840-bb1d-4bcf-a993-d966c0b99ae3.jpeg?v=1452842355
Second link https://cdn.shopify.com/s/files/1/0151/0741/products/549542c704da78a0e5208b9f8c2cd26e.jpeg?v=1452842263
Third link https://cdn.shopify.com/s/files/1/0151/0741/products/2c60070615ceaa44c934ca876fe4ccc0_70e7e6b9-bedd-40a7-b322-542facf94c05.jpeg?v=1452842230
Fourth link https://cdn.shopify.com/s/files/1/0151/0741/products/2c60070615ceaa44c934ca876fe4ccc0_5485fd04-c852-4fd7-b142-92595329568a.jpeg?v=1452841841
lst link https://cdn.shopify.com/s/files/1/0151/0741/products/2c60070615ceaa44c934ca876fe4ccc0_fb613b45-fbbb-4b6d-b9c0-45d7f069879e.jpeg?v=1452841831
I want to match last url and append a word at start or end of the line using sed.
But it is not working. HELP
output of the command gives this error.
$sed -e 's_https://cdn.shopify.com/s/files/1/0151/0741/products/2c60070615ceaa44c934ca876fe4ccc0_f304e840-bb1d-4bcf-a993-d966c0b99ae3.jpeg\?v=1452842355 .*_& NOTFOUND_'
sed: -e expression #1, char 148: unknown option to `s'

Unfortunately, sed is not the best tool for this task. There is no way to pass a plain non-regex string in a sed pattern without doing all the escaping beforehand.
Better to use awk for this:
awk 'index($0, "https://cdn.shopify.com/s/files/1/0151/0741/products/2c60070615ceaa44c934ca876fe4ccc0_fb613b45-fbbb-4b6d-b9c0-45d7f069879e.jpeg?v=1452841831"){
$0 = $0 " NOTFOUND"} 1' file
The index function just searches for the presence of the given URL in a record and, if it is found, appends the string " NOTFOUND" at the end.
Equivalent working sed would be this:
sed 's~https://cdn\.shopify\.com/s/files/1/0151/0741/products/2c60070615ceaa44c934ca876fe4ccc0_fb613b45-fbbb-4b6d-b9c0-45d7f069879e\.jpeg?v=1452841831.*~& NOTFOUND~' file
As you can see it requires you to escape all the DOTs and pick a regex delimiter which is not already present in input string.
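The difference between a literal search and a regex search can be checked in Python, where a plain substring test plays the role of awk's index and re.escape does the escaping for you. This is a sketch only; mark_line is a hypothetical helper.

```python
import re

url = ("https://cdn.shopify.com/s/files/1/0151/0741/products/"
       "2c60070615ceaa44c934ca876fe4ccc0_fb613b45-fbbb-4b6d-b9c0-45d7f069879e"
       ".jpeg?v=1452841831")
line = "lst link " + url


def mark_line(text, needle, suffix=" NOTFOUND"):
    """Append suffix when needle occurs literally in text (no regex involved)."""
    return text + suffix if needle in text else text


print(mark_line(line, url))

# Used as a regex, the raw URL fails to match itself because '?' is a quantifier.
print(re.search(url, line) is None)
# Escaping every metacharacter makes the regex match the literal text again.
print(re.search(re.escape(url), line) is not None)
```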

Why are you using _ as your regex delimiter, when that char shows up in the URLs?
[..snip..]/products/2c60070615ceaa44c934ca876fe4ccc0_fb613b45-fb
^---
You're effectively doing
s/.../f
and f is an unknown modifier for an s/ regex.

The pattern has an underscore ...fe4ccc0_f304... which you used as the delimiter for the substitute command. Use some other delimiter that does not appear unescaped in the pattern or replacement string.
Try using | character instead, as in s|http://... .*$|& NOT_FOUND|.

Find all text within square brackets using regex

I have a problem that because of PHP version, I need to change my code from $array[stringindex] to $array['stringindex'];
So I want to find all the text using regex, and replace them all. How to find all strings that look like this? $array[stringindex].

Here's a solution in PHP:
$re = "/(\\$[[:alpha:]][[:alnum:]]+\\[)([[:alpha:]][[:alnum:]]+)(\\])/";
$str = "here is \$array[stringindex] but not \$array['stringindex'] nor \$3array[stringindex] nor \$array[4stringindex]";
$subst = "$1'$2'$3";
$result = preg_replace($re, $subst, $str);
You can try it out interactively here. I search for variables beginning with a letter, otherwise things like $foo[42] would be converted to $foo['42'], which might not be desirable.
Note that all the solutions here will not handle every case correctly.
Looking at the Sublime Text regex help, it would seem you could just paste (\$[[:alpha:]][[:alnum:]]+\[)([[:alpha:]][[:alnum:]]+)(\]) into the Search box and $1'$2'$3 into the Replace field. Note the single backslashes: the doubled ones above are only needed inside a PHP string.
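A comparable rewrite can be prototyped in Python before running it over real files. This sketch is an assumption-laden variant rather than a port of the PHP pattern: it also accepts underscores and one-letter names, and quote_bare_indices is a made-up name. Like the other solutions, it cannot tell a bare string index from a PHP constant.

```python
import re

# $name[ followed by a bare identifier and ] ; quoted, numeric and $variable
# indices do not match because they do not start with a letter or underscore.
BARE_INDEX = re.compile(r"(\$[A-Za-z_]\w*\[)([A-Za-z_]\w*)(\])")


def quote_bare_indices(source):
    """Turn $array[stringindex] into $array['stringindex']."""
    return BARE_INDEX.sub(r"\1'\2'\3", source)


print(quote_bare_indices("echo $array[stringindex] . $foo[42] . $bar['done'];"))
```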

It depends on the tool you want to use to do the replacement.
With sed, for example, it would be something like this:
sed "s/\(\$array\)\[\([^]]*\)\]/\1['\2']/g"

If sed is allowed you could simply do:
sed -i "s/\(\$[^[]*\[\)\([^]]*\)\]/\1'\2']/g" file
Explanation:
sed "s/pattern/replace/g" is a sed command which searches for pattern and replaces it with replace. The g flag means replace every match on a line, not just the first.
\(\$[^[]*\[\)\([^]]*\)\] this pattern consists of two groups (between escaped parentheses). The first is a dollar sign followed by a series of non-[ characters and an opening square bracket. The second is a series of characters other than a closing bracket, and it is followed by a closing square bracket.
\1'\2'] the replacement string: \1 means insert the first captured group (and likewise for \2). Basically we wrap \2 in quotes (which is what you wanted).
The -i option means that the changes are applied to the original file, which is supplied at the end.
For more information, see man sed.
This can be combined with the find command, as follows:
find . -name '*.php' -exec sed -i "s/\(\$[^[]*\[\)\([^]]*\)\]/\1'\2']/g" '{}' \;
This will apply the sed command to all php files found.

Is there a truly universal wildcard in Grep? [duplicate]

This question already has answers here:
How do I match any character across multiple lines in a regular expression?
(26 answers)
Closed 3 years ago.
Really basic question here. So I'm told that a dot . matches any character EXCEPT a line break. I'm looking for something that matches any character, including line breaks.
All I want to do is to capture all the text in a website page between two specific strings, stripping the header and the footer. Something like HEADER TEXT(.+)FOOTER TEXT and then extract what's in the parentheses, but I can't find a way to include all text AND line breaks between header and footer, does this make sense? Thanks in advance!

When I need to match several characters, including line breaks, I do:
[\s\S]*?
Note I'm using a non-greedy pattern

You could do it with Perl:
$ perl -ne 'print if /HEADER TEXT/ .. /FOOTER TEXT/' file.html
To print only the text between the delimiters, use
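$ perl -l -0777 -ne 'print $1 while /HEADER TEXT(.+?)FOOTER TEXT/sg' file.html
The -0777 switch makes perl read the whole file as a single record (-00 would only give paragraph mode, which stops at blank lines). The /s switch makes the regular expression matcher treat the entire string as a single line, which means dot matches newlines, and /g means match as many times as possible.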
The examples above assume you're cranking on HTML files on the local disk. If you need to fetch them first, use get from LWP::Simple:
$ perl -MLWP::Simple -le '$_ = get "http://stackoverflow.com";
print $1 while m!<head>(.+?)</head>!sg'
Please note that parsing HTML with regular expressions as above does not work in the general case! If you're working on a quick-and-dirty scanner, fine, but for an application that needs to be more robust, use a real parser.
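The two ways of letting the wildcard cross line breaks can be compared side by side in Python, where re.DOTALL corresponds to Perl's /s. This is a small sketch on a made-up page string, with the same caveat about regexes and HTML.

```python
import re

page = "intro\nHEADER TEXT\nline one\n\nline two\nFOOTER TEXT\noutro\n"

# With DOTALL the dot also matches newlines; +? keeps the match non-greedy.
with_flag = re.findall(r"HEADER TEXT(.+?)FOOTER TEXT", page, flags=re.DOTALL)

# Without any flag, [\s\S] matches every character, newlines included.
with_class = re.findall(r"HEADER TEXT([\s\S]+?)FOOTER TEXT", page)

print(with_flag == with_class)
print(repr(with_flag[0]))
```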

By definition, grep looks for lines which match; it reads a line, sees whether it matches, and prints the line.
One possible way to do what you want is with sed:
sed -n '/HEADER TEXT/,/FOOTER TEXT/p' "$@"
This prints from the first line that matches 'HEADER TEXT' to the first line that matches 'FOOTER TEXT', and then iterates; the '-n' stops the default 'print each line' operation. This won't work well if the header and footer text appear on the same line.
To do what you want, I'd probably use perl (but you could use Python if you prefer). I'd consider slurping the whole file, and then use a suitably qualified regex to find the matching portions of the file. However, the Perl one-liner given by '@gbacon' is an almost exact transliteration into Perl of the 'sed' script above and is neater than slurping.
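For readers who prefer Python, the line-range behaviour of the sed script can be approximated with a small generator. This is a sketch, not an exact sed emulation: it uses plain substring tests instead of regexes, and lines_between is a hypothetical name.

```python
def lines_between(lines, start="HEADER TEXT", end="FOOTER TEXT"):
    """Yield each line from one containing start through the next containing end."""
    inside = False
    for line in lines:
        if inside:
            yield line
            if end in line:
                inside = False  # range closed; look for the next start line
        elif start in line:
            inside = True  # like sed, the end test begins on the following line
            yield line


sample = ["intro", "HEADER TEXT", "body", "FOOTER TEXT", "outro"]
print(list(lines_between(sample)))
```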

The man page of grep says:
grep, egrep, fgrep, rgrep - print lines matching a pattern
grep is not made for matching more than a single line. You should try to solve this task with perl or awk.

As this is tagged with 'bbedit' and BBedit supports Perl-Style Pattern Modifiers you can allow the dot to match linebreaks with the switch (?s)
(?s).
will match ANY character. And yes,
(?s).+
will match the whole text.

As pointed out elsewhere, grep will work for single-line stuff.
For multiple-lines (in ruby with Regexp::MULTILINE, or in python, awk, sed, whatever), "\s" should also capture line breaks, so
HEADER TEXT(.*\s*)FOOTER TEXT
might work ...

here's one way to do it with gawk, if you have it
awk -vRS="FOOTER" '/HEADER/{gsub(/.*HEADER/,"");print}' file

We Keep Coding
