How to extract a part of the url through regular expression in textwrangler? - regex

My work involves manipulating a lot of data. I use TextWrangler as my text editor, but I expect the approach is the same in any text editor.
Here is a sample URL:
http://example.com/swatches/frisk-watches/pr?p[]=sort%3Dpopularity&sid=812%2Cf13&offer=GsdOfferOnWatches07.&ref=4c83d65f-bfaf-4db6-b5f5-d733d7b1d2af
I want to capture the text GsdOfferOnWatches07. (that is, everything after offer= and before &ref) using a regular expression in TextWrangler's Find (Ctrl+F) dialog.
How can I do that?

$link = 'http://example.com/swatches/frisk-watches/pr?p[]=sort%3Dpopularity&sid=812%2Cf13&offer=GsdOfferOnWatches07.&ref=4c83d65f-bfaf-4db6-b5f5-d733d7b1d2af';
preg_match('/offer=(.*?)&ref/', $link, $match);
echo $match[1];
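For comparison, here is a minimal Python sketch of the same idea. It uses the same lazy pattern as the PHP answer above, and also shows a query-string parser as an alternative that does not rely on ref coming right after offer. The variable names are illustrative only and not part of the original answer.

```python
import re
from urllib.parse import urlsplit, parse_qs

link = ("http://example.com/swatches/frisk-watches/pr?p[]=sort%3Dpopularity"
        "&sid=812%2Cf13&offer=GsdOfferOnWatches07."
        "&ref=4c83d65f-bfaf-4db6-b5f5-d733d7b1d2af")

# Same lazy capture as the PHP pattern: everything between "offer=" and "&ref".
match = re.search(r"offer=(.*?)&ref", link)
offer_by_regex = match.group(1) if match else None

# Alternative: parse the query string, so parameter order does not matter.
offer_by_parser = parse_qs(urlsplit(link).query).get("offer", [None])[0]

print(offer_by_regex)
print(offer_by_parser)
```

The pattern offer=(.*?)&ref itself should also work in an editor's find dialog, provided its regular-expression (grep) mode is switched on.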

Related

Using multiple Perl regular expressions to find and replace

I'm a Perl and regex newcomer in need of your expertise.
I need to process text files that include placeholder lines like Foo Bar1.jpg and replace those with corresponding URLs like https:/baz/qux/Foo_Bar1.jpg.
As you may have guessed, I'm working with HTML. The placeholder text refers to the filename, which is the only thing available when writing the document. That's why I have to use placeholder text. Ultimately, of course, I want to replace the filename with the URL (after I upload file to my CMS to get the URL). At that point, I have all of the information at hand — the filename and the URL. Of course, I could just paste the URLs over the placeholder names in the HTML document. In fact, I've done that. But I'm certain that there's a better way.
In short, I have placeholder lines like this:
Foo Bar1.jpg
Foo Bar2.jpg
Foo Bar3.jpg
And I also have URL lines like this:
https:/baz/qux/Foo_Bar1.jpg
https:/baz/qux/Foo_Bar2.jpg
https:/baz/qux/Foo_Bar3.jpg
I want to find the placeholder string and capture a differentiator like Bar1 with a regex. Then I want to use the captured part like Bar1 to perform another regex search that matches part of the corresponding URL string, i.e. https:/baz/qux/Foo_Bar1.jpg. After a successful match, I want to replace the Foo Bar1.jpg line with https:/baz/qux/Foo_Bar1.jpg.
Ultimately, I want to do that for every permutation, so that https:/baz/qux/Foo_Bar2.jpg also replaces Foo Bar2.jpg and so on.
I've written regular expressions that match both the placeholder and the URL. That's not my problem, as far as I can tell. I can find the strings I need to process. For example, /[a-z]+\s([a-z0-9]+)\.jpg/ successfully matches what I'm calling the placeholder text and captures what I'm calling the differentiator.
However, though I've spent an embarrassing number of hours over the past week reading through Stack Overflow, various other sites and O'Reilly books on Perl and Perl regular expressions, I can't wrap my mind around how to process what I can find.
I think the piece you are missing is the idea of using Perl's internal grep function, for searching a list of URL lines based on what you are calling your "differentiator".
Slurp your URL lines into a Perl array (assuming there are a finite manageable number of them, so that memory is not clobbered):
open(my $urls_fh, '<', 'theUrlFile.txt') or die "Cannot open theUrlFile.txt: $!\n";
chomp(my @urls = <$urls_fh>);  # one URL per element, trailing newlines removed
Then within the loop over your file containing "placeholders":
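# $_ holds the current line of the placeholder file.
# The /i flag is needed because names like "Foo Bar1" contain capital letters.
if (my ($key) = /[a-z]+\s([a-z0-9]+)\.jpg/i) {
    # Keep only the URLs that contain the captured differentiator.
    my @matches = grep { /\Q$key\E\.jpg/ } @urls;
    if (@matches) {
        s/[a-z]+\s\Q$key\E\.jpg/$matches[0]/i;
    }
}
You may also want to emit an error or warning message if @matches != 1.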
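The same lookup-and-replace idea can be sketched in Python. This is an illustration under assumptions, not the answerer's code: the function name fill_placeholders is hypothetical, and it assumes each URL ends with the differentiator followed by .jpg.

```python
import re

# The asker's placeholder pattern; IGNORECASE because "Foo Bar1" has capitals.
PLACEHOLDER = re.compile(r"[a-z]+\s([a-z0-9]+)\.jpg", re.IGNORECASE)


def fill_placeholders(lines, urls):
    """Replace each placeholder with the first URL ending in its differentiator."""

    def to_url(match):
        key = match.group(1)
        ending = re.compile(re.escape(key) + r"\.jpg$", re.IGNORECASE)
        candidates = [u for u in urls if ending.search(u)]
        # Leave the placeholder untouched when no URL matches.
        return candidates[0] if candidates else match.group(0)

    return [PLACEHOLDER.sub(to_url, line) for line in lines]


urls = [
    "https:/baz/qux/Foo_Bar3.jpg",
    "https:/baz/qux/Foo_Bar1.jpg",
    "https:/baz/qux/Foo_Bar2.jpg",
]
lines = ["Foo Bar1.jpg", "Foo Bar2.jpg", "Foo Bar3.jpg", "no placeholder here"]
filled = fill_placeholders(lines, urls)
print("\n".join(filled))
```

Anchoring the key to the end of the URL keeps Bar1 from also matching a URL such as Foo_Bar10.jpg, which a plain substring search would allow.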

remove multiple style tags using regular expression in C#

I have the following HTML source, which contains two style tags. Using regular expressions we are able to remove all the HTML tags from the file, but we are not able to remove the content of the second style tag:
<style id="owaParaStyle" type="text/css">P {margin-top:0;margin-bottom:0;}</style>
C# Code
Regex test = new Regex(@"<[^\>]*>{}");
strText = test.Replace(strText, String.Empty);
Output:
The expected result is blank, but we get P {margin-top:0;margin-bottom:0;}
Do you want to remove the style tag?
<style.*?</style>
I would not generally recommend using regex to match HTML/XML unless you are sure that it always has a certain structure. There are better tools for manipulating XML.
But I want the attributes and values of the style tag to be removed as well.
You can try a back-reference, which matches the same text that was previously matched by a capturing group.
To remove everything from <...> to </...>, use the regex below, which looks for matching opening and closing HTML tags.
<(\w+)[^>]*>.*<\/\1>
Here (\w+) is capturing group 1 (the tag name), and \1 is a back-reference to whatever that group matched.
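To illustrate the back-reference outside C#, here is a small Python sketch. It is a variation on the pattern above rather than a copy: the lazy .*? and the DOTALL flag are additions made here so that two separate tag pairs are not swallowed in one match and so that content spanning lines is covered.

```python
import re

html = ('<style id="owaParaStyle" type="text/css">'
        'P {margin-top:0;margin-bottom:0;}</style>')

# (\w+) captures the tag name; \1 requires the closing tag to repeat it.
paired_tag = re.compile(r"<(\w+)[^>]*>.*?</\1>", re.DOTALL)

stripped = paired_tag.sub("", html)
two_blocks = paired_tag.sub("", "a<style>x</style>b<style>y</style>c")

print(repr(stripped))
print(two_blocks)
```

Note that this removes any paired element together with its content, not only style elements, so it fits the question only when that is what you want.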

Find/Replace regex to remove html tags

Using find and replace, what regex would remove the tags surrounding something like this:
<option value="863">Viticulture and Enology</option>
Note: the option value changes to different numbers, but using a regular expression to remove numbers is acceptable
I am still trying to learn but I can't get it to work.
I'm not using it to parse HTML. I have data from one of our company websites that we need in Excel, but our designer deleted the original data file and we need it back. I have a list of the options and need to remove the HTML tags, using find and replace in Notepad++.
This works for me in Notepad++ 5.8.6 (UNICODE):
search : <option value="\d+">(.*?)</option>
replace : $1
Be sure to select "Regular expression" and ". matches newline"
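The same search and replace can be tried outside the editor. The following is a rough Python equivalent, offered only as a sketch; note that Python writes the group reference as \1 where Notepad++ uses $1, and the variable names are made up for the example.

```python
import re

text = ('<option value="863">Viticulture and Enology</option>\n'
        '<option value="864">Plant Sciences</option>')

# DOTALL plays the role of the ". matches newline" checkbox.
option_re = re.compile(r'<option value="\d+">(.*?)</option>', re.DOTALL)

labels_only = option_re.sub(r"\1", text)
print(labels_only)
```

The second option line is invented sample data, added only to show that every match on every line is replaced.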
I did this with the following regular expression:
Find: <.*?>|</.*?>
Replace with: \r\n (a new line)
The pattern <.*?>|</.*?> matches every opening and closing tag, so replacing the matches leaves only the values between the tags.
For example, given this input:
<option value="123">1</option><option value="1234">2</option><option value="1235">3</option><option value="1236">4</option><option value="1237">5</option>
I need the values between the option tags (1, 2, 3, 4, 5). After the replacement, each value ends up on its own line.
This works perfectly for me:
Select "Regular Expression" in "Find" Mode.
Enter [<].*?> in "Find What" field and leave the "Replace With" field empty.
Note that you need to have version 5.9 of Notepad++ for the ? operator to work.
as found here:
digoCOdigo - strip html tags in notepad++
Something like this would work (as long as you know the format of the HTML won't change):
<option value="(\d+)">(.+)</option>
String s = "<option value=\"863\">Viticulture and Enology</option>";
s.replaceAll ("(<option value=\"[0-9]+\">)([^<]+)</option>", "$2")
res1: java.lang.String = Viticulture and Enology
(Tested with scala, therefore the res1:)
With sed, you would use a little different syntax:
echo '<option value="863">Viticulture and Enology</option>'|sed -re 's|(<option value="[0-9]+">)([^<]+)</option>|\2|'
For Notepad++, I don't know the details, but "[0-9]+" should mean 'at least one digit' and "[^<]+" 'anything but an opening less-than sign, one or more times'. Escaping and backreference syntax may differ.
Regexes are problematic here: if a tag spans multiple lines or is hidden inside a comment, a simple regex will not handle it correctly.
However, a lot of HTML is generated in a regex-friendly way, with each element on a single line and nothing commented out. Or you are using the regex in throwaway code and can check your input beforehand.

Regular expression for changing links in Dreamweaver

I'm in the process of moving my Dreamweaver-based website to a CMS, and I would like to replace site-wide the following kind of links:
a href="http://www.domain.com/category/item ### title.html" (where ### is a number)
to
a href="http://www.domain.com/category/item###"
What is the correct regular expression I should use in the find and replace built-in engine?
I propose
(http://www\.domain\.com/category/item) *(\d+).+?\.html
as the regular expression, replacing the entire match with $1$2 (the first captured group immediately followed by the second).
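Before running a site-wide replace, the proposed pattern can be checked on a sample link. This Python sketch does that; the sample link text is made up for illustration, and Python spells the replacement \1\2 rather than $1$2.

```python
import re

# Dots in the host name are escaped so they match only literal dots.
link_re = re.compile(r"(http://www\.domain\.com/category/item) *(\d+).+?\.html")

old = '<a href="http://www.domain.com/category/item 123 some title.html">'
new = link_re.sub(r"\1\2", old)
print(new)
```

Everything from the number up to and including .html is consumed by the match, so only the base URL and the number survive in the result.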

Regular Expression to extract src attribute from img tag

I am trying to write a pattern for extracting the path for files found in img tags in HTML.
String string = "<img src=\"file:/C:/Documents and Settings/elundqvist/My Documents/My Pictures/import dialog step 1.JPG\" border=\"0\" />";
My Pattern:
src\\s*=\\s*\"(.+)\"
Problem is that my pattern will also include the border="0" part of the img tag.
What pattern would match the URI path for this file without including border="0"?
Your pattern should be (unescaped):
src\s*=\s*"(.+?)"
The important part is the added question mark, which makes the quantifier lazy so that the group matches as few characters as possible.
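The difference between the greedy and the lazy quantifier is easy to see on the img tag from the question. The following Python sketch runs both patterns against it; the variable names are illustrative.

```python
import re

tag = ('<img src="file:/C:/Documents and Settings/elundqvist/My Documents/'
       'My Pictures/import dialog step 1.JPG" border="0" />')

# Greedy: .+ runs to the LAST double quote in the tag.
greedy = re.search(r'src\s*=\s*"(.+)"', tag).group(1)

# Lazy: .+? stops at the FIRST double quote after the path.
lazy = re.search(r'src\s*=\s*"(.+?)"', tag).group(1)

print(greedy)
print(lazy)
```

The greedy capture runs on through border="0, while the lazy capture ends at the closing quote of the src value.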
This one grabs the src only if it's inside an img tag, not when it appears anywhere else as plain text. It also allows other attributes before or after the src attribute.
Also, it determines whether you're using single (') or double (") quotes.
\<img.+src\=(?:\"|\')(.+?)(?:\"|\')(?:.+?)\>
So for PHP you would do:
preg_match("/\<img.+src\=(?:\"|\')(.+?)(?:\"|\')(?:.+?)\>/", $string, $matches);
echo "$matches[1]";
for JavaScript you would do:
var match = text.match(/\<img.+src\=(?:\"|\')(.+?)(?:\"|\')(?:.+?)\>/)
alert(match[1]);
Hopefully that helps.
Try this expression:
src\s*=\s*"([^"]+)"
I solved it by using this regex.
/<img.*?src="(.*?)"/g
Validated in https://regex101.com/r/aVBUOo/1
You want the non-greedy (lazy) form of the capture group. Something like
src\\s*=\\s*\"(.+?)\"
By default, a regex tries to match as much as possible; the ? after .+ makes it stop at the first closing quote.
I am trying to write a pattern for extracting the path for files found in img tags in HTML.
Can we have an autoresponder for "Don't use regex to parse [X]HTML"?
Problem is that my pattern will also include the 'border="0" part of the img tag.
Not to mention any time 'src="' appears in plain text!
If you know in advance the exact format of the HTML you're going to be parsing (eg. because you generated it yourself), you can get away with it. But otherwise, regex is entirely the wrong tool for the job.
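For the case where the HTML format is not known in advance, a parser-based approach is one option. Below is a minimal sketch using Python's standard html.parser module; the class name ImgSrcCollector and the sample markup are made up for this example.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every img tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> are also routed here by default.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


collector = ImgSrcCollector()
collector.feed('<p>src="not-an-image"</p>'
               '<img src="a.png" border="0" />'
               "<img src='b.jpg'>")
print(collector.sources)
```

The text src="not-an-image" inside the paragraph is treated as plain character data and ignored, which is exactly the case the regex approaches struggle with.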
I'd like to expand on this topic, since the src attribute often comes unquoted. A regex that accepts both quoted and unquoted src attributes is:
src\s*=\s*"?(.+?)["\s]
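As a possible variation on the idea above (not the answerer's pattern), the three cases can be spelled out as separate alternatives: double-quoted, single-quoted and unquoted. The helper name find_src is hypothetical.

```python
import re

# One alternative per quoting style; exactly one group is set on a match.
src_re = re.compile(r'''src\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))''')


def find_src(tag):
    """Return the src value of the first src attribute in tag, or None."""
    m = src_re.search(tag)
    if m is None:
        return None
    return next(g for g in m.groups() if g is not None)


print(find_src('<img src="a b.png" border="0">'))
print(find_src("<img src='c.png'>"))
print(find_src("<img src=d.png border=0>"))
```

Treating each quoting style separately lets a quoted value contain spaces, while an unquoted value ends at the first whitespace or at the closing angle bracket.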