Flickr photo URL replacement regular expression

I'm trying to write a regular expression to modify URLs stored in a database (linking to photos on Flickr) so I can change the size of photos already on a site.
E.g. replace 4724575242_ca7d120609.jpg with 4724575242_ca7d120609_z.jpg in a URL such as:
http://farm2.static.flickr.com/1045/4724575242_ca7d120609.jpg
The only change is to add _z before the .jpg extension.
I imagined that a regular expression could be written which matches static.flickr.com and then replaces .jpg with _z.jpg, but unfortunately my attempts have so far failed.
I wondered if any regex ninjas out there might be able to help me with this?
Any help would be greatly appreciated - David

Maybe this one (adapting it to your language, of course)?
find: static\.flickr\.com\/([a-z0-9_\/]+)\.jpg$
replace with: static.flickr.com/$1_z.jpg
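For example, in Python (just as an illustration; the variable names and the sample URL handling here are assumptions, not part of your setup), the substitution could look like this:
import re

url = "http://farm2.static.flickr.com/1045/4724575242_ca7d120609.jpg"
# Insert "_z" before the ".jpg" extension, but only for static.flickr.com URLs.
new_url = re.sub(r'(static\.flickr\.com/.+?)\.jpg$', r'\1_z.jpg', url)
print(new_url)  # http://farm2.static.flickr.com/1045/4724575242_ca7d120609_z.jpg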

Related

Regular expression to exclude part of a url?

I'm trying to create a regular expression for Google Analytics goals.
I need to match either of these 2 url fragments:
/order/map/egw/?code=somevalue
or
/order/map/egw/
But NOT this url:
/order/map/egw/consult/
I tried this:
/order/map/egw/$ | /order/map/egw/\?
and other variations, but I can't get it to match properly.
Fast help greatly appreciated!
How about this regular expression?
/order/map/egw/(?!consult).*
If in the future you find that there's another sub-directory that you don't want to include, you can add a new one (e.g. the sub-directory 'wrong') like so:
/order/map/egw/(?!consult|wrong).*
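If you want to sanity-check the lookahead outside Analytics (whose own matching rules may differ slightly from a plain regex engine), a quick Python sketch using the URLs from the question would be:
import re

pattern = re.compile(r'/order/map/egw/(?!consult).*')
for url in ['/order/map/egw/?code=somevalue', '/order/map/egw/', '/order/map/egw/consult/']:
    # search() matches the pattern anywhere in the string, like a partial match
    print(url, bool(pattern.search(url)))
# Prints True, True, False for the three URLs above.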
What about this? I don't know how strict you're trying to be but it should work for your use cases:
(?!.*consult)/order/map/egw/(\?.+)?
It ensures "consult" is not found in the URL and matches the base part with an optional query string.

Nutch Domain Regular Expression

I am following the tutorial here, trying to build a crawler for a website.
I am on a page that contains all the product categories. Say it is www.example.com/allproducts.
After diving into each category, you can see the product list in a table format, and you can click the next page link to loop through all the pages inside that category. (Actually you can only see pages 1, 2, 3, 4, 5 and the last page.)
The first page in a category has a URL that looks like www.example.com/level1/level2/_/N-1, the second page looks like www.example.com/level1/level2/_/N-1/?No=100, and so on.
I personally don't have much Java programming experience, and I am wondering whether I can crawl all the product list pages using Nutch and store the HTML for now, and maybe later figure out a way to parse the HTML/index correctly.
(1) Can I just modify conf/regex-urlfilter.txt and replace
# accept anything else
+.
with something correct? (I just don't understand how
+^http://([a-z0-9]*\.)*nutch.apache.org/
could restrict URLs to only the Nutch domain... I read that regular expression as saying that between the double slash and nutch there can be any characters that are alphanumeric, or an asterisk, backslash, or dot.)
How can I build the regular expression so it only scrapes http://www.example.com/.../.../_/N-../... ?
(2) I can see the HTML is stored in the content folder inside the segment... However, when I open that file in vi, it looks like complete nonsense to me, and I am wondering if that is the so-called Java serialization, which I would need to deserialize in Java to read.
Forgive me if those questions are too basic and thanks a lot for reading.
(1) Can I just modify conf/regex-urlfilter.txt and replace
Sure. You should replace +. with these lines:
# accept the all-products page
+www\.example\.com/allproducts
# accept category pages
+www\.example\.com/level1/level2/_/N-
One important note about the regexes in this file: the regular expressions are matched partially. So a rule like "+ab" means: accept every URL that contains "ab", so it matches these URLs:
ab
abc
http://ab.com/c.html
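To see that partial-match behaviour in isolation, here is a rough Python analogy (Nutch itself uses Java regexes, so this is only meant to illustrate the idea, not how Nutch runs the filter internally):
import re

rule = re.compile(r'ab')  # the pattern part of a "+ab" rule
for url in ['ab', 'abc', 'http://ab.com/c.html', 'http://example.com/']:
    # search() succeeds if the pattern occurs anywhere in the URL, as the filter does
    print(url, 'accepted' if rule.search(url) else 'rejected')
# The first three URLs are accepted, the last one is rejected.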
By default, Nutch filters out URLs containing ? (since they are mostly dynamic pages). To prevent this, comment out this line in your regex-urlfilter.txt file:
-[?*!#=]
(2) I can see the HTML ...
Nutch saves the files in binary format. See https://stackoverflow.com/a/10150402/1881318
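For a quick look at what was fetched, you can also dump a segment to plain text with Nutch's segment reader; something along these lines (the segment directory name is just a placeholder for whichever one Nutch created in your crawl):
bin/nutch readseg -dump crawl/segments/20130101000000 segment_dump
The dump it writes should be readable text, including the fetched content for each URL.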

search & replace wordpress video shortcode with plain URL using regular expressions

I am transferring a friend's wordpress.com blog to a self-hosted install on my server. The problem is, he has many videos embedded in his blog using a shortcode plugin that is not necessary on WordPress 3 (you only need to paste the plain URL to embed videos from YouTube, Vimeo, etc.).
I've found a Search Regex plugin that will search & replace using regular expressions, but I am unfamiliar with regex myself. How might I capture the URL in a shortcode such as [youtube="URL"] and replace it with just the URL?
Thanks for any help you can provide!!
-Jenny
Are you trying to go from "[youtube=http://www.youtube.com/watch?v=JaNH56Vpg-A]" to http://www.youtube.com/watch?v=JaNH56Vpg-A?
This works as long as there is whitespace between the different URLs.
find: \[youtube=(\S*)\]
replace with: $1
It's difficult to replace every service at once, since their shortcodes seem to differ. For Vimeo this would work; it allows any amount of whitespace between "vimeo" and the URL, and it again needs whitespace after the closing "]".
find: \[vimeo\s+(\S*)\]
replace with: $1
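Search Regex runs these as PHP/PCRE patterns, but if you would like to test the substitutions outside WordPress first, a rough Python equivalent (with a made-up sample post) would be:
import re

post = 'Intro [youtube=http://www.youtube.com/watch?v=JaNH56Vpg-A] middle [vimeo http://vimeo.com/12345678] end.'
post = re.sub(r'\[youtube=(\S*)\]', r'\1', post)  # [youtube=URL] -> URL
post = re.sub(r'\[vimeo\s+(\S*)\]', r'\1', post)  # [vimeo URL]   -> URL
print(post)  # Intro http://www.youtube.com/watch?v=JaNH56Vpg-A middle http://vimeo.com/12345678 end.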
Maybe there's a more robust way to write the expression (one which validates the correct syntax), but this one is pretty straightforward.
The actual regex syntax depends on the language used. Hope this helps.

Perl/lighttpd regex

I'm using regex in lighttpd to rewrite URLs, but I can't write an expression that does what I want (which I thought was pretty basic; apparently not, so I'm probably missing something).
Say I have this URL: /page/variable_to_pass or /page/variable_to_pass/
I want to rewrite the URL to this: /page.php?var=variable_to_pass
I've already got rules like ^/login/(.*?)$ to handle specific pages, but I wanted to make one that can match any page without needing one expression per page.
I tried this: ^/([^.?]*) but it matches the whole /page/variable_to_pass/ instead of just page.
Any help is appreciated, thanks!
This regexp should do what you need:
/([^\/]+)/(.+)
The first match would be the page name, and the second the variable value.
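For a quick sanity check outside lighttpd, you can test an anchored variant of that pattern in, say, Python (the optional trailing slash is handled explicitly here so it does not leak into the second capture):
import re

m = re.match(r'^/([^/]+)/([^/]+)/?$', '/page/variable_to_pass/')
if m:
    print(m.group(1), m.group(2))  # page variable_to_pass
In lighttpd, a pattern of that shape would then go into a url.rewrite-once rule targeting /$1.php?var=$2, assuming that is the layout you want.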
Try:
/([^.?]+)/([^.?]+)/
That should give you two matches.

I am trying to create an expression that will extract URLs

I want to extract URLs from a webpage. These are just URLs by themselves, not hyperlinks etc.; they are just text. Some examples would be http://www.example.com, http://example.com, www.example.com, etc. I am extremely new to regex, so I have copied and pasted about 20 expressions found online, and all of them failed to work. I don't know if I am doing it right or not. Any help would be really appreciated.
I wrote a post on using regex to locate links within an HTML page (the intent was to use JavaScript to open external links, or links to documents such as PDFs, in a popup window).
The final regex was:
^(?:[./]+)?(?:Assets|https?://(?!(?:www.)?integralist))
The full post is here:
http://www.integralist.co.uk/javascript/regular-expression-to-open-external-links-in-popup-window/
The solution won't be perfect, but it might help point you in the right direction.
Mark
You're probably not escaping your .s. You need to use \. for each one.
Take a look at strfriend.com. It has a URL example, and represents it graphically.
The example it suggests is:
^((ht|f)tp(s?)://|~/|/)?(\w+:\w+#)?([a-zA-Z]{1}([\w-]+.)+(\w{2,5}))(:\d{1,5})?((/?\w+/)+|/?)(\w+.\w{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?
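If you only need to pull plain-text URLs like the ones in your examples out of a page, a much smaller pattern is often enough; here is a rough Python sketch (the pattern is deliberately loose and the sample text is made up):
import re

text = 'Visit http://www.example.com or https://example.com/page, and also www.example.com for details.'
# Matches http(s):// URLs and bare www. hosts; stops at whitespace or a comma.
pattern = re.compile(r'(?:https?://|www\.)[^\s,]+')
print(pattern.findall(text))
# ['http://www.example.com', 'https://example.com/page', 'www.example.com']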