Program to generate regex easily? - regex

Let's say I have a url such as...
http://www.example.com/random-garbage-here-i-dont-want-12392/video2983439
Is there a program where I can just put this test string in, highlight/select the parts I want to keep, then get rid of the rest and turn it into a regex expression to use? I just can't figure out regex for the life of me.
I am trying to scrape URLs on a website but they are all unique except for a few consistent characteristics. The consistent characteristics are highlighted in bold above that I want to keep, while ignoring all the non-bold...that way when I'm crawling the website it will follow URLs that are similar to the bolded parts.

The following code worked for me in TCL
% regexp -- {http://www.example.com/[a-zA-Z0-9-]*/video[0-9]*} http://www.example.com/random-garbage-here-i-dont-want-12392/video2983439
1
%
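For comparison, here is a minimal Python sketch of the same idea, in case the crawler is not written in Tcl. The fixed parts of the URL stay literal and the varying parts become character classes. The dots are escaped here so they only match a literal dot, and the names VIDEO_URL and is_video_url are made up for this illustration.

```python
import re

# Fixed parts of the URL are written literally; the parts that vary are
# replaced by character classes (the same classes as in the Tcl pattern).
VIDEO_URL = re.compile(r"http://www\.example\.com/[a-zA-Z0-9-]*/video[0-9]*")


def is_video_url(url):
    """Return True if the whole URL has the /<anything>/video<digits> shape."""
    return VIDEO_URL.fullmatch(url) is not None


print(is_video_url("http://www.example.com/random-garbage-here-i-dont-want-12392/video2983439"))  # True
print(is_video_url("http://www.example.com/about"))  # False
```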

Related

Regular expression for finding embedded javascript urls

First of all, sorry for the question name. Regex problems are hard to name.
I'm building a program for code-reviewing JavaScript files. The approach is black-box, so all we get is, for example, the HTML code from a web page.
The idea is to find all the JavaScript files referenced in the code and then analyze them with some tool.
I'm having some issues with finding the JavaScript files, mainly because every web page is a bit different, so coming up with something that works for all of them is complicated.
I have found solutions for the following cases.
Case I
text = '"somenameforafile.js"'
js_found = re.findall('"(.+?).js"', text)
Case II
text = '"https://somenameforafile.js"'
js_found_2 = re.findall('"https://(.+?).js"', text)
In case II I can catch things like s3.amazonaws.bucketname with some further filtering.
The problem is that I'm also finding things like the following (the .js URL is at the end):
setTimeout(ld,100)}a.P(1);var j="appendChild",h="createElement",k="src",n=d[h]("div"),v=n[j](d[h](z)),b=d[h]("iframe"),g="document",e="domain",o;n.style.display="none";m.insertBefore(n,m.firstChild).id=z;b.frameBorder="0";b.id=z+"-loader";if(/MSIE[ ]+6/.test(navigator.userAgent)){b.src="javascript:false"}b.allowTransparency="true";v[j](b);try{b.contentWindow[g].open()}catch(w){c[e]=d[e];o="javascript:var d="+g+".open();d.domain='"+d.domain+"';";b[k]=o+"void(0);"}try{var t=b.contentWindow[g];t.write(p());t.close()}catch(x){b[k]=o+'d.write("'+p().replace(/"/g,String.fromCharCode(92)+'"')+'");d.close();'}a.P(2)};ld()};nt()})({loader: "static.olark.com/jsclient/loader0.js",name:"olark",methods:["configure","extend","declare","identify"]});
Expected Output:
static.olark.com/jsclient/loader0.js
That could be handled by approach I; the problem is that I get basically all the text with that approach. Is there any easy way to extract URLs that are embedded this deeply in arbitrary text?
You could use a negated character class, [^\s"]+, to match one or more characters that are neither whitespace nor a double quote, and capture that in group 1.
Then match the js part with \.js\b: escape the dot, and add a word boundary after js to prevent it from being part of a larger word.
([^\s"]+)\.js\b
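As a rough sketch, this is how that pattern could be applied with Python's re module, since the question uses re.findall. The sample text is a shortened piece of the minified snippet from the question, and the names JS_URL and found are only illustrative. Because the capture group stops before the extension, ".js" is appended again afterwards.

```python
import re

# Group 1: a run of characters that are neither whitespace nor a double quote,
# followed by a literal ".js" and a word boundary.
JS_URL = re.compile(r'([^\s"]+)\.js\b')

# Shortened sample taken from the end of the minified snippet in the question.
text = 'nt()})({loader: "static.olark.com/jsclient/loader0.js",name:"olark"});'

# The group excludes the extension, so add it back to get the full file name.
found = [m.group(1) + ".js" for m in JS_URL.finditer(text)]
print(found)  # ['static.olark.com/jsclient/loader0.js']
```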

KimonoLabs crawler Generated URL List with regex

So, I'm trying to crawl a website that has like 7,000 product pages and the link structure is like this:
https://example.com/category/sub-category/numericid-name-of-the-product/
What I'm trying to achieve is to Generate a URL list, the Kimono App has that option, and it actually sections the URL but I'm only offered default value, range, and custom list.
I tried to put in stuff like "/.+/" to match all the chars, but that does not work, and I couldn't find any help on this in the official knowledge base.
I know that import.io had that "{alpahnumeric}", for example, for different parts of the URL so that it matches them. Is there a way to accomplish that in the KimonoLabs app?
Try this regex: https://example.com/([^/]+)/([^/]+)/([0-9]+)-([^/]+)
Note: you may need to escape some characters (namely / would be escaped as \/).
Also, I'm not familiar with KimonoLabs, so I don't know if this is what you're looking for exactly. Feel free to clarify.
Explanation
https://example.com/ literally
([^/]+)/ a bunch of not /s, followed by a /
([0-9]+)-([^/]+) Numbers followed by another bunch of not /s
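To see what each group captures, here is a small Python sketch (not KimonoLabs-specific). The product URL and the id 12345 are made up to follow the structure from the question, and the dot in the host name is escaped so it only matches a literal dot. In Python the slashes do not need escaping.

```python
import re

# Same pattern as above, with the dot in example.com escaped.
PRODUCT_URL = re.compile(r"https://example\.com/([^/]+)/([^/]+)/([0-9]+)-([^/]+)")

# Hypothetical URL that follows the link structure from the question.
url = "https://example.com/category/sub-category/12345-name-of-the-product/"

match = PRODUCT_URL.match(url)
category, sub_category, product_id, name = match.groups()
print(category, sub_category, product_id, name)
# category sub-category 12345 name-of-the-product
```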

Nutch Domain Regular Expression

I am following the tutorial here, trying to build a robot against a website.
I am in a page that contains all the product categories. Say it is www.example.com/allproducts.
After diving into each category, you can see the product list in table format, and you can click through to the next page to loop over all the pages in that category. Actually, only the links for pages 1, 2, 3, 4, 5 and the last page are shown.
The first page in a category has a URL that looks like www.example.com/level1/level2/_/N-1, the second page looks like www.example.com/level1/level2/_/N-1/?No=100, and so on and so forth.
I personally don't have much Java programming experience, and I am wondering whether
I can crawl all the product list pages using Nutch and store the HTML for now,
and maybe later figure out a way to parse/index the HTML correctly.
(1) Can I just modify conf/regex-urlfilter.txt and replace
# accept anything else
+.
with something correct? (I just don't understand how could
+^http://([a-z0-9]*\.)*nutch.apache.org/
only restrict the URLs to those inside the Nutch domain... I would interpret that regular expression to mean that between the double slash and nutch there could be any characters that are alphanumeric, or an asterisk, backslash or dot.)
How can I build the regular expression so that it only scrapes http://www.example.com/.../.../_/N-../...
(2) I can see the HTML is stored in the content folder inside the segment... However, when I open that file in vi, it looks like total nonsense to me, and I am wondering if that is the so-called Java serialization, which I would need to deserialize in Java to read it.
Forgive me if those questions are too basic and thanks a lot for reading.
(1) Can I just modify conf/regex-urlfilter.txt and replace
Sure. You should replace +. with these lines:
#accept all products page
+www\.example\.com/allproducts
#accept categories pages
+www\.example\.com/level1/level2/_/N-
One important note about regexes in this file: the regular expressions are matched partially. So if you write a rule like "+ab", it means: accept all URLs that contain "ab". It therefore matches all of these URLs:
ab
abc
http://ab.com/c.html
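To make the partial-match behaviour concrete, here is a small Python illustration. It is not Nutch's actual code, just a sketch of the semantics under two assumptions: the first rule that matches decides, and a URL that matches no rule is rejected. The function name accepts and the RULES list are made up for this example.

```python
import re

# Hypothetical rule list modelled on the lines above: '+' accepts, '-' rejects.
RULES = [
    r"+www\.example\.com/allproducts",
    r"+www\.example\.com/level1/level2/_/N-",
]


def accepts(url, rules=RULES):
    """Return True if the first rule found anywhere in the URL is a '+' rule."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        # re.search looks for the pattern anywhere in the URL (partial match).
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is rejected


print(accepts("http://www.example.com/level1/level2/_/N-1/?No=100"))  # True
print(accepts("http://www.example.com/about"))                        # False
print(accepts("http://ab.com/c.html", ["+ab"]))                       # True
```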
By default, Nutch filters out URLs containing ? (since they are mostly dynamic pages). To prevent this, comment out this line in your regex-urlfilter.txt file:
-[?*!@=]
(2) I can see the HTML ...
Nutch saves the files in binary format. See https://stackoverflow.com/a/10150402/1881318

How to match plain text URL in a markdown?

I'm currently trying to match all plain text links in a markdown text.
Example of the markdown text:
Dude, look at this url http://www.google.com .. it's a great search engine
I would like it to be converted into
Dude, look at this url <http://www.google.com> .. it's a great search engine
So in short, processing url should produce <url>, but processing an existing <url> shouldn't produce <<url>>. Also, a link in the markdown can be in the form (url), so we'll have to avoid matching the normal brackets too.
So my working regex for matching the plain text URL in Java is:
"[^(\\<|\\(](https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|][^(\\>|\\)]",
with [^(\\<|\\(] and [^(\\>|\\)] to avoid matching the wrapping brackets.
But here lies one problem: I also do not want to match this kind of URL:
[1]: http://slashdot.org
So, if the markdown text is
Dude, look at this url http://www.google.com .. it's a great search engine
[1]: http://slashdot.org
I want only http://www.google.com to be matched, but not the http://slashdot.org.
What pattern would meet these criteria?
What you have here is a parsing problem. Regexes are fine, but using only regexes here will make it a mess (supposing you manage it at all). After you fix this problem, you'll probably find yourself facing other ones, like URLs in code (between ` or in lines starting with tabs or four spaces) that you don't want to replace.
A solution would be to split into lines and then
detect patterns (for example ^\[\d+\]:\s+)
apply your replacements (for example this URL-to-link change) only on lines which don't match an incompatible pattern
That's the logic I use in this small pseudo-markdown parser that you can test here.
Note that there's always the option of using an existing, proven markdown parser; there are many of them.
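The line-by-line approach described above could be sketched like this in Python. This is only an illustration, not the parser mentioned in the answer: the names REFERENCE_LINE, BARE_URL and autolink are made up, the URL pattern is deliberately simplified, and code spans or indented code blocks are not handled.

```python
import re

# Lines such as "[1]: http://slashdot.org" are reference definitions: skip them.
REFERENCE_LINE = re.compile(r"^\[\d+\]:\s+")

# A simplified bare URL: not directly preceded by "<" or "(", and not ending
# in punctuation that more likely belongs to the surrounding sentence.
BARE_URL = re.compile(r"(?<![<(])\b((?:https?|ftp|file)://[^\s<>()]*[^\s<>().,;:!?])")


def autolink(markdown):
    """Wrap bare URLs in <...>, leaving reference-definition lines untouched."""
    out = []
    for line in markdown.split("\n"):
        if REFERENCE_LINE.match(line):
            out.append(line)  # incompatible pattern: no replacement here
        else:
            out.append(BARE_URL.sub(r"<\1>", line))
    return "\n".join(out)


text = (
    "Dude, look at this url http://www.google.com .. it's a great search engine\n"
    "[1]: http://slashdot.org"
)
print(autolink(text))
```

With the sample text from the question, only http://www.google.com gets wrapped. The [1]: line is passed through unchanged, and URLs already written as <url> or (url) are left alone by the lookbehind.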

Regular Expression to match a specific URL broken up by arbitrary characters

I run a Django-based forum (the framework is probably not important to the question, but still) and it has been increasingly getting spammed with posts that link to a specific website constantly (www.solidwoodkitchen.co.uk - these people are apparently the worst).
I've implemented a string blocking system that stops them posting to the forum if the URL of the website is included in the post, but as spam bots usually do, it has figured out a way around that by breaking up the URL with other characters (eg. w_w_w.s*olid_wood*kit_ch*en._*co.*uk .). So a couple of questions:
Is it even possible to build a regex capable of finding the specific URL within a block of text even when it has been modified like that?
If it is, would this cause a performance hit?
Description
You could break the url into a string of characters, then join them together with [^a-z0-9]*?. So in this case with www.solidwoodkitchen.co.uk the resulting regex would look like:
w[^a-z0-9]*?w[^a-z0-9]*?w[^a-z0-9]*?[.][^a-z0-9]*?s[^a-z0-9]*?o[^a-z0-9]*?l[^a-z0-9]*?i[^a-z0-9]*?d[^a-z0-9]*?w[^a-z0-9]*?o[^a-z0-9]*?o[^a-z0-9]*?d[^a-z0-9]*?k[^a-z0-9]*?i[^a-z0-9]*?t[^a-z0-9]*?c[^a-z0-9]*?h[^a-z0-9]*?e[^a-z0-9]*?n[^a-z0-9]*?[.][^a-z0-9]*?c[^a-z0-9]*?o[^a-z0-9]*?[.][^a-z0-9]*?u[^a-z0-9]*?k
This would basically search for the entire string of characters, separated by zero or more non-alphanumeric characters.
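Rather than typing that regex by hand, it could be generated from the URL. A minimal Python sketch, where the helper name obfuscation_pattern is made up for this example; re.escape produces \. for the dots instead of [.], which matches the same thing:

```python
import re


def obfuscation_pattern(url):
    """Join the characters of url with a lazy 'any non-alphanumerics' separator."""
    separator = "[^a-z0-9]*?"
    body = separator.join(re.escape(ch) for ch in url)
    # IGNORECASE so that mixed-case variants of the URL are caught too.
    return re.compile(body, re.IGNORECASE)


SPAM = obfuscation_pattern("www.solidwoodkitchen.co.uk")

print(bool(SPAM.search("buy at w_w_w.s*olid_wood*kit_ch*en._*co.*uk . today")))  # True
print(bool(SPAM.search("a perfectly normal forum post")))                        # False
```

Note that, like the regex above, this still requires the dots themselves to be present in the post.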
Or you could take the input text, strip out all punctuation, and then simply search for wwwsolidwoodkitchencouk.
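A sketch of that second approach in Python, with made-up function names. One assumption to flag: this version removes every character that is not a letter or digit, so whitespace is dropped as well as punctuation. That catches URLs broken up by spaces, at the cost of possible false positives across word boundaries.

```python
import re


def normalize(text):
    """Lower-case the text and drop everything that is not a letter or digit."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def mentions_spam_site(post, needle="wwwsolidwoodkitchencouk"):
    """Return True if the normalized post contains the normalized URL."""
    return needle in normalize(post)


print(mentions_spam_site("w_w_w.s*olid_wood*kit_ch*en._*co.*uk ."))  # True
print(mentions_spam_site("solid wood kitchen ideas"))                # False
```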