Nutch Domain Regular Expression - regex

I am following the tutorial here, trying to build a robot against a website.
I am in a page that contains all the product categories. Say it is www.example.com/allproducts.
After diving into each category, you can see the product list in a table format, and you can click the next page to loop through all the pages inside that category. Actually, you can only see links to pages 1, 2, 3, 4, 5 and the last page.
The first page in the category has a URL that looks like www.example.com/level1/level2/_/N-1, the second page looks like www.example.com/level1/level2/_/N-1/?No=100, and so on and so forth.
I don't have much Java programming experience, and I am wondering
whether I can crawl all the product list pages using Nutch and store the HTML for now,
and maybe later figure out a way to parse and index the HTML correctly.
(1) Can I just modify conf/regex-urlfilter.txt and replace
# accept anything else
+.
with something correct? (I just don't understand how could
+^http://([a-z0-9]*\.)*nutch.apache.org/
only restrict the URLs to the Nutch domain. I would read that regular expression as: between the double slash and nutch, there can be any characters that are alphanumeric, an asterisk, a backslash or a dot.)
How can I build the regular expression so that it only scrapes http://www.example.com/.../.../_/N-../...
(2) I can see the HTML is stored in the content folder inside the segment. However, when I open that file in vi, it looks like nonsense to me, and I am wondering if that is the so-called Java serialization, which I would need to deserialize in Java to read it.
Forgive me if those questions are too basic and thanks a lot for reading.

(1) Can I just modify conf/regex-urlfilter.txt and replace
Sure. You should replace +. with these lines:
#accept all products page
+www\.example\.com/allproducts
#accept categories pages
+www\.example\.com/level1/level2/_/N-
One important note about the regexes in this file: they are applied as partial matches. A rule like "+ab" means "accept every URL that contains ab", so it matches all of these URLs:
ab
abc
http://ab.com/c.html
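To make the partial-match behaviour concrete, here is a small Python sketch (Python's re.search is used as a stand-in for the substring search the answer describes; it is not Nutch code). It also unpacks the nutch.apache.org pattern from the question: the * is a quantifier and \. is a literal dot, so ([a-z0-9]*\.)* reads as "zero or more labels, each followed by a dot", i.e. optional subdomains. The dots in the host name are escaped here, which the original pattern does not do.

```python
import re

# A rule body such as "ab" is searched for anywhere in the URL.
rule = re.compile(r"ab")
for url in ["ab", "abc", "http://ab.com/c.html", "http://xyz.com/"]:
    print(url, "->", bool(rule.search(url)))

# Anchoring with ^ pins the pattern to the start of the URL.
# ([a-z0-9]*\.)* allows any number of "label." prefixes (subdomains).
domain_rule = re.compile(r"^http://([a-z0-9]*\.)*nutch\.apache\.org/")

print(bool(domain_rule.search("http://nutch.apache.org/")))            # True
print(bool(domain_rule.search("http://wiki.nutch.apache.org/page")))   # True
print(bool(domain_rule.search("http://example.com/nutch.apache.org/")))  # False
```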
By default, Nutch filters out URLs that contain ? (since these are mostly dynamic pages). To prevent this, comment out this line in your regex-urlfilter.txt file:
-[?*!@=]
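If it helps to see why that line matters, the following Python sketch imitates how such a rule file behaves: rules are tried top to bottom, the first rule whose pattern is found in the URL decides, and a URL that matches no rule is dropped. The functions parse_rules and accepts are hypothetical helpers written for this illustration, and "-[?=]" is a simplified stand-in for the default query-character rule, not the exact line from the file.

```python
import re

# Rules in regex-urlfilter.txt syntax: "+" accepts, "-" rejects, "#" is a comment.
ACCEPT_RULES = r"""
# accept the all-products page
+www\.example\.com/allproducts
# accept category pages
+www\.example\.com/level1/level2/_/N-
"""

# Simplified stand-in for the default rule that drops URLs with query characters.
QUERY_RULE = "-[?=]\n"


def parse_rules(text):
    """Turn rule text into a list of (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """First rule whose pattern is found in the URL wins; no match means reject."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


blocking = parse_rules(QUERY_RULE + ACCEPT_RULES)   # query rule still active
permissive = parse_rules(ACCEPT_RULES)              # query rule commented out

paged = "http://www.example.com/level1/level2/_/N-1/?No=100"
print(accepts(blocking, paged))    # False: the "?" is caught first
print(accepts(permissive, paged))  # True: the category rule matches
```

With the query rule active, only the first page of each category would get through, which is why the paging links need that line disabled.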
(2) I can see the HTML ...
Nutch saves the files in binary format. See https://stackoverflow.com/a/10150402/1881318

Related

KimonoLabs crawler Generated URL List with regex

So, I'm trying to crawl a website that has like 7,000 product pages and the link structure is like this:
https://example.com/category/sub-category/numericid-name-of-the-product/
What I'm trying to achieve is to Generate a URL list, the Kimono App has that option, and it actually sections the URL but I'm only offered default value, range, and custom list.
I tried to put in stuff like "/.+/" to match all the characters, but that does not work, and I couldn't find any help on that in the official KB.
I know that import.io had that "{alpahnumeric}", for example, for different parts of the URL so that it matches them. Is there a way to accomplish that in the KimonoLabs app?
Try this regex: https://example.com/([^/]+)/([^/]+)/([0-9]+)-([^/]+)
Note: you may need to escape some characters (namely / would be escaped as \/).
Also, I'm not familiar with KimonoLabs, so I don't know if this is what you're looking for exactly. Feel free to clarify.
Explanation
https://example.com/ literally
([^/]+)/ a bunch of not /s, followed by a /
([0-9]+)-([^/]+) Numbers followed by another bunch of not /s
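As a quick way to check the pattern outside of Kimono, here is a Python sketch that runs it against a URL shaped like the one in the question. The numeric id 1234 is made up for the example, and the dot in example.com is escaped here so that it only matches a literal dot.

```python
import re

# Four capture groups: category, sub-category, numeric id, product name.
product_url = re.compile(
    r"https://example\.com/([^/]+)/([^/]+)/([0-9]+)-([^/]+)"
)

# Hypothetical URL following the structure described in the question.
sample = "https://example.com/category/sub-category/1234-name-of-the-product/"

match = product_url.match(sample)
category, sub_category, product_id, slug = match.groups()
print(category, sub_category, product_id, slug)
```

Each [^/]+ stops at the next slash, which is what keeps one group from swallowing several path segments.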

How to match plain text URL in a markdown?

I'm currently trying to match all plain text links in a markdown text.
Example of the markdown text:
Dude, look at this url http://www.google.com .. it's a great search engine
I would like it to be converted into
Dude, look at this url <http://www.google.com> .. it's a great search engine
So in short, processing url should become <url>, but processing an existing <url> shouldn't become <<url>>. Also, the link in the markdown can be in the form of (url), so we'll have to avoid matching the normal brackets too.
So my working regex for matching the plain text url in Java is:
"[^(\\<|\\(](https?|ftp|file)://[-a-zA-Z0-9+&##/%?=~_|!:,.;]*[-a-zA-Z0-9+&##/%=~_|][^(\\>|\\)]",
with [^(\\<|\\(] and [^(\\>|\\)] to avoid matching the wrapping brackets.
But here lies one problem: I also do not want to match this kind of url:
[1]: http://slashdot.org
So, if the markdown text is
Dude, look at this url http://www.google.com .. it's a great search engine
[1]: http://slashdot.org
I want only http://www.google.com to be matched, but not the http://slashdot.org.
I wonder what pattern would meet these criteria?
What you have here is a parsing problem. Regexes are fine, but just using regexes here will make it a mess (supposing you achieve it). After you fix this problem, you'll probably find yourself facing other ones, like URL in code (between ` or in lines starting with tabs or four spaces) that you don't want to replace.
A solution would be to split into lines and then
detect patterns (for example ^\[\d+\]:\s+)
apply your replacements (for example this URL to link change) only on lines which don't match an incompatible pattern
That's the logic I use in this small pseudo-markdown parser that you can test here.
Note that there's always the option of using an existing, proven markdown parser; there are many of them.
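The line-by-line approach described above can be sketched in Python as follows. This is only an illustration under a few assumptions: wrap_bare_urls, REFERENCE_DEF and BARE_URL are names invented here, the URL pattern is deliberately simpler than the one in the question, and inline code spans between backticks are not handled.

```python
import re

# Lines such as "[1]: http://slashdot.org" are reference definitions: skip them.
REFERENCE_DEF = re.compile(r"^\[\d+\]:\s+")

# A bare URL: not directly preceded by "<", "(" or a word character,
# and not ending in sentence punctuation.
BARE_URL = re.compile(
    r"(?<![<(\w])(?:https?|ftp|file)://[^\s<>()]*[^\s<>().,;:!?]"
)


def wrap_bare_urls(markdown):
    """Wrap plain-text URLs in <...>, leaving incompatible lines untouched."""
    out = []
    for line in markdown.split("\n"):
        is_code = line.startswith(("\t", "    "))
        if REFERENCE_DEF.match(line) or is_code:
            out.append(line)
        else:
            out.append(BARE_URL.sub(r"<\g<0>>", line))
    return "\n".join(out)


text = (
    "Dude, look at this url http://www.google.com .. it's a great search engine\n"
    "[1]: http://slashdot.org"
)
print(wrap_bare_urls(text))
```

Because already-wrapped URLs are excluded by the lookbehind, running the function twice gives the same result as running it once.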

Regex select XML Element (containing hyphen) and inside content

I'm working with an enterprise CMS, and in order to properly create our weekly-updated dropdown menu without republishing our entire site, I have an XML document being generated that contains a number of useful XML elements. However, when pulling in a link with the CMS, the generated XML also outputs the link's contents (the entire HTML for the page). Needless to say, with roughly 50 items, the XML file is too big for use on the web (as it stands I think it's over 600KB). The element is <page-content>filler here</page-content>.
What I'm trying to do is use TextWrangler to find and replace all <page-content> tags as well as their containing content.
I've tried a few different regexes, but I can't seem to match the closing tag, so the match just trails on.
Here's what I've tried:
(<page-content>)(.*?)
The above will match up until the next starting <page-content> tag, which is not what I want.
(<page-content>)(.*?)(<\/page-content>)
(<page-content>)(.*?)(<\/page\-content>)
The above finds no matches, even though the below will find the 7 matches it should.
(<content>)(.*?)(<\/content>)
I don't know if there's a special way to deal with hyphens (I'm inexperienced in regular expressions), but if anyone could help me out, it would be greatly appreciated.
Thanks!
EDIT: Before you tell me that regex isn't meant to parse HTML, I know that, but there seems to be no other way for me to easily find and replace this. There are too many occurrences to manually delete them and save the file again every week.
It seems the problem is that your . is not matching newlines that exist between your open and close tags.
An easy solution for this would be to add the s flag in order for your . to match over newlines. TextWrangler appears to support inline modifiers (?s). You could do it like this:
(<page-content>)(?s)(.*?)(<\/page-content>)
More information on modifiers here.
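The effect of the s flag is easy to reproduce outside TextWrangler. The Python sketch below uses a made-up XML fragment; note that Python wants the inline (?s) at the very start of the pattern, whereas the TextWrangler example above places it mid-pattern.

```python
import re

# Hypothetical fragment: the page-content element spans several lines.
xml = """<item>
  <title>Menu entry</title>
  <page-content>
    <html>...entire page...</html>
  </page-content>
</item>"""

# Without the flag, "." stops at newlines, so nothing matches.
without_flag = re.findall(r"<page-content>.*?</page-content>", xml)

# With (?s), "." also matches newlines and the whole element is found.
with_flag = re.findall(r"(?s)<page-content>.*?</page-content>", xml)

# Replacing the match with an empty string removes the element and its content.
stripped = re.sub(r"(?s)<page-content>.*?</page-content>", "", xml)

print(len(without_flag), len(with_flag))  # 0 1
print(stripped)
```

The hyphen in page-content needs no special treatment; outside a character class it is an ordinary character.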

I am trying to create an expression that will extract URLs

I want to extract URLs from a webpage. These are just plain-text URLs, not hyperlinks. Some examples would be http://www.example.com, http://example.com and www.example.com. I am extremely new to regex, so I have copied and pasted about 20 expressions I found online, and all of them failed to work. I don't know whether I am doing it right. Any help would be really appreciated.
I wrote a post on using regex to locate links within an HTML page (the intent was to use JavaScript to open external links, or links to documents such as PDFs, in a popup window).
The final regex was:
^(?:[./]+)?(?:Assets|https?://(?!(?:www.)?integralist))
The full post is here:
http://www.integralist.co.uk/javascript/regular-expression-to-open-external-links-in-popup-window/
The solution won't be perfect but might help point you in the right direction.
Mark
You're probably not escaping your .s. You need to use \. for each one.
Take a look at strfriend.com. It has a URL example, and represents it graphically.
The example it suggests is:
^((ht|f)tp(s?)://|~/|/)?(\w+:\w+@)?([a-zA-Z]{1}([\w-]+.)+(\w{2,5}))(:\d{1,5})?((/?\w+/)+|/?)(\w+.\w{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?
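To illustrate the point about escaping dots, here is a short Python sketch. The URL pattern at the end is a deliberately simple assumption that covers only the three forms mentioned in the question; it is not a general-purpose URL matcher.

```python
import re

# An unescaped "." matches any character; "\." matches only a literal dot.
loose = re.compile(r"www.example")
strict = re.compile(r"www\.example")
print(bool(loose.search("wwwXexample")))   # True
print(bool(strict.search("wwwXexample")))  # False

text = "See http://www.example.com, http://example.com and www.example.com for details."

# Simplified: starts with a scheme or "www.", runs until whitespace or a comma.
URL = re.compile(r"(?:https?://|www\.)[^\s,]+")

# Strip a trailing full stop that belongs to the sentence, not the URL.
found = [u.rstrip(".") for u in URL.findall(text)]
print(found)
```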

How can I get only href value from link

I have many links in my page.
For example <a href="/promotions/download/schools/australia.aspx">Australia</a>
Now I want only the href with its value, i.e. (href="/promotions/download/schools/australia.aspx"), using a VBScript regular expression.
My regex would be something like:
href="([^"]*)"
Might need escaping in your context but that (or something very much like it) should work.
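For what it's worth, here is the same pattern tried out in Python rather than VBScript. The anchor markup is an assumption reconstructed from the href value in the question, and the second link is made up.

```python
import re

# Hypothetical page fragment with two links.
html = (
    '<a href="/promotions/download/schools/australia.aspx">Australia</a> '
    '<a href="/promotions/download/schools/india.aspx">India</a>'
)

# The capture group holds everything between the quotes after href=.
hrefs = re.findall(r'href="([^"]*)"', html)
print(hrefs)
```

Note that this only finds double-quoted href attributes; single-quoted or unquoted values would need a different pattern.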
Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). Luckily, you should have access to the best parser available: the web browser. Modern browsers create a Document Object Model, which is a tree structure that contains all of the information about the page. One of the collections the DOM exposes is links. I don't really know VBScript, but this code looks like it should work:
' document.links is zero-based, so the last valid index is length - 1
For i = 0 To document.links.length - 1
    document.write(document.links(i).href & "<BR>")
Next
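Outside a browser, the same "let a parser do it" idea can be sketched with Python's standard html.parser module. This swaps the browser DOM for a different tool, so treat it as an analogy rather than a translation of the VBScript above; LinkCollector and extract_hrefs are names invented for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Hypothetical markup; quoting style and letter case do not matter to the parser.
page = (
    '<a href="/promotions/download/schools/australia.aspx">Australia</a>'
    "<A HREF='/b.aspx'>B</A>"
    '<a name="top">no href here</a>'
)
print(extract_hrefs(page))
```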