KimonoLabs crawler: generating a URL list with regex

So, I'm trying to crawl a website that has around 7,000 product pages, and the link structure is like this:
https://example.com/category/sub-category/numericid-name-of-the-product/
What I'm trying to achieve is to generate a URL list. The Kimono app has that option, and it actually sections the URL, but I'm only offered default value, range, and custom list.
I tried putting in things like "/.+/" to match all the characters, but that doesn't work, and I couldn't find any help on it in the official knowledge base.
I know that import.io had "{alphanumeric}", for example, for the different parts of the URL so it matches them. Is there a way to accomplish that in the KimonoLabs app?

Try this regex: https://example.com/([^/]+)/([^/]+)/([0-9]+)-([^/]+)
Note: you may need to escape some characters (namely / would be escaped as \/).
Also, I'm not familiar with KimonoLabs, so I don't know if this is what you're looking for exactly. Feel free to clarify.
Explanation
https://example.com/ literally
([^/]+)/ a bunch of not /s, followed by a /
([0-9]+)-([^/]+) Numbers followed by another bunch of not /s
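If you want to sanity-check the pattern outside of Kimono, here is a minimal Python sketch (the sample URL is made up to follow the structure described in the question, and the dots are escaped here for strictness):
import re

pattern = re.compile(r'https://example\.com/([^/]+)/([^/]+)/([0-9]+)-([^/]+)')
# hypothetical product URL following the described structure
url = "https://example.com/category/sub-category/12345-name-of-the-product/"
match = pattern.match(url)
if match:
    print(match.groups())  # ('category', 'sub-category', '12345', 'name-of-the-product')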

Related

Regex for multiple URLs without clear pattern

I'm quite new to using regex, so I hope there's someone who can help me out. I want to set up an event in Google Tag Manager through regex that fires whenever someone views a page. I'm trying to do this using the Page URL as a parameter so that the event fires when that URL is visited. It's for around 1,400 URLs that are in the same sub-folder but have different page names. For example: https://www.example.com/products/product-name-1, https://www.example.com/products/product-name-2
What would be the best way to group these into one regex formula?
I've tried to separate all the URLs using the '|' sign, without any result. I've also tried this format, without any luck: (^/page-url-1/$|^/page-url-1/$|^/page-url-1/$|^/page-url-1/$)
A couple of things are happening with your attempt. First, you aren't escaping the '/'. This is a reserved or special character, and you need to precede it with a \ to tell the engine that you want that literal character. It would look like this:
\/products\/page-url-1
I am assuming you are using {{Page Path}}, so the above would match any path that contains /products/page-url-1.
If you want the event to fire on all pages within the /products directory, there is an easier way of doing this.
\/products\/.*
What this will do is match any page within your /products directory. If you have a landing page on /products itself, it will be omitted from the firing. The '.' means it will match any character after the /, and the '*' means it can do this an unlimited number of times.
EDIT:
Since you aren't looking for all the products pages, you can use a matching group and list them all. I suspect that all the product names are different enough and don't share any common path elements, so you will have to list out the ones you want.
\/products\/(product-url-1|product-url-2|product-url-3).*
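As a quick sanity check outside of GTM, here is a small Python sketch with hypothetical page paths (the escaped slashes aren't required in Python, but they do no harm):
import re

pattern = re.compile(r'\/products\/(product-url-1|product-url-2|product-url-3).*')
paths = [
    "/products/product-url-1",   # listed in the group, should fire
    "/products/product-url-9",   # not in the group, should not fire
    "/products/",                # landing page, should not fire
]
for path in paths:
    print(path, bool(pattern.search(path)))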

Google Analytics regex excluding a certain URL in a sub-folder

Currently on my GA account I have the following URLs from our website tracked:
domain/contact-us/
domain/contact-us/global-contact-list.aspx
domain/contact-us/contactlist.aspx
The first two are from our new website, which we want to track; the last one is from our old website (traffic is still being tracked, but we do not want to use it).
I tried using a regex filter on this as the following:
(^/contact-us/global-contact-list\.aspx)|(^/contact-us/)
Reading up, I believe this looks for matches of exactly:
/contact-us/global-contact-list or /contact us/ but would disallow /contact-us/contactlist/
For some reason, the last one is still coming through. Can someone please explain why this may be happening?
You need to add a negative lookahead or an end-of-string anchor:
(^/contact-us/global-contact-list\.aspx)|(^/contact-us/$)
or
(^/contact-us/global-contact-list\.aspx)|(^/contact-us/(?!contactlist/))
This way, you will exclude /contact-us/contactlist/ from matching.
BTW, /contact us/ will not pass since (^/contact-us/) only allows a hyphen. You should add a space, e.g. (^/contact-us/global-contact-list\.aspx)|(^/contact[-\s]us/$).
Also, (^/contact-us/global-contact-list\.aspx) won't match /contact-us/global-contact-list because it needs to match .aspx.
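If you want to check the first variant quickly, here is a minimal Python sketch using the paths from the question (the anchors and the alternation behave the same way here as in the filter):
import re

pattern = re.compile(r'(^/contact-us/global-contact-list\.aspx)|(^/contact-us/$)')
paths = [
    "/contact-us/",                          # should match
    "/contact-us/global-contact-list.aspx",  # should match
    "/contact-us/contactlist.aspx",          # should NOT match
]
for path in paths:
    print(path, bool(pattern.search(path)))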

Nutch Domain Regular Expression

I am following the tutorial here, trying to build a crawler against a website.
I am in a page that contains all the product categories. Say it is www.example.com/allproducts.
After diving into each category, you can see the product list in a table format, and you can click the next page to loop through all the pages inside that category. Actually you can only see pages 1, 2, 3, 4, 5, and the last page.
The first page in a category has a URL that looks like www.example.com/level1/level2/_/N-1, then the second page looks like www.example.com/level1/level2/_/N-1/?No=100, and so on and so forth.
I personally don't have that much Java programming experience, and I am wondering:
can I crawl all the product list pages using Nutch and store the HTML for now,
and maybe later figure out a way to parse the HTML/index correctly?
(1) Can I just modify conf/regex-urlfilter.txt and replace
# accept anything else
+.
with something correct? (I just don't understand how
+^http://([a-z0-9]*\.)*nutch.apache.org/
could restrict the URLs to only the Nutch domain... I read that regular expression as saying that between the double slash and nutch there could be any characters that are alphanumeric, or an asterisk, backslash, or dot.)
How can I build the regular expression so it only scrapes http://www.example.com/.../.../_/N-../...?
(2) I can see the HTML is stored in the content folder inside the segment... However, when I open that file in vi, it just looks like nonsense to me... and I am wondering if that is the so-called Java serialization which I need to deserialize in Java in order to read it.
Forgive me if those questions are too basic and thanks a lot for reading.
(1) Can I just modify conf/regex-urlfilter.txt and replace
Sure. You should replace +. with these lines:
#accept all products page
+www\.example\.com/allproducts
#accept categories pages
+www\.example\.com/level1/level2/_/N-
One important note about the regex in this file: the regular expressions are partially matched. So if you write a rule like "+ab" it means: accept all URLs that contain "ab", so it matches these URLs:
ab
abc
http://ab.com/c.html
By default, Nutch filters out URLs containing ? (since they are mostly dynamic pages). To prevent this, comment out this line in your regex-urlfilter.txt file:
-[?*!#=]
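As a rough illustration of the partial-match behaviour described above, here is a Python sketch (it only models the two + rules and assumes the -[?*!#=] line has been commented out; it is not Nutch's actual Java filter code, and the URLs are hypothetical):
import re

rules = [
    r'www\.example\.com/allproducts',         # all-products page
    r'www\.example\.com/level1/level2/_/N-',  # category pages, paginated or not
]
urls = [
    "http://www.example.com/allproducts",
    "http://www.example.com/level1/level2/_/N-1",
    "http://www.example.com/level1/level2/_/N-1/?No=100",
    "http://www.example.com/something-else",
]
for url in urls:
    accepted = any(re.search(rule, url) for rule in rules)
    print(url, "accepted" if accepted else "rejected")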
(2) I can see the HTML ...
Nutch saves the files in binary format. See https://stackoverflow.com/a/10150402/1881318

RegEx filter links from a document

I am currently learning regex, and I am trying to filter all links (e.g. http://www.link.com/folder/file.html) from a document with Notepad++. Actually I want to delete everything else so that in the end only the http links are listed.
So far I tried this : http\:\/\/www\.[a-zA-Z0-9\.\/\-]+
This gives me all the links, which is fine, but how do I delete the remaining stuff so that in the end I have a neat list of all the links?
If I try to replace it with nothing followed by \1, obviously the link will be deleted, but I want the exact opposite: to have everything else deleted.
So it should be something like:
- find a string of numbers, letters and special signs until "http"
- delete what you found
- and keep searching for more numbers, letters and special signs after "html"
- and delete that again
Any ideas? Thanks so much.
In Notepad++, in the Replace menu (CTRL+H) you can do the following:
Find: .*?(http\:\/\/www\.[a-zA-Z0-9\.\/\-]+)
Replace: $1\n
Options: check the Regular expression and the . matches newline
This will leave you with a list of all your links. There are two issues, though:
The regex you provided for matching URLs is far from being generic enough to match any URL. If it is working in your case, that's fine, else check this question.
It will leave the text after the last matched URL intact. You have to delete it manually.
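If Notepad++ ever becomes limiting, the same extraction can be done outside the editor. Here is a minimal Python sketch using the pattern from the question (the input filename is hypothetical):
import re

with open("document.txt", encoding="utf-8") as f:  # hypothetical input file
    text = f.read()

links = re.findall(r'http://www\.[a-zA-Z0-9./-]+', text)
print("\n".join(links))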
The answer previously given by @psxls was a great help for me when I wanted to perform a similar process.
However, that regex rule was written six years ago now; accordingly, I had to adjust/complete/update it so that it can properly work with some recent links, because:
a lot of URLs now use the HTTPS instead of the HTTP protocol
many websites use www less as the main subdomain
some links add punctuation marks (which have to be preserved)
I finally reshuffled the search rule to .*?(https?\:\/\/[a-zA-Z0-9[:punct:]]+) and it worked correctly with the file I had.
Unfortunately, this seemingly simple task is going to be almost impossible to do in Notepad++. The regex you would have to construct would be... horrible. It might not even be possible, but if it is, it's not worth it. I pretty much guarantee that.
However, all is not lost. There are other tools more suitable to this problem.
Really what you want is a tool that can search through an input file and print out a list of regex matches. The UNIX utility "grep" will do just that. Don't be scared off because it's a UNIX utility: you can get it for Windows:
http://gnuwin32.sourceforge.net/packages/grep.htm
The grep command line you'll want to use is this:
grep -o 'http:\/\/www.[a-zA-Z0-9./-]\+\?' <filename(s)>
(Where <filename(s)> are the name(s) of the files you want to search for URLs in.)
You might want to shake up your regex a little bit, too. The problems I see with that regex are that it doesn't handle URLs without the 'www' subdomain, and it won't handle secure links (which start with https). Maybe that's what you want, but if not, I would modify it thusly:
grep -o 'https\?:\/\/[a-zA-Z0-9./-]\+\?' <filename(s)>
Here are some things to note about these expressions:
Inside a character group, there's no need to quote metacharacters except for [ and (sometimes) -. I say sometimes because if you put the dash at the end, as I have above, it's no longer interpreted as a range operator.
The grep utility's syntax, annoyingly, is different than most regex implementations in that most of the metacharacters we're familiar with (?, +, etc.) must be escaped to be used, not the other way around. Which is why you see backslashes before the ? and + characters above.
Lastly, the repetition metacharacter in this expression (+) is greedy by default, which could cause problems. I made it lazy by appending a ? to it. The way you have your URL match formulated, it probably wouldn't have caused problems, but if you change your match to, say [^ ] instead of [a-zA-Z0-9./-], you would see URLs on the same line getting combined together.
I did this a different way.
Find everything up to the first/next occurrence of (https or http), then everything that follows it, up to (html or htm); then output just '(https or http)(everything in between)(html or htm)' with a line feed/carriage return after each.
So:
Find: .*?(https:|http:)(.*?)(html|htm)
Replace with: \1\2\3\r\n
This saves looking for all possible (including non-generic) URL matches.
You will need to manually remove any text after the last matched URL.
It can also be used to create URL links:
Find: .*?(https:|http:)(.*?)(html|htm)
Replace: \1\2\3\r\n
or image links (jpg/jpeg/gif):
Find: .*?(https:|http:)(.*?)(jpeg|jpg|gif)
Replace: <img src="\1\2\3">\r\n
I know my answer won't be regex-related, but here is another efficient way to get lines containing URLs.
This won't remove the text around links, as Toto mentioned in the comments.
At least it works if there is a nice pattern to all the links, like https://.
CTRL+F => change tab to Mark
Insert https://
Tick Mark to bookmark.
Mark All.
Find => Bookmarks => Delete all lines without bookmark.
I hope someone who lands here in search of the same problem will find my way more user-friendly.
You can still use RegEx to mark lines :)

Why isn't DownThemAll able to recognize my reddit URL regular expression?

So I'm trying to download all my old reddit posts using a combination of AutoPagerize and DownThemAll.
Here are two sample URLs I want to distinguish between:
http://www.reddit.com/r/China/comments/kqjr1/what_is_the_name_of_this_weird_chinese_medicine/c2med97
http://www.reddit.com/r/China/comments/kqjr1/what_is_the_name_of_this_weird_chinese_medicine/c2meana?context=3
The regexp I'm trying to use is this: (\b)http://www.reddit.com/([^?\s]*)?
I want all my reddit posts downloaded, but I don't want any redundancy, so I want to match all of my reddit posts except for anything with a question mark (after which there's a "context=3" parameter).
I've used RegEx Buddy to show that the regexp fits the first URL but not the second one. However, DownThemAll does not recognize this. Is DownThemAll's ability to parse regexp limited, or am I doing something wrong?
For now, I've just decided to download them all, but to use a renaming mask of *subdirs*.*text*.*html* so that I can later mass remove anything containing the word "context" in its filename.
Reddit does have an API; you might want to take a look at that instead, it might be easier.
https://github.com/reddit/reddit/wiki/API
EDIT: It looks like http://www.reddit.com/user/USERNAME/.json might be what you want.
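For what it's worth, here is a minimal Python sketch of pulling post permalinks from that .json listing (this is an assumption-heavy outline, not an official client: USERNAME is a placeholder, the User-Agent string is made up, and the listing/paging fields are written from memory of Reddit's listing format):
import requests

url = "https://www.reddit.com/user/USERNAME/.json"   # replace USERNAME with your account
headers = {"User-Agent": "my-post-archiver/0.1"}     # Reddit expects a descriptive User-Agent
after = None
while True:
    params = {"limit": 100, "after": after}
    listing = requests.get(url, headers=headers, params=params).json()["data"]
    for child in listing["children"]:
        print("https://www.reddit.com" + child["data"]["permalink"])
    after = listing["after"]
    if after is None:  # no more pages
        break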