MediaMonks crawler, blacklist - regex

I'm using the MediaMonks crawler (the package from Packagist) to crawl some websites.
There is a blacklist function, and I'd like to use it to avoid crawling any URLs that have hashtags in them.
Something like this:
// TODO: Write the correct regular expression.
$crawler->addBlacklistUrlMatcher(new Matcher\PathRegexUrlMatcher('/#/'));
I'm really bad with regular expressions; can anyone help me with this?

It probably depends on how the blacklist matcher works, but in the general case, if you want to catch any line containing the # symbol, this is what you need to put between the delimiters:
.*\#.*
This will match whole lines containing the # symbol, for instance:
#somehashtag
#some hash tag
This site will help you create regexes for your needs: https://regex101.com
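If you want to sanity-check the pattern outside the crawler first, here is a minimal standalone sketch in TypeScript (the crawler itself is PHP; this only demonstrates what the pattern matches, not the MediaMonks API):

const pattern = /.*#.*/; // same pattern as above, in a JS-style literal
console.log(pattern.test("https://example.com/page#section")); // true
console.log(pattern.test("https://example.com/page"));         // false

One caveat: a URL's #fragment is normally not part of the path, so whether a path-based matcher ever sees the # depends on how the library parses the URL.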

Related

How do I write a regex to exclude words contained inside a file?

I want to perform email validation that excludes emails from popular domains. As a reference, I am using the email domains from here: https://github.com/mailcheck/mailcheck/wiki/List-of-Popular-Domains. I am planning to put them inside a resource file.
How do I write a regex that will exclude emails ending with these domains (i.e. that will exclude emails ending with the words from this file)?
I want to write it in TypeScript, if that matters.
What you are looking for is a negative lookahead.
(I'm gonna simplify the first part of the regex so it's easily understandable)
^[^@]*@(?!gmail\.com|aol\.com|mailinator\.com|...)[^@]*$
However, since you would have to programmatically build this string, and it would be a huge regex, you might want to consider other options, like parsing the @domain part of your input and doing a simple arrayOfDomains.indexOf(thisDomain) > -1 check to detect whether it's in the list.
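A minimal sketch of that second approach in TypeScript (the names popularDomains and isAllowedEmail are my own, and in practice the domain list would be loaded from your resource file):

// Loaded from the resource file in a real implementation.
const popularDomains: string[] = ["gmail.com", "aol.com", "mailinator.com"];

function isAllowedEmail(email: string): boolean {
  const at = email.lastIndexOf("@");
  if (at === -1) return false; // not even email-shaped
  const domain = email.slice(at + 1).toLowerCase();
  return popularDomains.indexOf(domain) === -1; // reject popular domains
}

console.log(isAllowedEmail("frodo@gmail.com"));   // false
console.log(isAllowedEmail("frodo@example.org")); // true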

Import.io and Portia regex URL patterns

I am using data scrapers: Import.io & Portia.
They both allow you to define a regular expression for the crawler to abide by.
For example, take the URL: https://weedmaps.com/dispensaries/pdi-medical
How would I account for the ending "pdi-medical"?
I've looked all over and understand how to use regex in a JS environment, but I'm a little confused as to what exactly I'd put in the input on Portia/Import.io.
Something like this?
https://weedmaps.com/dispensaries//^[a-zA-Z0-9-_]+$/
For Portia, if you want your crawler to follow any URLs starting with https://weedmaps.com/dispensaries/, you can just add a crawling rule with the following regex:
^https?://weedmaps.com/dispensaries/
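As a quick sanity check of what that rule accepts (TypeScript, just for illustration; note I've escaped the dot here, which is slightly stricter than the rule above):

const rule = /^https?:\/\/weedmaps\.com\/dispensaries\//;
console.log(rule.test("https://weedmaps.com/dispensaries/pdi-medical")); // true
console.log(rule.test("https://weedmaps.com/about"));                    // false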

KimonoLabs crawler Generated URL List with regex

So, I'm trying to crawl a website that has like 7,000 product pages and the link structure is like this:
https://example.com/category/sub-category/numericid-name-of-the-product/
What I'm trying to achieve is to generate a URL list; the Kimono app has that option, and it actually sections the URL, but I'm only offered default value, range, and custom list.
I tried to put in stuff like "/.+/" to match all the characters, but that does not work, and I couldn't find any help on that in the official KB.
I know that Import.io has "{alphanumeric}", for example, for different parts of the URL so it matches them; is there a way to accomplish that in the KimonoLabs app?
Try this regex: https://example.com/([^/]+)/([^/]+)/([0-9]+)-([^/]+)
Note: you may need to escape some characters (namely / would be escaped as \/).
Also, I'm not familiar with KimonoLabs, so I don't know if this is what you're looking for exactly. Feel free to clarify.
Explanation
https://example.com/ literally
([^/]+)/ a bunch of not-/s, followed by a / (this appears twice: category and sub-category)
([0-9]+)-([^/]+) numbers, a literal -, followed by another bunch of not-/s
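If it helps to see the groups in action, here is the same pattern applied to a made-up product URL in TypeScript (Kimono itself only needs the regex string; this just shows what each group captures):

const pattern = /https:\/\/example\.com\/([^\/]+)\/([^\/]+)\/([0-9]+)-([^\/]+)/;
const m = "https://example.com/category/sub-category/1234-some-product/".match(pattern);
// m[1] = "category", m[2] = "sub-category", m[3] = "1234", m[4] = "some-product"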

Nutch Domain Regular Expression

I am following the tutorial here, trying to build a robot to crawl a website.
I am in a page that contains all the product categories. Say it is www.example.com/allproducts.
After diving into each category, you can see the product list in a table format, and you can click the next page to loop through all the pages inside that category. Actually, you can only see pages 1, 2, 3, 4, 5, and last.
The first page in the category has a URL that looks like www.example.com/level1/level2/_/N-1, and the second page looks like www.example.com/level1/level2/_/N-1/?No=100, and so on and so forth.
I personally don't have much Java programming experience, and I am wondering:
can I crawl all the product list pages using Nutch and store the HTML for now,
and maybe later figure out a way to parse/index the HTML correctly?
(1) Can I just modify conf/regex-urlfilter.txt and replace
# accept anything else
+.
with something correct? (I just don't understand how
+^http://([a-z0-9]*\.)*nutch.apache.org/
could only restrict the URLs to the Nutch domain... I would interpret that regular expression as: between the double slash and nutch, there can be any characters that are alphanumeric, or an asterisk, backslash, or dot.)
How can I build the regular expression so it only scrapes http://www.example.com/.../.../_/N-../...?
(2) I can see the HTML is stored in the content folder inside the segment... However, when I open that file in vi, it just looks like nonsense to me, and I am wondering if that is the so-called Java serialization, which I need to deserialize in Java to read it.
Forgive me if those questions are too basic and thanks a lot for reading.
(1) Can I just modify conf/regex-urlfilter.txt and replace
Sure. You should replace +. with these lines:
#accept all products page
+www\.example\.com/allproducts
#accept categories pages
+www\.example\.com/level1/level2/_/N-
One important note about the regexes in this file: the regular expressions are partially matched. So if you write a rule like "+ab", it means: accept all URLs that contain "ab", so it matches these URLs (illustrated in the sketch after the list):
ab
abc
http://ab.com/c.html
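In other words (a TypeScript illustration of the partial-match behaviour only, not of how Nutch applies its rules internally):

const rule = /ab/; // the equivalent of the Nutch rule "+ab"
console.log(rule.test("ab"));                   // true
console.log(rule.test("abc"));                  // true
console.log(rule.test("http://ab.com/c.html")); // true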
By default, Nutch filters out URLs containing ? (since they are mostly dynamic pages). To prevent this, comment out this line in your regex-urlfilter.txt file:
-[?*!#=]
(2) I can see the HTML ...
Nutch saves the files in binary format. See https://stackoverflow.com/a/10150402/1881318

Need to create a Gmail-like search syntax; maybe using regular expressions?

I need to enhance the search functionality on a page listing user accounts. Rather than have multiple search boxes for each possible field, or a drop-down menu where the user can only search against one field, I'd like a single search box and to use a Gmail-like syntax. That's the best way I can describe it, and what I mean by a Gmail-like search syntax is being able to type the following into the input box:
username:bbaggins type:admin "made up plc"
When the form is submitted, the search string should be split into its separate parts, which will allow me to construct a SQL query. So, for example, type:admin would form part of the WHERE clause so that it would find any record where the field type is equal to admin, and the same for username. The text in quotes may be a free text search, but I'm not sure on that yet.
I'm thinking that a regular expression or two would be the best way to do this, but that's something I'm really not good at. Can anyone help to construct a regular expression which could be used for this purpose? I've searched around for some pointers but either I don't know what to search for or it's not out there as I couldn't find anything obvious. Maybe if I understood regular expressions better it would be easier :-)
Cheers,
Adam
No, you would not use regular expressions for this. Just split the string on spaces in whatever language you're using.
You don't necessarily have to use a regex. Regexes are powerful, but in many cases also slow. Regex also does not handle nested parameters very well. It would be easier for you to write a script that uses string manipulation to split the string and extract the keywords and the field names.
If you want to experiment with regex, try an online regex tester. Find a tutorial and play around; it's fun, and you should quickly be able to produce useful regexes that find any words before or after a : character, or any sentences between " quotation marks.
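To make the split-based suggestion concrete, here is a minimal TypeScript sketch combining the two ideas above: a small regex pulls out the quoted phrases, and a plain space-split handles the field:value parts (the name parseQuery and the return shape are my own):

function parseQuery(input: string): { fields: Record<string, string>; terms: string[] } {
  const fields: Record<string, string> = {};
  const terms: string[] = [];
  // Extract "quoted phrases" first so they survive the space split.
  const quoted = input.match(/"[^"]*"/g) ?? [];
  let rest = input;
  for (const q of quoted) {
    terms.push(q.slice(1, -1)); // strip the quotes
    rest = rest.replace(q, " ");
  }
  for (const token of rest.split(/\s+/).filter(Boolean)) {
    const colon = token.indexOf(":");
    if (colon > 0) {
      fields[token.slice(0, colon)] = token.slice(colon + 1); // e.g. type -> admin
    } else {
      terms.push(token); // free-text term
    }
  }
  return { fields, terms };
}

parseQuery('username:bbaggins type:admin "made up plc"');
// => { fields: { username: "bbaggins", type: "admin" }, terms: ["made up plc"] }

Each entry in fields can then be mapped to a parameterised WHERE condition (never interpolate the values into the SQL directly).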
Thanks for the answers... I did start doing it without regex and just wondered if a regex would be simpler. Sounds like it wouldn't, though, so I'll go back to the way I was doing it and test it again.
Good old Mr Bilbo is my go-to guy for any naming needs :-)
Cheers,
Adam