Surprise! I have another Apache Nutch v1.5 question. So in crawling and indexing our site to Solr via Nutch, we need to be able to exclude any content that falls under a certain path.
So say we have our site: http://oursite.com/ and we have a path that we don't want to index at http://oursite.com/private/
I have http://oursite.com/ in the seed.txt file, and the following in the regex-urlfilter.txt file:
+^http://www.oursite.com/([a-z0-9\-A-Z]*\/)*
I thought that putting: -.*/private/.* also in the regex-urlfilter.txt file would exclude that path and anything under it, but the crawler is still fetching and indexing content under the /private/ path.
Is there some kind of restart I need to do on the server, like Solr? Or is my regex not actually the right way to do this?
Thanks!
My guess is that the URL is accepted by the first regex and the second one is never checked. If you want to deny URLs, put their regexes first in the list.
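For example, regex-urlfilter.txt could be ordered like this, using the patterns from the question (the rules are checked top to bottom and the first match wins):

# skip anything under /private/ -- listed first so it takes effect
-.*/private/.*
# accept everything else on the site
+^http://www.oursite.com/([a-z0-9\-A-Z]*\/)*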
Related
I am trying to create a view in Google Analytics to filter out the analytics from multiple subdirectories and all pages in them.
www.example.com/mysite
www.example.com/anothersite
www.example.com/lastsite
This is the regex I have written, but when I run it, no results get returned:
^/(mysite|anothersite|lastsite)?/*
Any ideas what I am doing wrong?
I was able to find a solution by first adding a trailing slash to the URLs (see: https://www.getelevar.com/how-to/fix-duplicate-url-google-analytics/) and then using a request URI regex pattern of:
^/(mysite|anothersite|lastsite)/
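With the trailing slashes in place, that pattern matches request URIs such as /mysite/ and /anothersite/some-page, but not something like /mysite-blog/ or the bare /mysite without the slash.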
In Symfony 2.8.
I want to list every URL's permissions (e.g. roles) in order to find which URLs are not protected.
The listing should be in the same format as the access_control option in the security config.
How can I do this?
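Something like the access_control block in app/config/security.yml, for example (the paths and roles below are just placeholders):

security:
    access_control:
        - { path: ^/admin, roles: ROLE_ADMIN }
        - { path: ^/account, roles: ROLE_USER }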
As far as I know, and after some research, what you are looking for doesn't exist out of the box. You could look into extending the php bin/console debug:router command to include what security checks exist for each route.
Another option would be to manually go through all the routes listed in the debug output and look at the security requirements in the _profiler.
We have a requirement where we need to crawl one particular set of URLs.
Say, for example, we have the site abc.com. We need to crawl abc.com/test/needed -- all URLs matching this pattern under the "needed" folder. But we don't want to crawl the rest of the URLs under abc.com/test/.
I guess this can be done using a regex. Can anyone help me with the regex?
Going from what you said in the comment, here is a pattern that matches things of the form /xyz but not things of the form /xyz/imp:
/xyz(/[^i][^m][^p].*)?|/xyz/.{0,2}
The pattern to add to the GSA can be:
abc.com/test/needed
or
contains:abc.com/test/needed
The thing to consider is how the GSA will get to these documents. If it can't spider its way to the folder, it won't find them.
There are 3 specifications that you are allowed to make in the GSA.
Start Crawl URLs -- these tell the GSA where to start looking for links.
Follow and crawl only URL patterns -- these tell the GSA which URLs, from among those found starting from the "Start Crawl URLs", need to be followed and indexed.
Do not crawl URLs -- these are URL patterns that match the above 2 specifications but should not be crawled.
From what has been specified in the question itself, I think all you'll need to do is put in a "Start Crawl" URL of "abc.com/" and a "Follow and crawl only" specification of "abc.com/test/needed/", assuming you need no other path/folder on the site crawled.
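Put together, the entries would look roughly like this (adjust the hostname and folder to your actual site):

Start Crawl URLs:
abc.com/

Follow and Crawl Only URL Patterns:
abc.com/test/needed/

Do Not Crawl URLs:
(nothing needed for this case)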
I want to create a filter for a profile that looks at 2 things: a subdomain (subdom.mycomp.com) and a folder within the regular domain (www.mycomp.com/industrysolutions/).
Included is a screenshot of the current filter; however, it only reports on pages in the folder. I'm not sure if I'm on the right track.
Any suggestions?
I don't believe you can have two include filters on a profile. Like you said, the data just stops flowing in, so what I would do is create a filter for the hostname (see below) and then create an advanced segment that looks exclusively at the request URIs you're after.
Include Hostname subdomain\.mycomp\.com
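The advanced segment could then use a condition along the lines of Page matches regex ^/industrysolutions/ (the folder is taken from your example), so only the hits under that folder show up.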
Found this in another post but I needed to make some minor edits to make it work:
I have a bespoke CMS for a website that stores any uploaded files in the /Assets/ folder.
I'm preparing to move the website to the Azure platform and need some way of rewriting the links within the web pages. Here is what a current link looks like:
link to file
What do you suggest would be the best way to change those links to something like:
link to file
There are hundreds of pages with tons of links. Also, to throw a spanner in the works, not all links are in subfolders within the Assets folder.
Some links are like:
link to file
Suggestions are welcome; I'm open to anything: regex, HtmlAgilityPack, or plain old string.Replace, but I can't seem to get my head around how to do it...
You should take a look at the IIS URL Rewrite module, which is installed on Azure web roles.
http://www.iis.net/learn/extensions/url-rewrite-module/using-the-url-rewrite-module
It basically allows you to define patterns using regex and output URLs however you like. In your case it would be quite simple to rewrite your embedded links to their locations in blob storage.
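As a rough sketch (the storage account and container names below are placeholders), an inbound rule in web.config that redirects requests for the old asset paths to blob storage could look like:

<system.webServer>
  <rewrite>
    <rules>
      <rule name="Assets to blob storage" stopProcessing="true">
        <match url="^Assets/(.*)" />
        <action type="Redirect" url="https://youraccount.blob.core.windows.net/assets/{R:1}" redirectType="Permanent" />
      </rule>
    </rules>
  </rewrite>
</system.webServer>

The module's outbound rules can also rewrite the links inside the HTML responses themselves, if you'd rather not rely on redirects.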
It's OK, I have sorted it; thanks for your comments. If anyone is interested, I did a search & replace SQL query on the database.
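For anyone wanting to do the same, the query would be something along these lines (the table, column, and target URL are only placeholders; back up the database first):

UPDATE Pages
SET Content = REPLACE(Content, '/Assets/', 'https://youraccount.blob.core.windows.net/assets/')
WHERE Content LIKE '%/Assets/%';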