How to Make Relative Href Work in SvelteKit?

I want to build a Web app with SvelteKit, with one page listing all items (with potential search query parameters) and then one page for each individual item. If I had to build this the old-school way with everything generated in the backend, my paths would be /items/ for the listing of items, /items/123 for item 123, etc. That is, to go to the page of item 123, a link with href="123" would work no matter whether you are currently at the index (/items/) or at the page of one particular item (/items/[id]).
With SvelteKit, if I create files routes/items/index.svelte and routes/items/[id].svelte, then routes/items/index.svelte will have path /items, without a trailing slash, and as a result a link with href="123" will lead to /123, resulting in a "not found" error.
This same link will, however, work from the page of an individual item, say /items/456.
This is radically different from what you would have in the traditional HTML model, where a link from /items/ (or /items/index.html) would work the same as a link from /items/[id].html.
Now in svelte.config.js there is a trailingSlash option you can set to always so that routes/items/index.svelte corresponds to path /items/, but then routes/items/[id].svelte has path /items/[id]/ and we have the same problem again: one href value cannot work from both the index and the page of an individual item.
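For reference, setting that option in svelte.config.js looks roughly like this (a sketch; the exact config shape may depend on your SvelteKit version):

// svelte.config.js (sketch): trailingSlash applies to every route
const config = {
    kit: {
        trailingSlash: 'always'
    }
};

export default config;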
The only way I see right now is to use absolute paths, but that's not very composable. My guess is that there is something I am doing wrong.

You're not missing anything - it's not currently possible in SvelteKit to have a trailing slash for some pages but not for others. There is an open GitHub issue you may be interested in that proposes adding additional trailingSlash options. This issue cites the exact problem you described:
The trailingSlash options introduced in #1404 don't make it straightforward to add trailing slashes to index pages but not to leaf pages. For example you might want to have /blog/ and /blog/some-post rather than /blog and /blog/some-post, since that allows the index page to contain relative links.
Until that feature is added, you'll need to use absolute paths.
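For example, a listing page built with absolute links might look something like this sketch (the items prop and its fields are hypothetical):

<!-- routes/items/index.svelte (sketch) -->
<script>
    // hypothetical data shape: [{ id: 123, name: 'Widget' }, ...]
    export let items = [];
</script>

{#each items as item}
    <a href="/items/{item.id}">{item.name}</a>
{/each}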

Related

Exclude urls without 'www' from Nutch 1.7 crawl

I'm currently using Nutch 1.7 to crawl my domain. My issue is specific to URLs being indexed as www vs. non-www.
Specifically, after running the crawl and indexing to Solr 4.5, then validating the results on the front end with AJAX Solr, the search results page lists results/pages with both 'www' and non-'www' URLs, such as:
www.mywebsite.com
mywebsite.com
www.mywebsite.com/page1.html
mywebsite.com/page1.html
My understanding is that the URL filtering (aka regex-urlfilter.txt) needs modification. Are there any regex/Nutch experts who could suggest a solution?
Here is the code on pastebin.
There are at least a couple of solutions.
1.) urlfilter-regex plugin
If you don't want to crawl the non-www pages at all, or want to filter them out at a later stage such as index time, that is what the urlfilter-regex plugin is for. It lets you mark URLs to be crawled with regex patterns prefixed with "+"; anything that does not match a "+" pattern will not be crawled. Additionally, if you want to specify a general pattern but exclude certain URLs, you can add patterns prefixed with "-" to exclude those URLs.
In your case you would use a rule like:
+^(https?://)?www\.
This will match anything that starts with:
https://www.
http://www.
www.
and therefore will only allow such URLs to be crawled.
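Putting that together, the relevant part of regex-urlfilter.txt might look like this sketch (lines starting with '#' are comments; URLs that match no '+' rule are dropped):

# crawl only URLs on the www host
+^(https?://)?www\.
# explicitly reject everything else
-.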
Given that the URLs you listed were not being excluded by your regex-urlfilter rules, either the plugin wasn't turned on in your nutch-site.xml, or it is not pointed at that file.
In nutch-site.xml you have to specify regex-urlfilter in the list of plugins, e.g.:
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-basic|query-(basic|site|url)|response-(json|xml)|urlnormalizer-(pass|regex|basic)</value>
</property>
Additionally, check that the property specifying which file to use is not overridden in nutch-site.xml and is correct in nutch-default.xml. It should be:
<property>
  <name>urlfilter.regex.file</name>
  <value>regex-urlfilter.txt</value>
  <description>Name of file on CLASSPATH containing regular expressions
  used by urlfilter-regex (RegexURLFilter) plugin.</description>
</property>
and regex-urlfilter.txt should be in the conf directory for nutch.
There is also the option to perform the filtering only at certain steps, e.g. index time, if you only want to filter then.
2.) solrdedup command
If the URLs point to the exact same page, which I am guessing is the case here, they can be removed by running the nutch command to delete duplicates after crawling:
http://wiki.apache.org/nutch/bin/nutch%20solrdedup
This will use the digest values computed from the text of each indexed page to find any pages that were the same and delete all but one.
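For example, with a local Solr instance at the default address (adjust the URL to your setup), the invocation would be along these lines:

bin/nutch solrdedup http://localhost:8983/solr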
However you would have to modify the plugin to change which duplicate is kept if you want to specifically keep the "www" ones.
3.) Write a custom indexing filter plugin
You can write a plugin that reads the URL field of a Nutch document and converts it in any way you want before indexing. This gives you more flexibility than using an existing plugin like urlnormalizer-regex.
It is actually very easy to make plugins and add them to Nutch, which is one of the great things about it. As a starting point you can copy and look at one of the other plugins included with Nutch that implement IndexingFilter, such as the index-basic plugin.
You can also find a lot of examples:
http://wiki.apache.org/nutch/WritingPluginExample
http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html

Is there a way to get a permanent link (permalink) to the current version of a DokuWiki page?

Is there a way to generate a permanent link to the current version of a page? I can get a link to previous versions (for example https://www.dokuwiki.org/faq:support?rev=1354115567) by clicking Old Revisions.
Using ?rev=0 in the URL will always take you to the revision that is current at the time. Maybe a null-edit would help (the content doesn't change, and you may get an edit history entry).
Now, if you have actual access to the file server where your DokuWiki is hosted, you can fetch the id of the most recent revision (the at-the-moment current one): first check the id of the latest edit under Old Revisions, then go to the ${DOKUWIKI}/data/attic directory and look at the numbers in the filenames that correspond to the page name you want. There will be one file with a more recent id (a higher numeric value) which, if I'm not mistaken, corresponds to the current revision. For example, for a last edit of mypage.1263571254.txt.gz you might find one higher index of, say, mypage.1291408231.txt.gz.
EDIT: the same value does show up somewhere in the rendered page's source. If you have section editing enabled and the buttons show up on the page, search the source (Ctrl+U in most browsers) for form class="button btn_secedit", then continue reading until you reach an input element with name="rev" and a value= that should correspond to the number of the current revision.
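If you'd rather not scan the HTML by hand, a browser-console one-liner along these lines should pull that value out (assuming the default template renders the section-edit form with exactly those attributes):

// run in the browser console while viewing the wiki page
document.querySelector('form.btn_secedit input[name="rev"]').value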
Using that found id as the argument of ?rev= should reveal whether it is the correct one, and as a result get you the permalink to this revision.
Note: I have not tried this in my own install.

Django and TinyMCE: getting list of image files using external_image_list_url

I am trying to get external_image_list_url to work with TinyMCE and Django. My understanding is that the user will be able to see a list of his or her images when clicking on the image icon (the one with the tree). From there an image can be selected and inserted.
Am I right that it is the icon with the tree? It's the one called "image". The one called "insertimage" doesn't work at all: the icon is not displayed. It and insertfile are the only ones that aren't displayed. I'm OK with that unless I need it for this list functionality.
First I am trying just to get any image to appear in a list. I have done the following:
Created myexternallist.js and placed it where I keep my other JS files. I can access these other files via src = "/media/js/filename.js" because of my django settings. But is this also what I should put for:
tinyMCE.init({
    external_image_list_url : "/media/js/myexternallist.js",
    ...
})
In this file it says:
var tinyMCEImageList = new Array(["Logo 1", "/media/js/photo.jpg"],);
photo.jpg is in the same folder as myexternallist.js
I have also tried just "photo.jpg" and various other combinations. Not sure if my issue has to do with my relative paths or something else altogether. I'm not sure what an absolute path should be. Right now I'm working on localhost, but won't always be.
Solved it, argh. The issue was the comma near the end of the line "var tinyMCEImageList =..."
Now it works fine with the relative urls starting with /media
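For reference, the working declaration is the same line without that trailing comma:

var tinyMCEImageList = new Array(["Logo 1", "/media/js/photo.jpg"]);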
The clue was given by Firebug Console, which showed me the js error. I just happened to click there, but will be using that a lot from now on!

Configuring LucidWorks Include Paths to only crawl certain file types

I'm trying to configure the LucidWorks web data source to only index certain file types. However, when I set Include paths to .*\.html to only crawl .html files (as a simplified example), it only ends up indexing the top level folder. Crawl depth is set to -1 and when I leave Include paths blank, it crawls the whole sub-tree as expected.
I've looked at their documentation for creating a web data source, and for Using Regular Expressions, and can't find a reason why .*\.html would not work, since .* should match any character.
As I was proofreading the question, I had an idea which was the correct solution. Posting it here for posterity.
The content being crawled is a file share, so crawling relies on the web server's directory listings, which were being filtered out because they don't have a .html extension. So simply adding .*/ to the Include paths fixed the problem.
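In other words, the Include paths field ends up containing both patterns, something like:

.*/
.*\.html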

Is adding a robots.txt to my Django application the way to get listed by Google?

I have a website (Django) on a Linux server, but Google isn't finding the site at all. I know that I don't have a robots.txt file on the server. Can someone tell me how to create one, what to write inside, and where to place it? That would be a great help!
robots.txt is not what makes Google find your site. You should register your site with Google and also add a sitemap.xml.
Webmaster Tools - Crawl URL ->
https://www.google.com/webmasters/tools/submit-url?continue=/addurl&pli=1
Also see this for robots.txt:
Three ways to add a robots.txt to your Django project | fredericiana
-> http://fredericiana.com/2010/06/09/three-ways-to-add-a-robots-txt-to-your-django-project/
What is robots.txt?
It is great when search engines frequently visit your site and index your content but often there are cases when indexing parts of your online content is not what you want. For instance, if you have two versions of a page (one for viewing in the browser and one for printing), you'd rather have the printing version excluded from crawling, otherwise you risk being imposed a duplicate content penalty. Also, if you happen to have sensitive data on your site that you do not want the world to see, you will also prefer that search engines do not index these pages (although in this case the only sure way for not indexing sensitive data is to keep it offline on a separate machine). Additionally, if you want to save some bandwidth by excluding images, stylesheets and javascript from indexing, you also need a way to tell spiders to keep away from these items.
One way to tell search engines which files and folders on your Web site to avoid is with the use of the Robots metatag. But since not all search engines read metatags, the Robots metatag can simply go unnoticed. A better way to inform search engines of your wishes is to use a robots.txt file.
from What is Robots.txt -> http://www.webconfs.com/what-is-robots-txt-article-12.php
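For illustration, a minimal robots.txt that allows crawling of everything and points crawlers at a sitemap might look like this (the sitemap URL is just a placeholder):

User-agent: *
Disallow:
Sitemap: http://www.example.com/sitemap.xml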
robots.txt files are used to tell search engines which content should or should not be indexed. A robots.txt file is in no way required in order for your site to be indexed by a search engine.
There are a number of things to note about being indexed by search engines.
There is no guarantee you will ever be indexed
Indexing takes time: a month, two months, six months
To get indexed quicker, try sharing a link to your site through blog comments etc. to increase the chances of being found.
Submit your site through the http://google.com/webmasters site; this will also give you hints and tips to make your site better, as well as crawling stats.
Place robots.txt in the same directory as views.py, and use code like this.
In the view:
import os.path

from django.http import HttpResponse

def robots(request):
    # robots.txt sits next to this views.py file
    BASE = os.path.dirname(os.path.abspath(__file__))
    with open(os.path.join(BASE, 'robots.txt')) as robots_file:
        content = robots_file.read()
    return HttpResponse(content, content_type='text/plain')
In urls.py:
(r'^robots.txt', 'aktel.views.robots'),