Can YQL Open Data Tables make use of multiple URL fields that its XML schema seems to support?

As I experiment more and more with making my own Open Data Tables for YQL, I find what might be some gaps in the documentation. As I'm a hands-on learner and like to understand everything I use, I probe these gaps to try to learn how everything works.
I've noticed that in the XML format for Open Data Tables, there is a <urls> "array" which usually contains just a single <url> element, though sometimes there is no <url> at all. Here's the beginning of a typical ODT XML file:
<table xmlns="http://query.yahooapis.com/v1/schema/table.xsd" https="true">
  <meta>
    <author>Paul Donnelly</author>
    <documentationURL>http://developer.netflix.com/docs/REST_API_Reference#0_52696</documentationURL>
  </meta>
  <bindings>
    <select itemPath="" produces="XML">
      <urls>
        <url env="all">http://api.netflix.com/catalog/titles/</url>
      </urls>
But I can't find anywhere in the documentation whether it can ever contain more than one. I can't find any examples that do, but when I try adding more than one, everything works and no errors are thrown; however, I also can't find any way to access the <url> elements beyond the first one.
Is there any use for the url/urls fields being an XML array? Is there any way to make use of more than one url here? Or is it just a quirk of the format that has no real reason?

Is there any use for the url/urls fields being an XML array?
Is there any way to make use of more than one url here?
The <url> elements can have an env attribute. This env attribute can contain all, prod, int, dev, stable, nightly, perf, qaperf, gamma or beta.
When the table is executed, the current environment (the YQL environment, not the more familiar environment file) is checked and the first matching <url> (if any) is used. If no matching env is found (and there is no all, which is pretty self-descriptive) then an error will be issued; for example, "Table not defined in this environment prod".
Note that for public-facing YQL, the environment is prod; only prod and all make sense to be used in your Open Data Tables.
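By way of illustration, a hedged sketch (the endpoint URLs here are hypothetical) of a <urls> block that targets different YQL environments, with all as the fallback:
<urls>
  <url env="int">http://internal-api.example.com/catalog/titles/</url>
  <url env="prod">http://api.example.com/catalog/titles/</url>
  <url env="all">http://api.example.com/catalog/titles/</url>
</urls>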
Or is it just a quirk of the format that has no real reason?
Not at all.
I assume that this information is "missing" from the online documentation purely because it is only useful internally within Yahoo!, but equally it could just be another place where the docs are somewhat out-of-date.
Finally, none of the 1,100 or so Community Open Data Tables specify more than one <url>, and only a handful (55) make use of the env attribute (all using the value all).

Related

How to break caching of included XSLs in Transform on eXist-db

I have a large set of XSLs that we recently went through, implementing a shared XSL template for the common bits. We added an xsl:include to all the main XSLs to pull these in. We had no issues at first, until we started to make changes to the shared XSL.
For background: the whole system is web based, calling queries that dynamically format documents in the database with different XSLs, through XSL-FO and RenderX.
The main transform is:
let $fo := util:expand(transform:transform($articles, doc("/db/Customer/data/edit/xsl/Custbatch.xsl"), $parameters))
That XSL (Custbatch.xsl) has:
<xsl:include href="Custshared.v1.xsl"/>
If we make an edit to "Custshared.v1.xsl", it is not reflected in the result; clearly "Custshared.v1.xsl" is being cached and reused. We know this because, as you can see, the name now includes "v1": if we make a change and bump all the references from v1 to v2, it all works. But this seems a bit ridiculous, as it means we have to change the 18 XSLs that include this XSL, or do something silly like restart the database.
So what am I missing in the setup, or in controller.xql (which has the following for all unmatched paths), to get things not to cache? I assume the transformation is all internal, so this setting likely does not matter. Is there some other setting in the config that does?
<dispatch xmlns="http://exist.sourceforge.net/NS/exist">
    <cache-control cache="no"/>
</dispatch>
In reading the document here: http://exist-db.org/exist/apps/doc/xsl-transform.xml, it states:
"The stylesheet will be compiled into a template using the standard Java APIs (javax.xml.transform). The template is shared between all instances of the function and will only be reloaded if modified since its last invocation."
However, if I change an included XSL, it is not being used.
Update #1
I even went as far as creating a query that returns the XSL that is included, then I use:
<xsl:include href="http://localhost/get-include-xsl.xq"/>
This does work, as the formatting is not broken, but changing the underlying XSL yields the same result. So even that XQuery result is cached.
Update #2
And yes, some simple tests prove it all.
If I make any change to the root template (like adding a meaningless space) and run, it does include the changes made in the include. If I only change the included XSL, no changes appear.
So, lacking anything else, we could always write an XQuery that basically touches all the main templates after a change is made to the include template. That seems so wrong as a workaround.
Update #3
So the workaround we are currently using: we have an unused "variable" in each XSL (version), and when we update the shared template, we execute a query that updates the value of that variable. At least it's only one XQuery; maybe we should attach it to a trigger.
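For reference, a minimal sketch of such a touch query (the collection path and variable name follow the setup above, but are otherwise hypothetical), assuming eXist-db's XQuery update extension:
xquery version "1.0";
declare namespace xsl = "http://www.w3.org/1999/XSL/Transform";
(: Bump the unused "version" variable in every main XSL so their compiled templates are invalidated. :)
for $v in collection("/db/Customer/data/edit/xsl")//xsl:variable[@name = "version"]
return update value $v with string(current-dateTime())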
There is a setting in $exist-db-root$/conf.xml for the XSL transformer where you can turn off caching (the default is "yes"):
<transformer class="net.sf.saxon.TransformerFactoryImpl" caching="no">

Regex or XPath for extracting nodes?

I have an XML file with the following structure;
<JobList>
  <Job><subnodes/></Job>
  <Job><subnodes/></Job>
</JobList>
This XML can sometimes be broken, leaving a missing closing </JobList> tag and a missing closing </Job> tag.
I would like to be able to extract, with their full content, those <Job> nodes that are properly closed with </Job>. What is the best way to do this?
To make a long story short, I am using .NET and the built-in serializers for deserializing XML content. But since new properties get added, you cannot just go back and forth between different versions, as it is too strict. Mostly it works, but I would like to have a backup recovery method for this; hence the question.
The current situation is that the deserializer "crashes" the whole deserialization when a new property has been added, instead of ignoring it. I am looking to parse it manually on error.
As mentioned in the comments, the ideal would be to make the XML valid; if for whatever reason that is not possible, the workaround is to parse the file as text with a regex.
A general regex for this case could be something like:
<Job>((?!<Job>).)*?</Job>
This will match anything between a complete <Job>...</Job> pair (run it in singleline/dotall mode so the dot also matches newlines).
Please notice that this will also return nodes with 'broken' inner nodes, but according to your question you are only concerned about missing closing tags.
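As a rough illustration of applying it from .NET (per the question's context; the brokenXml variable is hypothetical), note that RegexOptions.Singleline is what makes the dot match newlines:
using System;
using System.Text.RegularExpressions;

// brokenXml holds the raw (possibly truncated) XML text.
foreach (Match job in Regex.Matches(brokenXml, @"<Job>((?!<Job>).)*?</Job>", RegexOptions.Singleline))
{
    Console.WriteLine(job.Value); // one complete <Job>...</Job> fragment per match
}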

Multi language website, how to approach this?

I have a website (ColdFusion) on which I want to offer multiple languages, but I have no idea what the best way to do this is.
I have 2 plans:
1:
Of course all content (text) is in a database.
If a user would want a different language, the user would click on a link/flag, this would put the requested language in a session variable, for example: session.language = "es"
In the database I would have 2 columns (1 column per language) and then select the text which belongs to 'es'.
Every page would then do a request to the database to get the text belonging to session.language.
PROS: Relatively simple to implement
CONS: SEO-wise I don't think this could be very good. http://www.domain.com/page.cfm would give an English text or a Spanish text (or another language), and Google will not add duplicate URLs.
2:
Do something with http://www.domain.com/en/page.cfm for English and http://www.domain.com/es/page.cfm for Spanish.
With a URL rewrite rule, the language value in the URL http://www.domain.com/en/page.cfm would actually map to the page http://www.domain.com/page.cfm?language=en (see the sketch after this list).
The url.language variable will then select the correct language from the database.
PROS: Unique URL for each language. Good for SEO and Google indexing.
CONS: A bit more difficult to implement. (I think)
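For what it's worth, a minimal sketch of such a rewrite rule (assuming Apache mod_rewrite in a .htaccess file; IIS URL Rewrite has an equivalent):
RewriteEngine On
# Map /en/page.cfm and /es/page.cfm onto /page.cfm?language=...
RewriteRule ^(en|es)/(.+)$ /$2?language=$1 [QSA,L]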
Or does anyone have other / better ideas?
Thanks!!
You should always first check the browser header "Accept-Language" for the default language(s) (the standards-correct way to do it), and offer links/flags (the intuitively right way) only as an alternative.
Doing it in a database doesn't seem very standard. Let's assume you would like to use the MVC (model-view-controller) architecture. Most software uses keys in the presentation layer (the view, e.g. HTML), and alongside the presentation layer you have language files (in Java, typically properties files) which are mapped simply by their filenames and can be modified by regular users without any special skills, such as professional translators with no computer background. Certainly you could put it in a database, but then it is just more work, and it moves the information out of the presentation layer.
There are various libraries for doing this; you should find the normal one for your platform. Please edit your question to include what you are using to develop the application (e.g. JSP, Tapestry, Wicket, ASP, PHP, etc.). For example, if you wanted to use JSPs, I would suggest the JSTL tag library's language support; if you were using Tapestry, I would point you to http://tapestry.apache.org/localization.html or http://tapestry.apache.org/tapestry4.1/UsersGuide/localization.html
To look it up, search for the terms "internationalization" (aka "i18n") or "localization". (The terms don't mean the same thing, but few use them correctly, so either works: http://www.w3.org/International/questions/qa-i18n)
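A minimal CFML sketch of reading that header (naive parsing that ignores quality values; variable names hypothetical):
<cfscript>
    // e.g. "es-ES,es;q=0.9,en;q=0.8" -> "es"
    headers = getHttpRequestData().headers;
    acceptLanguage = structKeyExists(headers, "Accept-Language") ? headers["Accept-Language"] : "en";
    session.language = lCase(left(listFirst(acceptLanguage, ",;-"), 2));
</cfscript>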
I would go for option 2. Every translation should have its own URL; links to your website will already be in the intended translation.
To store translations in a database, I wouldn't put every translation in a separate column, but rather put them in a separate table:
Table Posts:
- id
- title_id
- ...
Table Translations:
- label_id
- value
- country_code
- language_code
Where title_id matches label_id
This way you won't have to alter your table structure when a new translation is added. This allows you to have infinite translations for any label or text.
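A minimal sketch of the lookup against such a table (datasource and variable names hypothetical; country_code could be added to the WHERE clause the same way):
<cfquery name="getTranslation" datasource="myDSN">
    SELECT value
    FROM Translations
    WHERE label_id = <cfqueryparam value="#labelId#" cfsqltype="cf_sql_integer">
    AND language_code = <cfqueryparam value="#session.language#" cfsqltype="cf_sql_varchar">
</cfquery>
<cfoutput>#getTranslation.value#</cfoutput>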
To do a multi-lingual site effectively, you need to set a rule for yourself that NO TEXT is ever hard-coded in the source. It needs to come from the database and/or a resource bundle.
Text from the database
You need to make sure that the column you are storing your data in is Unicode, otherwise you'll have issues with accented characters. Also, don't have a column per language, as this is not scalable; do what @jan suggests and have a translations table where the items are keyed on a reference as well as a language.
Resource bundles
You are not going to want to fetch every last little bit of text from the database, so for those bits you can utilise a resource bundle. This is an, admittedly old, link http://www.sustainablegis.com/blog/cfg11n/index.cfm?mode=entry&entry=FD48909C-50FC-543B-1FE177C1B97E8CC1 from Paul Hastings's blog about some solutions for resource bundles. To be honest, his blog is an excellent resource on this very subject.
With regards to how you handle the URLs: do not do option 1, as you quite rightly identified that it will cause issues with the SEO rankings of the page, and it will mean that users cannot correctly share or return to the page.
There are two approaches. One is having the language code in the URL, as you identified in option 2.
Pros
Simpler to configure
Cons
You have one application, which means that as you add more languages you add more complexity and memory weight to that app
Or you can have a different subdomain or domain per language, e.g. es.yourdomain.com or yourdomain.es; they can all share the same codebase
Pros
Each language is a standalone application, meaning it has its own memory
Cons
More effort to configure
http://i18n.riaforge.org/ has a download for i18n. It can be used to make sure that all string labels match; that way, if someone wants to change "Save" to "Update", it can all be done in one spot.
It is also important to consider the technical background of those who will be doing the translation. It is often easier to get the translation team to edit files in Notepad than to update a DB, and text files work well with version control.
The best way I found is to use an XML file to hold just that page's language strings, one XML covering each page, which you then vary per language. When the page loads, just load a different XML from the database or from files; there are many ways to do this. All other methods I have tried have their issues, and at least this one allows you to take a language XML, hand it to someone who will copy it and change the values, and then you put it in the DB to serve it.
You can also do this for text, and have the DB build the XML for just the text of that page by using a list of items to include in the XML for the page.
Once you get the idea, the rest becomes very easy...
And given ColdFusion's way of accessing such data with dot notation, it is easy peasy to use.
Say you have "Load Images":
in English XML it may be <LoadIMGS>Load Images</LoadIMGS>
in Chinese it may be <LoadIMGS>加载图像</LoadIMGS>
or <LoadIMGS>Jiā zǎi túxiàng</LoadIMGS>
Regardless, in your CFM code you would just put #variablename.LoadIMGS# in that place. I would also suggest putting in the LoadIMGS tag the font size to adjust to when it is not the normal size; that way, when a translation is too large, you can shrink the font for it there.
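A rough sketch of loading such a page XML and outputting one label (the file path and variable names are hypothetical):
<cfset langDoc = xmlParse(fileRead(expandPath("lang/es/homepage.xml")))>
<cfset variablename = langDoc.xmlRoot>
<cfoutput>#variablename.LoadIMGS.xmlText#</cfoutput>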
enjoy!!!

List of tags not available in ColdFusion 9 script syntax?

I'm looking for a complete list of tags that are not available in ColdFusion 9 script syntax.
Example:
CFSetting: one example that is available for use in cfscript in Railo but not in CF9
CFDocument: I can't find this one so far.
Not an official list by any measure, but this is a list I presented to a private forum a while back, and it didn't receive too much correction (and those corrections have been integrated). It was in the context of what CF does and doesn't need to implement in order to claim 100% coverage in CFScript.
Summary of omissions:
These ones are significant omissions:
<cfcollection>
<cfexchangecalendar>
<cfexchangeconnection>
<cfexchangecontact>
<cfexchangefilter>
<cfexchangemail>
<cfexchangetask>
<cfexecute>
<cfindex>
<cfinvoke> (support for dynamic method names)
<cflogin>
<cfloginuser>
<cflogout>
<cfmodule>
<cfoutput> (implementation of query looping with grouping)
<cfparam> (fix the bug whereby enforced requiredness doesn't work, i.e. param name="foo";)
<cfsearch>
<cfsetting>
<cfwddx>
<cfzip>
<cfzipparam>
There’s a reasonable case for these ones to be implemented:
<cfassociate>
<cfcache>
<cfcontent>
<cfflush>
<cfhtmlhead>
<cfheader>
<cfntauthenticate>
<cfprint>
<cfschedule>
<cfsharepoint>
These ones... I’m ambivalent:
<cfgridupdate>
<cfinsert>
<cfobjectcache>
<cfregistry>
<cfreport>
<cfreportparam>
<cftimer>
<cfupdate>
We don’t need these ones at all, I think:
<cfajaximport>
<cfajaxproxy>
<cfapplet>
<cfcalendar>
<cfchart>
<cfchartdata>
<cfchartseries>
<cfcol>
<cfdiv>
<cfdocument>
<cfdocumentitem>
<cfdocumentsection>
<cffileupload>
<cfform>
<cfformgroup>
<cfformitem>
<cfgraph>
<cfgraphdata>
<cfgrid>
<cfgridcolumn>
<cfgridrow>
<cfinput>
<cflayout>
<cflayoutarea>
<cfmap>
<cfmapitem>
<cfmediaplayer>
<cfmenu>
<cfmenuitem>
<cfpod>
<cfpresentation>
<cfpresentationslide>
<cfpresenter>
<cfselect>
<cfsilent>
<cfslider>
<cfsprydataset>
<cftable>
<cftextarea>
<cftextinput>
<cftooltip>
<cftree>
<cftreeitem>
<cfwindow>
If there's anything here that you think ought to be included in CFScript, please raise an issue here - http://cfbugs.adobe.com/cfbugreport/flexbugui/cfbugtracker/main.html - and cross reference the issue number here.
HTH.
I would argue that there are no tags that are unavailable in script, as you can extend CF and write the missing bits using CFCs.
Thus, wrap your favourite missing tag in a CFC and call it using new.
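For example, a minimal sketch wrapping <cfsetting> (the component and method names are hypothetical):
<!--- Setting.cfc --->
<cfcomponent output="false">
    <cffunction name="requestTimeout" returntype="void" output="false">
        <cfargument name="seconds" type="numeric" required="true">
        <cfsetting requesttimeout="#arguments.seconds#">
    </cffunction>
</cfcomponent>
<!--- then, from cfscript: new Setting().requestTimeout(120); --->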
However, here is a list of what is supported
http://help.adobe.com/en_US/ColdFusion/9.0/Developing/WSe9cbe5cf462523a02805926a1237efcbfd5-7ffe.html

Cleansing string / input in ColdFusion 9

I have been working with ColdFusion 9 lately (my background is primarily PHP) and I am scratching my head trying to figure out how to clean/sanitize user-submitted strings.
I want to make them HTML-safe and eliminate any JavaScript or SQL injection, the usual.
I am hoping I've overlooked some kind of function that already comes with CF9.
Can someone point me in the proper direction?
Well, for SQL injection, you want to use CFQUERYPARAM.
As for sanitizing the input for XSS and the like, you can use the ScriptProtect attribute in CFAPPLICATION, though I've heard that doesn't work flawlessly. You could look at Portcullis or similar 3rd-party CFCs for better script protection if you prefer.
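For illustration, a minimal sketch of both suggestions (the application name, datasource, and table are hypothetical):
<cfapplication name="myApp" scriptProtect="all">

<cfquery name="getUser" datasource="myDSN">
    SELECT id, username
    FROM users
    WHERE username = <cfqueryparam value="#form.username#" cfsqltype="cf_sql_varchar">
</cfquery>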
This is an addition to Kyle's suggestions, not an alternative answer, but the comments panel is a bit rubbish for links.
Take a look at the ColdFusion string functions. You've got HTMLCodeFormat, HTMLEditFormat, JSStringFormat and URLEncodedFormat, all of which can help you with content posted from a form.
You can also try to use the regex functions to remove HTML tags, but it's never a precise science. This ColdFusion-based regex/HTML question should help there a bit.
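A quick sketch of those functions in use (the form field names are hypothetical):
<cfoutput>
    <p>#HTMLEditFormat(form.comment)#</p>
    <script>var msg = "#JSStringFormat(form.message)#";</script>
    <a href="search.cfm?q=#URLEncodedFormat(form.q)#">search again</a>
</cfoutput>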
You can also try to protect yourself from bots and known spammers using something like cfformprotect, which integrates Project Honeypot and Akismet protection amongst other tools into your forms.
You've got several options:
"Global Script Protection" Administrator setting, which applies a regular expression against post and get (i.e. FORM and URL) variables to strip out <script/>, <img/> and several other tags
Use isValid() to validate variables' data types (see my in depth answer on this one).
<cfqueryparam/>, which serves to create SQL bind parameters and validate the datatype passed to it.
That noted, if you are really trying to sanitize HTML, use Java, which ColdFusion can access natively. In particular, use the OWASP AntiSamy Project, which takes an HTML fragment and whitelists what values can be part of it. This is the same approach that sites like SO and slashdot.org use to protect submissions, and it is a more secure way of accepting markup content.
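A rough sketch of calling AntiSamy from CFML (assuming the AntiSamy jar is on the classpath; the policy file path is hypothetical):
<cfset policy = createObject("java", "org.owasp.validator.html.Policy").getInstance(expandPath("antisamy-slashdot.xml"))>
<cfset antiSamy = createObject("java", "org.owasp.validator.html.AntiSamy").init()>
<cfset cleanHtml = antiSamy.scan(form.content, policy).getCleanHTML()>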
Sanitization of strings, in ColdFusion as in almost any language, is very important and depends on what you want to do with the string. Most mitigations cover:
saving content to the database (e.g. <cfqueryparam ...>)
using content shown on the next page (e.g. putting a URL parameter in a link, or showing a URL parameter as text)
saving files and using uploaded filenames and content
There is always a risk if you follow the idea of allowing basically everything in the first step and then sanitizing the malicious code "away" by deleting or replacing characters (the blacklist approach).
The better solution, whenever possible, is to check strings with reReplace()/reFind() against regular expressions that explicitly allow only the characters needed for your scenario (the whitelist approach). Use cases are inputs for numbers, lists, email addresses, URLs, names, ZIP codes, cities, etc.
For example, if you want to ask for an email address, you could use:
<cfif reFindNoCase("^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}$", stringtosanitize)>...ok, clean...<cfelse>...not ok...</cfif>
(or your own regex).
For HTML input or CSS input, I would also recommend the OWASP Java HTML Sanitizer Project.