How can I query Zenodo's related identifiers fields? - zenodo

The Zenodo open data repository offers a web-based query interface with a sophisticated query language. However, I can't get queries for related identifiers (e.g. a data set that supplements a GitHub repository) to return anything.
For example, for this Unix history repository Zenodo data set, the queries for the corresponding GitHub repository
(related_identifiers.identifier:"https://github.com/dspinellis/unix-history-repo") and the DOI of the publication that documents it
(related_identifiers.identifier:"10.1007/s10664-016-9445-5") return no results. Even simpler queries, such as for data sets whose related identifier is a DOI (related_identifiers.scheme:doi) or for data sets associated through a supplement relationship (related_identifiers.relation:isSupplementTo), fail to return any results. Other queries, such as for data sets with restricted access rights (accessrights:restricted) or those by a specific creator (creators.orcid:0000-0003-4231-1897), work fine.

It seems that this is a limitation of the current query interface. The query language does not support queries for nested fields, such as the related identifiers. The provided documentation was misleading and has been corrected through this pull request. In addition, a Zenodo developer commented that the corresponding identifiers can be accessed through the related.identifier keyword, for example related.identifier:"10.1109/TSE.2019.2892149".
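As a rough illustration of the related.identifier keyword, the same query string can also be placed in a search URL rather than typed into the web form. The sketch below only builds the URL and makes no network request; the endpoint https://zenodo.org/api/records and its q parameter are assumptions about Zenodo's public REST API, not something stated in the answer above, and the helper name is made up for this example.

```python
from urllib.parse import urlencode

# Assumed search endpoint of the Zenodo REST API (not named in the answer).
ZENODO_SEARCH = "https://zenodo.org/api/records"


def related_identifier_url(identifier):
    """Build a search URL for records with the given related identifier."""
    # Double quotes keep ':' and '/' inside DOIs or URLs from being read
    # as query-language syntax.
    query = f'related.identifier:"{identifier}"'
    # urlencode percent-encodes the query so it is safe inside a URL.
    return f"{ZENODO_SEARCH}?{urlencode({'q': query})}"


url = related_identifier_url("10.1109/TSE.2019.2892149")
print(url)
```

Fetching that URL with any HTTP client should then return the matching records, provided the endpoint behaves as assumed.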

Related

C++ PostgreSQL Library, describe SQL, without execution

I have asked this question on GitHub, issue #641, since it seems like a deficiency or an enhancement question.
Most databases (e.g. Oracle, SQL Server, Sybase, ...) send back a description of the resultset, even if there are zero rows. I cannot seem to find this in pqxx, nor can I find a way to describe SQL of any kind prior to execution. The only way I have been able to get the SQL metadata is by collecting it from the first row retrieved, but this only works when a row is actually retrieved. If there are zero rows, I have no way to determine the metadata.
Looking at the v15.x code base for libpq (standard C library), I see that psql has a "workaround" for this very problem. In the file src/bin/psql/common.c there is a function called static bool DescribeQuery(const char *query, double *elapsed_msec) (on or about line #1248).
This function has a rather ingenious solution for this very problem. It creates an "unnamed" prepared statement, which it does not execute; instead it uses the result of that call, which provides the field info, and subsequently queries pg_catalog for the metadata descriptions.
Unfortunately, the pqxx library doesn't provide enough (that I can tell) to duplicate this functionality. Does anyone have a solution for this problem?
Credit for the resolution goes to Jeroen Vermeulen, on GitHub issue #641. He suggested adding the predicate AND false to the query to force a prototype result. This returns an empty resultset whose fields are still available for metadata collection.
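The idea of an always-false predicate can be sketched independently of pqxx. The example below swaps in Python's built-in sqlite3 module and an in-memory database purely for illustration, so it is not the pqxx API; it also wraps the query in a subselect with WHERE 1 = 0 instead of appending AND false, which is a variant chosen here because it works even when the original query has no WHERE clause. The table and function names are hypothetical.

```python
import sqlite3


def describe_columns(conn, query):
    """Return the column names of `query` without fetching any rows."""
    # The always-false predicate guarantees an empty result, but the
    # driver still reports the result's column metadata.
    cur = conn.execute(f"SELECT * FROM ({query}) AS q WHERE 1 = 0")
    return [col[0] for col in cur.description]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER, name TEXT)")  # empty table

columns = describe_columns(conn, "SELECT id, name FROM employee")
print(columns)
```

Because the inner query is interpolated as text, a helper like this should only ever be given trusted SQL.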

How exactly does django 'escape' sql injections with querysets?

After reading a response about the protection Django provides against SQL injection, I am wondering what the docs mean by 'the underlying driver escapes the sql'.
Does this mean, for lack of a better word, that the 'database driver' checks the view (or wherever the queryset is located) for characteristics of the query, and denies certain 'characteristics' of queries?
I understand that this is a rather 'low-level' discussion, but I don't understand how the underlying mechanisms prevent this attack, and would appreciate a simplified explanation of what is occurring here.
Link to docs
To be precise, we are dealing here with parameter escaping.
Django itself does not escape parameter values. It uses the API of the driver, which in general looks similar to this (see for example the driver for postgres or mysql):
# DB-API style call: the SQL text and the parameter values travel separately
cursor.execute(
    'select field1 from table_a where field2 = %(field2)s',
    {'field2': 'some value'},
)
The important thing to note here is that the parameter value (which may be provided by the user and is therefore a potential vector for SQL injection) is not embedded into the query itself. The query is passed to the driver with placeholders for the parameter values, and the list or dict of parameters is passed alongside it.
The driver can then either construct the SQL query with properly escaped parameter values or use the API provided by the database itself, which is similar in functionality (that is, it takes a query with placeholders plus the parameter values).
Django querysets use this approach to generate SQL, and that is what this piece of documentation is trying to say.
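The difference between embedding a value in the SQL text and passing it as a parameter can be seen in a small experiment. This is only a sketch using Python's built-in sqlite3 driver rather than Django or the postgres/mysql drivers mentioned above; sqlite3 uses :name placeholders instead of %(name)s, and the table contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_a (field1 TEXT, field2 TEXT)")
conn.execute("INSERT INTO table_a VALUES ('secret', 'some value')")

# A value an attacker might submit through a form.
malicious = "x' OR '1'='1"

# Unsafe: the value becomes part of the SQL text, so its quote characters
# change the structure of the WHERE clause.
unsafe_sql = f"SELECT field1 FROM table_a WHERE field2 = '{malicious}'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safe: the SQL text contains only a placeholder; the value is handed to
# the driver separately and is compared as a plain string.
safe_rows = conn.execute(
    "SELECT field1 FROM table_a WHERE field2 = :field2",
    {"field2": malicious},
).fetchall()

print(unsafe_rows, safe_rows)
```

The string-built query matches every row, while the parameterized one matches none, because no row has that literal text in field2.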

Django and Dynamic Example Data

I'm trying to find a way to easily generate an example/demonstration data set from initial_data.json in Django.
Essentially, the fixtures and initial_data.json do exactly what I need, except that the dates are static....
My app uses dates to display/sort otherwise easily generated information (comments, scores etc) and I'd like to create a thorough data set in order to be able to demonstrate the app's functions to prospective clients; the problem arises with the dates. Even if I run syncdb (which automatically includes my initial_data.json), the dates are static, so all the information will relate to those specific dates, rather than to today. As time passes, that data will become less visible in the app and will therefore not fully demonstrate its abilities to potential clients.
Is there an easy way to update date information in initial_data.json so that dates remain relevant to the current real date and I can then run syncdb again with those new dates? (Assume that this is all on a local machine merely as a demonstration to clients... Not on a server, production or otherwise).
I hope this makes sense?!
You might be better off writing a function (maybe a management command) to generate some dummy data and save it to your (temporary?) database.
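One way such a generator function could look is sketched below, under the assumption that the demo data is a list of fixture entries in the usual model/pk/fields shape. The model label demo.comment and the field names are hypothetical and would need to match the real app; the function uses only the standard library, so it can be called from a management command or a standalone script.

```python
import json
from datetime import date, timedelta


def build_fixture(today=None, count=5):
    """Return fixture entries whose dates count back from `today`."""
    today = today or date.today()
    entries = []
    for i in range(count):
        entries.append({
            "model": "demo.comment",  # hypothetical app label and model
            "pk": i + 1,
            "fields": {
                "text": f"Example comment {i + 1}",
                # One entry per day, newest first, relative to today.
                "created": (today - timedelta(days=i)).isoformat(),
            },
        })
    return entries


# A fixed date is passed here only to make the output reproducible.
fixture = build_fixture(today=date(2024, 1, 10), count=3)
print(json.dumps(fixture, indent=2))
```

Writing the JSON output to initial_data.json before each syncdb run would keep the dates relative to the day of the demonstration.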
OK, my solution was to use django-mockups: https://github.com/sorl/django-mockups
It adds random data to your tables (all of them or only those specified by the user) while obeying the field types (text, email, url etc) and the max_length specified in those fields. It inserts Lorem Ipsum text, correctly formatted email addresses and so on.
Very easy to use, can be set to run through a cron job, or can be run manually as and when required. Perfect.

How to solve two REST problems: the interface document; loss of privacy in descriptive URLs

Coming from a lot of frustrating times with WSDL/Soap, I very much like the REST paradigm, but am trying to solve two basic problems in our application before moving over to REST. The first problem relates to the lack of an interface document. I think I finally see how to handle this situation: one can query one's way down from a top-level "/resources" resource using various GET, HEAD, and OPTIONS requests to find the one needed resource in the correct hypermedia format. Is this the idea? If so, the client need only be provided with a top-level resource URI: http://www.mywebservicesite.com/mywebservice/resources. The client will then have to do some searching and possibly keep track of what it is discovering, so that it can use the URIs again efficiently in future to do GETs, POSTs, PUTs, and DELETEs. Are there any thoughts on what should happen here?
The other problem is that we cannot use descriptive URLs like /resources/../customer/Madonna/phonenumber. We do have an implementation of opaque URLs we use in the context of a session, and I'm wondering how opaque URLs might be applied to REST. The general problem is how to keep domain-specific details out of URLs, and still benefit from what REST has to offer.
"The other problem is that we cannot use descriptive URLs like /resources/../customer/Madonna/phonenumber."
I think you've misunderstood the point of opaque URIs. The notion of opaque URIs is with respect to clients: a client shall not decipher a URI to guess anything of semantic meaning from it. So a service may well have URIs like /resources/.../customer/Madonna/phonenumber, and that's quite a good idea. The URIs should be treated as opaque by clients: they should not infer from the URI that it represents Madonna's phone number, or that Madonna is a customer of some sort. That knowledge can only be obtained by looking inside the resource itself, or perhaps by remembering where the URI was discovered.
Edit:
A consequence of this is that navigation should happen by links, not by deconstructing the URI. So if you see /resources/customer/Madonna/phonenumber (and it actually represents Customer Madonna's phone number) you should have links in that resource that point to the Madonna resource, e.g.
{
"phone_number" : "01-234-56",
"customer_URI": "/resources/customer/Madonna"
}
That's the only way to navigate from a phone number resource to a customer resource. An important aspect is that the server implementation might or might not have domain-specific information in the URI. The Madonna record might just as well live somewhere else: /resources/customers/byid/81496237. This is why clients should treat URIs as opaque.
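A client that navigates by links rather than by dissecting URIs might look roughly like the sketch below. The two dictionaries are stand-ins for a real HTTP server, and the function names are invented for this example; the point is only that the client code stays the same when the server moves the customer record to a different URI.

```python
from urllib.parse import urljoin


def get(server, uri):
    """Stand-in for an HTTP GET returning a parsed JSON representation."""
    return server[uri]


def customer_for_phone(server, phone_uri):
    """Navigate from a phone number resource to its customer resource."""
    phone = get(server, phone_uri)
    # Follow the link found in the representation; never cut pieces out
    # of phone_uri itself. urljoin resolves relative links.
    return get(server, urljoin(phone_uri, phone["customer_URI"]))


PHONE_URI = "/resources/customer/Madonna/phonenumber"

# Server with descriptive URIs.
server_a = {
    PHONE_URI: {"phone_number": "01-234-56",
                "customer_URI": "/resources/customer/Madonna"},
    "/resources/customer/Madonna": {"name": "Madonna"},
}

# Server that stores the customer record under an ID instead.
server_b = {
    PHONE_URI: {"phone_number": "01-234-56",
                "customer_URI": "/resources/customers/byid/81496237"},
    "/resources/customers/byid/81496237": {"name": "Madonna"},
}

print(customer_for_phone(server_a, PHONE_URI))
print(customer_for_phone(server_b, PHONE_URI))
```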
Edit 2:
Another question you have (in the comments) is how a client, which is required to have no knowledge of the server's URIs, is supposed to be able to find anything. The service can help clients find resources in the following ways:
1. Provide a search interface. This could be done by providing an OpenSearch description document, which tells clients how to search for items. An OpenSearch template can include several variables, and several endpoints, depending on what you're looking for. So if you have a "customer ID" that's unique, you could have the following template: /customers/byid/{proprietary:customerid}. The customerid element needs to be documented somewhere, inside the proprietary namespace. A client can then know how to use such a template.
2. Provide a custom form. This implies making a custom media type in which you explicitly define how (based on an instance of the document) a URI to a customer can be forged, e.g. <customers template="/customers/byid/{id}"/>. The documentation (for the media type) would have to state that the template attribute must be interpreted as a relative URI after substituting an actual customer ID for the string "{id}".
3. Provide links to all resources. Some sets of resources are small enough to enumerate, so you can simply make a link to each and every one of them, optionally including identifying information along with the links. This could also be done in a custom media type: <customer id="12345" href="/customer/byid/12345"/>.
It should be noted that #1 and #2 are two ways of saying the same thing: clients are allowed to create URIs if
- they haven't got the URI structure a priori, and
- a media type exists whose documentation states how URIs should be created.
This is much the same way as a web browser has no idea of any URI structure on the web, except for the rules laid out in the definition of HTML forms, to add a ? and then all the query parameters separated by &.
In theory, if you have a customer with id 12345, then you could actually dispense with the href, since you could plug the customer id 12345 into #1 or #2. It's more common to actually provide real links between resources, rather than always relying on lookup or search techniques.
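The template approach of #2 can be sketched as follows. It assumes the client has already fetched a document containing the template attribute, and that the media type's documentation defines "{id}" as the only placeholder; the base URL and function name are invented for the example.

```python
from urllib.parse import quote, urljoin


def expand_template(document_url, template, customer_id):
    """Fill in a server-supplied template and resolve it as a relative URI."""
    # The template string comes from the server's document, so the client
    # still hard-codes nothing about the URI layout.
    path = template.replace("{id}", quote(str(customer_id), safe=""))
    return urljoin(document_url, path)


# Value of the template attribute as it would appear in the fetched document.
template = "/customers/byid/{id}"

uri = expand_template("http://example.com/api/", template, 12345)
print(uri)
```

If the server later changes the template it hands out, the client produces the new URIs without any code change.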
I haven't really used web RPC systems (WSDL/Soap), but I think the 'interface document' is there mostly to allow client libraries to create the service API, right? If so, REST shouldn't need it, because the verbs are already defined and don't really need to be documented again.
AFAIUI, the REST way is to document the structure of each resource (usually encoded in XML or JSON). In that document, you'll also have to document the relationship between those resources. In my case, a resource is often a container of other resources (sometimes more than one type), therefore the structure doc specifies which field holds a list of URLs pointing to the contained resources. Ideally, only one unique resource will need a single, fixed (documented) URL. Everything else follows from there.
The URL 'style' is meaningless to the client, since it shouldn't 'construct' a URL. Every URL it needs should already be present in a resource field. That lets you change the URL structure without changing the client (that has saved me tons of time). Your URLs can be as opaque or as descriptive as you like. (Personally, I don't like text keys or slugs; my keys are all BIGINTs or UUIDs.)
I am currently building a REST "agent" that addresses the first part of your question. The agent offers a temporary bookmarking service. The client code that is interacting with the agent can request that a URL be bookmarked using some identifier. If the client code needs to retrieve that representation again, it simply asks the agent for the URL that corresponds to the saved bookmark and then navigates to that bookmark. Currently those bookmarks are not persisted, so they only last for the lifetime of the client application, but I have found it a useful mechanism for accessing commonly used resources. For example, the root representation provides a login link. I bookmark that link, and if the client ever receives a 401 then I can redirect to the "login" bookmark.
To address an issue you mentioned in a comment, the agent also has the ability to store retrieved representations in a dictionary. If it becomes necessary to aggregate and manipulate multiple representations at the same time, then I can simply request that the agent store the current representation in a dictionary under a key and then continue navigating to the next resource. Once the client has accumulated all the necessary representations, it can do what it needs to do.
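The answer does not show the agent's code, so the following is only a guess at the general shape of such a helper: an in-memory bookmark table, a dictionary of stored representations, and a rule that sends the client to the "login" bookmark on a 401. All class and method names are hypothetical.

```python
class Agent:
    """Hypothetical client-side agent; nothing here is persisted."""

    def __init__(self):
        self._bookmarks = {}        # identifier -> URL
        self._representations = {}  # key -> stored representation

    def bookmark(self, name, url):
        self._bookmarks[name] = url

    def url_for(self, name):
        return self._bookmarks[name]

    def store(self, key, representation):
        self._representations[key] = representation

    def stored(self):
        # Return a copy so callers cannot modify the agent's state.
        return dict(self._representations)

    def next_url(self, status, current_url):
        """Pick where to go after a response with the given status code."""
        if status == 401 and "login" in self._bookmarks:
            return self._bookmarks["login"]
        return current_url


agent = Agent()
agent.bookmark("login", "/session/login")  # link taken from the root document
agent.store("customer", {"name": "Madonna"})
agent.store("phone", {"phone_number": "01-234-56"})
print(agent.next_url(401, "/resources/customer/Madonna"))
```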

Detail question on REST URLs

This is one of those little detail (and possibly religious) questions. Let's assume we're constructing a REST architecture, and for definiteness let's assume the service needs three parameters, x, y, and z. Reading the various works about REST, it would seem that this should be expressed as a URI like
http://myservice.example.com/service/x/y/z
Having written a lot of CGIs in the past, it seems about as natural to express this
http://myservice.example.com/service?x=val,y=val,z=val
Is there any particular reason to prefer the all-slashes form?
The reason is small but here it is.
Cool URIs Don't Change.
The http://myservice.example.com/resource/x/y/z/ form makes a claim in front of God and everybody that this is the path to a specific resource.
Note that I changed the name. There may be a service involved, but the REST principle is that you're describing a specific web resource, named /x/y/z/.
The http://myservice.example.com/service?x=val,y=val,z=val form doesn't make as strong a claim. It says there's a piece of code named service that will try to do some sort of query. No guarantees.
Query parameters are rarely "cool". Take a look at the Google Chart API. Should that use a /full/path/notation for all of the fields? Would each URL be cool if it did?
Query parameters are useful. Optional fields can be omitted. New keys can be added to support new functionality. Over time, old fields can be deprecated and removed. Doing this is clumsier with a /path/notation.
Quoting from http://www.xml.com/pub/a/2004/08/11/rest.html
URI Opacity [BP]
The creator of a URI decides the encoding
of the URI, and users should not derive
metadata from the URI itself. URI opacity
only applies to the path of a URI. The
query string and fragment have special
meaning that can be understood by users.
There must be a shared vocabulary between
a service and its consumers.
This sounds like query strings are what you want.
One downside to query strings is that they are unordered. The GET ending with "?x=1&y=2" is a different URL from the one ending with "?y=2&x=1". This means the browser and any other intermediate systems will cache them as two separate entries, because caching is done based on the full URL. If this is a concern, then generate the query string in a well-defined order.
While constructing URIs, this is the principle I follow. I don't know whether it is perfectly acceptable in all cases.
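Generating the query string in a well-defined order can be as simple as sorting the parameters before the URL is built. A minimal sketch using only the standard library follows; alphabetical order is just one possible convention, and the function name is made up.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonicalize(url):
    """Return `url` with its query parameters sorted by name, then value."""
    parts = urlsplit(url)
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(parts._replace(query=urlencode(params)))


a = canonicalize("http://myservice.example.com/service?x=1&y=2")
b = canonicalize("http://myservice.example.com/service?y=2&x=1")
print(a)
print(b)
```

Both calls produce the same string, so a cache keyed on the full URL sees a single entry.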
Say for instance, that I have to get the details of an employee, then the URI will be of the form:
GET /employees/1/ and not GET /employees?id=1 since I treat every employee as a resource and the whole URI "employees/{id}" is used in identification of the resource.
On the other hand, if I have algorithmic operations that do not identify a specific resource as such, but merely require inputs to an algorithm which in turn identifies the resources, then I use query strings.
For instance GET /employees?empname='%Bob%'&maxResults=100 might give me all employees whose names have the word Bob in them, with the maximum results returned by the query limited to 100.
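When such a query URL is built in code, the wildcard characters need percent-encoding, since a literal % in a URL introduces an escape sequence. A small sketch with the standard library, reusing the parameter names from the example above (the quotes around the pattern are dropped here, which is an assumption about how the service parses the value):

```python
from urllib.parse import parse_qs, urlencode

# Search inputs: a name pattern with SQL-style wildcards and a result cap.
params = {"empname": "%Bob%", "maxResults": 100}

# urlencode escapes '%' as '%25' so the server receives the literal pattern.
search_url = "/employees?" + urlencode(params)
print(search_url)

# Decoding the query string recovers the original values as strings.
decoded = parse_qs(search_url.split("?", 1)[1])
print(decoded)
```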
Hope this answers your question
URIs are strictly split into a hierarchical part (the path) and a non-hierarchical part (the query), and both serve to identify the resource.
The URI spec itself (RFC 3986) clearly sets the path and the query portion of a URI as equal.
Section 3.3:
The path component contains data [...] that along with [the] query component
serves to identify a resource.
Section 3.4:
The query component contains [...] data that, along with
[...] the path component serves to identify a resource
So your choice in using x/y/z versus x=val&y=val&z=val has mainly to do if x, y or z are hierarchical in nature or if they're non-hierarchical, and if you can perceive them as always being hierarchical or non-hierarchical for the foreseeable future, along with any technical limitations you might be having on selecting one over the other.
But to answer your question, as others have noted: Neither is more RESTful than the other, since they both end up identifying a resource.
If the resource is the service, independent of parameters, it should be
http://myservice.example.com/service?x=val&y=val&z=val
This is a GET query. One of the principles behind REST is that you GET to read (but not modify!) the resource; you can POST to modify a resource & get a response; you can PUT to write to a resource; and you can DELETE to remove a resource.
If the resource specified by those parameters is a persistent resource, it needs a name. You could (if you organized your webservice this way) POST to http://myservice.example.com/service?x=val&y=val&z=val to create a particular instance of the service and have it return an ID to name this instance, e.g.
http://myservice.example.com/service/12312549
then use GET/POST/PUT/DELETE to interact with that instance.
First of all, defining URIs as part of your API violates a constraint of the REST architecture. You cannot do that and call your API RESTful.
Secondly, the reason query parameters are bad for non-query resource access is that they are generally not cached. It is also a violation of HTTP standards.
A URL with slashes like /x/y/z/ would impose a hierarchy and is not suited for the exact case of just passing three parameters.
If, like you said, x y z are indeed just parameters and the order is not important, it would be more RESTful to use semicolons:
http://myservice.example.com/service/x;y;z/
If your "service" however is just an algorithm that works the same with different parameters, there would also be nothing unRESTful with using ?x=val format.