Save CSV data into the datastore whilst harvesting in CKAN

I'm building a custom harvester for importing data from an external site into CKAN (version 1.8).
It works pretty well and creates the metadata and the resources associated with it. I'd like to aggregate these resources, build a new CSV from them, and save it in the DataStore during the import phase of the harvest.
I know I could use the DataStore API, but I'd prefer not to go over HTTP (it makes no sense to me to hand an API key / user / URL / ... to a harvester that already has permission to add stuff).
Is it possible to call the DataStore API functions directly from the harvester?
https://github.com/okfn/ckan/blob/master/ckanext/datastore/logic/action.py
Every function takes a context parameter which is not documented.

You have a couple of distinct things you are doing here:
Converting CSV to appropriate python (or JSON) structure for insertion in the datastore
Inserting into the datastore
For the latter you can use either:
The logic actions (direct)
The DataStore API
The API just calls the logic actions (plus handles auth), so the two are pretty similar. The logic approach will likely be faster and can feel more natural if you are already writing code inside CKAN; that said, the API can be conceptually cleaner, because defined web APIs give you clear boundaries between your components.
For the former (i.e. conversion of CSV to JSON) I recommend the Data Converters library, especially the commas.py part, which converts to exactly the format you need. A full web service based on Data Converters is being developed, but it is not yet fully operational.
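For illustration, here is a minimal sketch of the direct logic-action route, assuming the harvester runs inside the same CKAN process. The file path, resource id and the exact contents of context are assumptions; context is the undocumented parameter you asked about, and it usually carries the model, the session and the acting user:

```python
# A minimal sketch: read the aggregated CSV with the stdlib csv module
# and call the datastore_create logic action directly, skipping HTTP.
# Resource id, file path and context contents are placeholders.
import csv

import ckan.logic as logic

def csv_to_datastore(context, resource_id, csv_path):
    # A typical context looks like:
    #   {'model': model, 'session': model.Session, 'user': u'harvest'}
    with open(csv_path) as f:
        records = list(csv.DictReader(f))  # one dict per row, keyed by header

    data_dict = {
        'resource_id': resource_id,  # the resource this data belongs to
        'records': records,          # datastore_create infers field types
    }
    return logic.get_action('datastore_create')(context, data_dict)
```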

I solved this by using ckanext-datastorer (for the DataStore) and ckanclient (for uploading the file).
ckanclient has a bug with CKAN 1.8: it doesn't handle redirects correctly. We worked around it with this quick and dirty patch: https://gist.github.com/mammadori/4945812
A better fix would be to drop urllib entirely and port the whole of ckanclient to requests instead.
Thanks for your support

Related

Updating all entities of KIND in Google Cloud Datastore

We have a dataset of ~10 million entities of a certain Kind in Datastore. We want to change the product's functionality, so we would like to change the fields on all entities of that Kind.
Is there a smart/quick way to do it, that does not involve iterating over all of the entities in series?
Dataflow can probably help you with your problem.
Dataflow is a stream and batch data processing service, fully managed by GCP.
It was open sourced as the Apache Beam project and is fully compatible with that SDK, which allows you to test your pipelines locally before running them on GCP.
It exposes two main concepts: a PCollection, basically the data being handled by the tool, and a pipeline, the sequence of steps that captures the data, performs the transformations, and writes the results where they belong.
It provides support for Java, Python and Go, a rich feature set, and a variety of possible data sources and transformations.
In the specific case of Datastore, Dataflow supports reading, writing and deleting data. See for instance the relevant documentation for Python.
You can see a good example of how to interact with Datastore in the Apache Beam GitHub repository.
These two other articles could also be interesting: 1 2.
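As a concrete illustration, here is a hedged sketch of such a pipeline with the Beam Python SDK; the project id, Kind name and the field change are all invented:

```python
# A hedged sketch of a Beam/Dataflow batch pipeline that rewrites every
# entity of one Kind in parallel (no serial iteration).
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.datastore.v1new.datastoreio import (
    ReadFromDatastore, WriteToDatastore)
from apache_beam.io.gcp.datastore.v1new.types import Query

PROJECT = 'my-gcp-project'  # placeholder project id

def update_entity(entity):
    # Illustrative transformation: rename one property on each entity.
    entity.properties['new_field'] = entity.properties.pop('old_field', None)
    return entity

with beam.Pipeline(options=PipelineOptions()) as pipeline:
    (pipeline
     | 'Read'   >> ReadFromDatastore(Query(kind='MyKind', project=PROJECT))
     | 'Update' >> beam.Map(update_entity)
     | 'Write'  >> WriteToDatastore(PROJECT))
```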
I would presume that you have to loop through each one and update it, since Datastore is a NoSQL data store like Mongo from what I can see. We have a system that uses SQL and Mongo, and the denormalised data is a pain; we had to write migrations that loop through everything and update it.

What is the "proper" way to use DynamoDB for an iOS app?

I've just started messing around with AWS DynamoDB in my iOS app and I have a few questions.
Currently, I have my app communicating directly with my DynamoDB database. I've been reading around lately, and people are saying this isn't the proper way to get data from the database.
By this I mean I just have a function in my code that queries the DynamoDB database and returns the result.
What I'm doing works, but is there a better way to go about this?
Amazon DynamoDB itself is a highly scalable service, and standing up another server in front of it means that server must also be scaled in line with the RCU/WCU configured for your tables, which you can and should avoid.
If your mobile application doesn't need a backend server and you can perform all the business functions from the mobile device, then you should probably think about
Using the AWS DynamoDB SDK for iOS devices to write your client application that runs on the mobile device
Using the AWS Token Vending Machine to authenticate your mobile users and grant them credentials for running operations on DynamoDB tables.
Controlling access (i.e. which operations are allowed on which tables, etc.) using IAM policies.
HTH.
From what you say, I guess you are asking about ways to distribute data to many clients (iOS apps).
There are a few integration patterns (a very good book on this: Enterprise Integration Patterns), one of which is called Shared Database. It is essentially about multiple clients sharing data through a common database. The main drawback of that pattern (in your case) is that every client makes assumptions about what the database schema looks like, which can cause headaches when you need to evolve the schema as your business logic changes.
The more advanced approach is to send events on every change in your data instead of writing changes to the database directly from the client apps. This way you can add extra processing to the events before the data they carry is written to the database. For example, you may change the event format in a new version of your app but still want to support legacy users, so you add a translation step that transforms both types of events into the format the database schema expects. It's basically a question of working with diffs vs. snapshots.
Be aware of the added complexity of working with events; it can be overkill if your app is simple and schema changes are unlikely.
Also consider that you can do this kind of preprocessing with DynamoDB Streams, which gives you some of the advantages of events while keeping the implementation simple; a sketch follows.
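For instance, stream records can be consumed by an AWS Lambda function written in Python; a minimal sketch, where the preprocessing step is an invented placeholder:

```python
# A minimal sketch of the DynamoDB Streams idea: a Lambda handler that
# receives change events and preprocesses them before propagating them.
def handler(event, context):
    for record in event['Records']:
        if record['eventName'] in ('INSERT', 'MODIFY'):
            # Attribute values arrive in DynamoDB's typed JSON format,
            # e.g. {'text': {'S': 'hello'}}.
            new_image = record['dynamodb'].get('NewImage', {})
            process(new_image)  # placeholder for your own preprocessing

def process(image):
    print(image)  # e.g. translate legacy event formats here
```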

Is it possible to use Django and Node.js?

I have a Django backend set up for user logins and user management, along with my entire set of templates, which render HTML for visitors to the site. However, I am trying to add real-time functionality to my site, and I found a perfect library in Node.js that allows two users to type in a text box and have the text appear on both their screens. Is it possible to merge the two backends?
It's absolutely possible (and sometimes extremely useful) to run multiple back-ends for different purposes. However it opens up a few cans of worms, depending on what kind of rigour your system is expected to have, who's in your team, etc:
State. You'll want session state to be shared between the different app servers. The easiest way to do this is to store session state externally in a framework-agnostic way. I'd suggest JSON objects in a key/value store (see the sketch after this list), and you'll probably benefit from JSON Schema.
Domains/routing. You'll need your login cookie to be available to both app servers, which means either a single domain routed by Apache/Nginx or separate subdomains routed via DNS. I'd suggest separate subdomains, for the following reason:
Websockets. I may be out of date, but to my knowledge neither Apache nor Nginx supports proxying of WebSockets, which means that if you want to use them you'll sacrifice the flexibility of using an HTTP server as an app proxy and instead expose Node directly via a subdomain.
Non-specified requirements. Things like monitoring, logging, error notification, build systems, testing, continuous integration/deployment, documentation, etc. all need to be extended to support a new type of component
Skills. You'll have to pay in time or money for the skill-sets required to manage a more complex application architecture
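Regarding point 1, a hedged sketch of what framework-agnostic session state could look like, with JSON blobs in Redis; key naming, TTL and fields are all assumptions:

```python
# Both Django and Node.js read and write the same plain-JSON blob in a
# key/value store, so neither framework's session format leaks into the
# other.
import json

import redis  # pip install redis

store = redis.Redis(host='localhost', port=6379)

def save_session(session_id, data):
    # Plain JSON with a one-hour TTL, parseable by any backend/language.
    store.setex('session:' + session_id, 3600, json.dumps(data))

def load_session(session_id):
    raw = store.get('session:' + session_id)
    return json.loads(raw) if raw else None
```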
So, my advice would be to think very carefully about whether you need this. There can be a lot of time and thought involved.
Update: There are actually companies springing around who specialise in adding real-time to existing sites. I'm not going to name any names, but if you look for 'real-time' on the add-on marketplace for hosting platforms (e.g. Heroku) then you'll find them.
Update 2: Nginx now has support for Websockets
You can't merge them. You can, however, send messages from Django to Node.js through some queue system such as Redis; a sketch follows.
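A hedged sketch of the Django side publishing messages to Redis; the Node.js process would subscribe to the same channel with any Redis client. The channel name and payload shape are assumptions:

```python
# Django publishes JSON messages to a Redis channel; Node.js subscribes
# to the same channel and pushes updates to connected browsers.
import json

import redis  # pip install redis

r = redis.Redis(host='localhost', port=6379)

def notify_node(event_type, payload):
    # Publish a JSON message for the Node.js subscriber to parse.
    r.publish('realtime', json.dumps({'type': event_type, 'data': payload}))

# e.g. called from a Django view after a user types:
notify_node('text_update', {'user': 'alice', 'text': 'hello'})
```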
If you really want to use two backends, you could use a database that is supported by both, though I would not recommend it.

REST and web services - having trouble understanding them

Well, the title more or less says it all. I sort of understand what REST is: the use of existing HTTP methods (POST, GET, etc.) to facilitate the creation and use of web services. I'm more confused about what defines a web service and how REST is actually used to make/expose one.
For example, Twitter, from what I've read, is RESTful. What does that actually mean? How are the HTTP methods invoked? When I write a tweet, how is REST involved, and how is it any different from simply using a server-side language and storing that text in a database or file?
This concept is also a bit vague to me, but after looking at your question I decided to clarify it a bit more for myself.
Please refer to this link on MSDN, and this one.
Basically, it is about using HTTP methods (GET/POST/DELETE) to define which operations an application exposes on its resources.
For example:
Say you have the URL:
http://Mysite.com/Videos/21
where 21 is the id of a video.
We can then define which methods are allowed on this URL: GET for retrieving the resource, POST for updating/creating it, DELETE for deleting it.
Generally, it seems like an organized and clean way to expose your application's resources and the operations supported on them, using HTTP methods.
You may want to start with this excellent introductory write-up. It covers it all, from start to end.
A RESTful architecture provides a unique URL for each individual resource and infers the appropriate action for that URL from the HTTP verb.
The best way I can think to explain it is to treat each data model in your application as a unique resource; REST routes requests without a query string at the end of the URL, e.g. instead of /posts?id=1 you'd just use /posts/1.
It's easier to understand through an example. The following is the REST architecture enforced by Rails, but it gives you a good idea of the context.
GET /tweets/1 ⇒ get the tweet with the id of 1
GET /tweets/1/edit ⇒ go to the edit action associated with the tweet with an id of 1
PUT /tweets/1 ⇒ PUT says to update this tweet, not fetch it
POST /tweets ⇒ POST says "I've got a new one, add it to the DB"; no id is supplied because there isn't one until the tweet is saved
DELETE /tweets/1 ⇒ delete it from the DB
Resources are often nested, though, so in Twitter it might be like:
GET /users/1/jedschneider/1 ⇒ users have many tweets; get the tweet with id 1 belonging to the user jedschneider
The architecture for implementing REST is unique to each application, with some frameworks (like Rails) supporting it by default.
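For illustration only, the same verb/URL pairs exercised with Python's requests library against a hypothetical API; every URL and payload here is invented:

```python
# Each call pairs an HTTP verb with a resource URL, mirroring the
# Rails-style routes listed above.
import requests

BASE = 'http://example.com'  # hypothetical API root

requests.get(BASE + '/tweets/1')                           # fetch tweet 1
requests.post(BASE + '/tweets', json={'text': 'hi'})       # create; server assigns the id
requests.put(BASE + '/tweets/1', json={'text': 'edited'})  # update tweet 1
requests.delete(BASE + '/tweets/1')                        # delete tweet 1
```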
You're struggling because there are two relatively different understandings of the term "REST". I've attempted to answer this earlier, but suffice to say: Twitter's API isn't RESTful in the strict sense, and neither is Facebook's.
sTodorov's answer shows the common misunderstanding that REST is about using all four HTTP verbs and assigning different URIs to resources (usually with documentation of what all the URIs are). When Twitter invokes REST, they are merely doing just this, along with most other "RESTful" APIs.
But this so-called REST is no different from RPC, except that RPC (with IDLs or WSDLs) might introduce code-generation facilities, at the cost of higher coupling.
REST is actually not RPC. It's an architecture for hypermedia-based distributed systems, which might not fit the bill for everyone making an API. In the linked MSDN article, the hypermedia kicks in when they talk about <Bookmarks>http://contoso.com/bookmarkservice/skonnard</Bookmarks>; the section ends with this sentence:
These representations make it possible to navigate between different types of resources
which is the core principle that most "RESTful" APIs violate. The article doesn't state how to document a RESTful API; if it did, it would be much clearer that clients have to navigate links in order to do things (RESTful) rather than being handed a lot of URI templates (RPCish).
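To make the contrast concrete, a hedged sketch of a link-navigating client; all URLs and field names here are invented:

```python
# The client knows only the entry point and discovers everything else
# from the representations it receives, instead of hard-coding URI
# templates.
import requests

root = requests.get('http://example.com/api').json()
# Suppose: root == {'links': {'tweets': 'http://example.com/api/tweets'}}
tweets = requests.get(root['links']['tweets']).json()

# Each tweet representation would in turn carry links to related
# resources (author, replies, ...) that the client follows, rather than
# constructing URLs from documented templates.
```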

XCelsius using ZOHO webservice

I need to make a dashboard application using data from http://www.projects.zoho.com
It is a project management site.
ZOHO provides data about projects by APIs available at http://www.zoho.com/projects/developers/projects-api.html
So can I use Xcelsius Engage to make my dashboard?
Is it feasible and advisable?
Also, please tell me if any other tool like Xcelsius would be more suitable for me.
This shouldn't be a problem as long as the XML results returned aren't too complex. Unfortunately, Xcelsius has a hard time dealing with XML nested more than a few levels deep, so it is best to conform the data to a flat table structure. Depending on the complexity, you may or may not have to massage the data received from ZOHO before loading it into Xcelsius; a sketch of that kind of massaging follows.
You also need to be mindful of Flash domain security practices (e.g. the crossdomain.xml policy) if you are not already aware of them.
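For illustration, a hedged sketch of flattening a nested response in Python; the <project>/<task> element names are invented, and ZOHO's real schema will differ:

```python
# Flatten a nested XML response into the flat, table-like XML that
# Xcelsius copes with best: one <row> per leaf record.
import xml.etree.ElementTree as ET

nested = """
<projects>
  <project id="1" name="Alpha">
    <task id="10" status="open"/>
    <task id="11" status="closed"/>
  </project>
</projects>
"""

rows = ET.Element('rows')
for project in ET.fromstring(nested).findall('project'):
    for task in project.findall('task'):
        row = ET.SubElement(rows, 'row')       # one flat row per task
        row.set('project', project.get('name'))
        row.set('task', task.get('id'))
        row.set('status', task.get('status'))

print(ET.tostring(rows).decode())
```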