Spark: best practice to use a web-service from a spark app - web-services

I want to add some geographical information to a dataframe. i.a. for thousands of row, I want to call a web-service, with latitude and longitude as parameter, the WS returning the value of a new field. I should use a UDF to do it.
However, I feel there will be a big performance issue; the web-service may be over-used and slow-down the whole application.
Are there best practices regarding the use of a WS from a spark-application?

Related

Can Datomic simplify querying data contained in dynamically accessed HTML documents?

I need to write an API which would provide access to data being served as HTML documents from a web server. I need for my users to be able to perform queries over the data.
Say on a web site there is a page which lists items and their owners. Then there is additional set of profile pages for owners which for each owner provide information about their reputation. An example query I may need to answer is "Give me ID's and owners of all items submitted in 2013 whose owners have reputation of at least 10".
Given a query to answer, I need to be able to screen scrape only the parts of the web site I need for answering the query at hand. And ideally cache the obtained information for future use with new queries.
I have no problem writing the screen scraping part, but I am struggling with designing the storage/query/cache part. Is there something about Clojure/Datomic that makes it an especially suitable technology choice for this kind of processing of data? I have been pointed in this direction before.
It seems a nice challenge but not sure about a few things: a) would you like to expose to your users a Datalog query box and so make them learn datalog-like syntax? b) what exact kind of results do you wish to cache, raw DB responses, html fomatted text, json ?
Anyway I suggest you to install and play a little bit with the Datomic console to get a grasp if you didn't before as it seems to me the more close idea to what you want to achieve atm https://www.youtube.com/watch?v=jyuBnl0XQ6s http://blog.datomic.com/2013/10/datomic-console.html
For the API I suggest you to use http://clojure-liberator.github.io/liberator/ as it provides sane defaults to implement REST services and let you focus on your app behaviour

Does Biztalk Server support data exchange without use of web services

As I have very little knowledge on how ESB's work in tandem with database I'm asking a question regarding how communication can take place between the two hoping I'll atleast be pointed in the right direction to search in!
SITUATION : We have two systems(one of them is the client's) on different networks which have their own databases. We are required to do a regular real-time data exchange of all points present in our database with the other. We are also required to have a provision to be abel to import data into our system. This exchange has to follow SOA functionality over customer provided Biztalk ESB.We are supposed to provide the exchange by the use of ODBC.
Question: My query is whether it is possible to integrate the databases to the ESB as some endpoints without making any use of WEBSERVICES or extra interfaces, and send the data over the ESB as a pull-push transfer mechanism?
I have tried searching the net for this situation but have not come up with a lot of straightforward answers. Could someone please point me in the right direction.
ESB Toolkit in BizTalk is not an ESB! It is just small additional tool for some special cases.
Let's stop talk about the ESB, we need to solve the technical problem, right?
As I can understand you have two SQL databases and want to integrate them.
To do so with BizTalk the easiest way is to use the WCF-SQL ports/adapters.
You start the Wizards for this adapter, choose the tables/sp-s which should provide data/consume data, the Wizard will generate all needed Xml schemas for you.
Then you will use BizTalk Mapper to create the Xslt maps, which will transfer one SQL data format to another.
They you will create a pair of ports. One will consume data from one SQL database, the second will insert data to another SQL database. One of this port will use the mentioned above Xslt map.
If you need more processing, you could create and orchestration to manage additional processing, sophisticated error handling, etc.
I would recommend using MSMQ. There's a fairly detailed description of it here

Simple ETL: Smooks or ETL product

I am fairly new to the subject and doing some research.
I have an ESB (using WSO2 ESB) and want to extract master data from the passing messages (like Customers, Orders, etc) and store them in DB to keep as a reference data. Source data is in XML coming from web services.
So there needs to be a component that will be able to maintain master data: insert new objects, delete old and update changed (would be also nice to have data events so ESB can route data accordingly).Basically, the logic will be similar for any entity type and it might be good idea to autogenerate it for all new entity types...
Options as I see them now:
Use Smooks with either SQLExecutor or Hibernate for persistence with all matching logic written either in smooks config or in DAO annotations
Use some open source ETL tool (like Talend, Kettle, Clover, etc). So the data will be passed to the ETL and all transformation logic is defined there. Also could accommodate future scenarios when they appear or can be an overkill..
.
Would appreciate if you share your thoughts and point me to the right direction.
You'd better to leave your database part to another tool.
If you have a fair amount of database interactions in your message flow, you can expect serious decreases in your performance.
However you do not need an ETL for the use case you explained. You can simply do it using WSO2 DSS by creating services to insert or update your data inside the database.
We have been using this for message logging purposes (inside DB) beside the ESB and are happy with that. It's better to use it as non-blocking fire-and-forget web services in your message flow within ESB. Hope this helps.

Finding latitude and longitude of many places once

I have a long list of towns and cities, and I'd like to add latitude and longitude information to each of them.
Does anyone know the easiest way to generate this information once?
See also Geocode multiple addresses
The first part of the third video shows how to get latitude and Longitude using Google Refine and geocoding. No need to write a new script. Ideal for doing this kind of change once.
http://code.google.com/p/google-refine/
Or use www.geonames.org - there's language APIs for that. Or Open Street Map's Nominatim: http://wiki.openstreetmap.org/wiki/Nominatim - google have slightly more restrictive terms of service.
You can use the Google Geocoding API. Check the API at this URL: http://code.google.com/apis/maps/documentation/geocoding/
What follows next is writing some code. I am doing something similar in C# and it is quite easy here.
Most geocoding services can handle queries with only administrative names which is what you're after, e.g., municipality and region. So I'd choose one you like that also handles batch or bulk requests, e.g., the Bing Spatial Data API (here's an article on batch geocoding with it.)
An alternative approach that might be useful if you're on a budget and have a lot of these to do would be to download the Geonames database and write a bit of code to import it into your database or index it; then query it however and how often you like, e.g., if you put your places in another table you could SELECT [...] FROM my_places LEFT JOIN geonames [...]. I used to import Geonames DB into a vanilla PostgreSQL nightly and probably still have the code in a git repo somewhere if that's a route you want to try (comment and I'll find it and attach.)
For a service that uses google, which I find most accurate.
Look at http://www.torchproducts.com/tools/geocode

XCelsius using ZOHO webservice

I need to make a dashboard application using data from http://www.projects.zoho.com
It is a project management site.
ZOHO provides data about projects by APIs available at http://www.zoho.com/projects/developers/projects-api.html
So can I use XCelsius engage to make my dashboard?
Is it feasible & advisable?
Also tell me if any other tool like XCelsius is more suitable for me....
expecting satisfactory answers....
This shouldn't be a problem as long as the results returned via XML aren't too complex. Unfortunately Xcelsius has a hard time dealing with nested XML tags of more than a few levels so it is best to ensure that you try to conform data to a table structure. Taking this into consideration depending on complexity you may or may not have to massage data received from ZOHO prior to loading it into Xcelsius.
You also need to be mindful of flash domain security practices if you are not already aware of them.