Nutch or other framework to crawl webservices - web-services

I'm looking for a framework that I can use for the following scenario: I have 2 web services. I call the first service, which returns a JSON response. The JSON response contains some IDs, which I use to call the other service; then I merge the responses from both services and store the result in a DB. I want to call these services every day to update my DB.
What I found is Nutch, but it looks like it is a web crawler mostly for HTML pages. Is there any framework that I can use for the scenario above? I'm looking for a fault-tolerant, scalable Java framework.
Thanks!

You could use Nutch; it is not limited to HTML. If something can be accessed via a URL then Nutch will fetch it. However, you might need to implement some custom parsers and indexers to deal with your content.
Alternatively, storm-crawler would be both scalable and customisable. You might find it easier to learn than Nutch and more flexible. In your use case you could have one or more queues (e.g. RabbitMQ, AWS SQS, etc.) in front of SC. The seed URLs would be the ones to use on the first service, and you could have custom parse filters to generate the URLs for the second one. Finally, you'd have a bespoke indexing bolt sending the data to persist to the DB. There are loads of resources available for Storm that you could piggyback on.
HTH
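For a sense of scale, the fetch-merge-store loop itself is small; a crawler framework mainly adds scheduling, retries and distribution around it. Below is a minimal Python sketch of that loop, not Nutch or storm-crawler code. The name `sync_once`, the `items` table and the `{"items": [...]}` response shape are all assumptions made for illustration, and the two services are passed in as plain callables so the logic can be tried without a network.

```python
import json
import sqlite3
from typing import Callable


def sync_once(
    fetch_list: Callable[[], dict],
    fetch_detail: Callable[[str], dict],
    conn: sqlite3.Connection,
) -> int:
    """Fetch IDs from the first service, enrich each one from the second, upsert into the DB."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (id TEXT PRIMARY KEY, payload TEXT NOT NULL)"
    )
    stored = 0
    # Assumed response shape of the first service: {"items": [{"id": ...}, ...]}
    for summary in fetch_list()["items"]:
        item_id = str(summary["id"])
        try:
            detail = fetch_detail(item_id)
        except Exception as exc:
            # One failing ID should not abort the whole daily run.
            print(f"skipping {item_id}: {exc}")
            continue
        merged = {**summary, **detail}  # fields from the second service win on key clashes
        conn.execute(
            "INSERT OR REPLACE INTO items (id, payload) VALUES (?, ?)",
            (item_id, json.dumps(merged, sort_keys=True)),
        )
        stored += 1
    conn.commit()
    return stored
```

Running this once a day (cron or similar) covers the update requirement; a framework becomes worthwhile once the number of IDs or the failure handling outgrows a single process.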

Related

Send Complex data to Restful Web Service -- API design

I am trying to convert some of our SOAP-based web services to RESTful web services. In one of our existing SOAP-based web services, we pass in a RequestDTO and the web service returns a ResponseDTO. The RequestDTO and ResponseDTO are both complex Java classes, which contain other custom JavaBean classes inside. It is a "READ" operation, so it naturally maps to the "GET" REST operation. Converting the ResponseDTO into XML or JSON is no problem. But I am not sure how to express the RequestDTO in a RESTful API.
The URL is going to be quite long if I convert all the data in the RequestDTO into the query string. A RESTful web service is usually consumed by applications, so the browser URL length limitation does not really apply. But a short URL is still preferred in most cases.
Some attributes in the RequestDTO might have PHI sensitive information and I prefer not to put them in the URL.
One solution is to embed the request data in the request body, even though it is a GET operation. But based on my research, this approach is discouraged:
http://tech.groups.yahoo.com/group/rest-discuss/message/9962
So what is the alternative? What is the right way to design this?
I'm not exactly sure why you would need to pass the RequestDTO to make the REST call.
Normally you just do something like this:
GET /Resource/id
Now suppose the resource you want is a secondary resource. For example, you have a User and credit cards belonging to that user:
GET /User/{user_id}
GET /User/{user_id}/CreditCards/{credit_card_id}
And of course this can be nested however many times you want.
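As a small illustration of building such nested paths on the client, here is a Python sketch. The helper name `resource_path` is made up for this example; it simply alternates collection names and IDs and percent-encodes each segment so that an ID containing a slash cannot change the path structure.

```python
from urllib.parse import quote


def resource_path(*segments: object) -> str:
    """Join collection names and IDs into a nested path, percent-encoding each segment."""
    return "/" + "/".join(quote(str(segment), safe="") for segment in segments)


print(resource_path("User", 42))                    # path for one user
print(resource_path("User", 42, "CreditCards", 7))  # path for one of that user's cards
```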

calling web services from UNIX

I have a requirement to kick off a workflow in salesforce.com through a web service from a UNIX box. Can anyone suggest options or guidelines to achieve this?
I don't think you can just "kick off workflows". You'll have to perform an insert or update of records in Salesforce that will satisfy the workflow's entry criteria.
There's a Java tool called Data Loader for your basic data manipulation activities (you can download it from your own production org).
It can be scripted for scheduled runs, has a config file where you can store the user's password in a secure way, etc. Check out the PDF guide for more (the "Command Line Quick Start" chapter).
So I don't think you really need a webservice call...
Unless I misunderstood, and you're talking about calling a method of an Apex class that has the "webservice" keyword and that will somehow perform the updates?
In that case you'll need to download the WSDL file generated for this class (Setup->Develop->Classes) and, well, consume it in the language of your choice (Java, PHP, Python... this link will help, the steps aren't too different), then do your command line magic.
http://wiki.developerforce.com/page/Integration has tons of resources for you :)
Salesforce exposes its web services via SOAP (it also offers a REST API, but the steps below use SOAP). Start by asking them for the WSDL file.
Use this WSDL file to generate the Java code. After that, get the web service URL so that you can proceed with pulling your data.
This link may help you:
http://salesforce-walker.blogspot.in/2011/12/to-access-salesforce-data-from-java-we.html
Hope this helps

RESTful Dictionary Service?

This is only loosely programming related.
I wrote myself a shell script that extracts all acronyms from a text and writes them to a file. Now I would like to process that file to add the definitions.
My first Google hit suggested using curl and the dict:// URL scheme. However, I am behind a proxy, which does not seem to allow that.
Do any of you know a service that is similar to dict:// but is provided via HTTP?
Ideally it would be RESTful, since messing around with SOAP seems somewhat bloated for this task.
There are plenty of Dictionary API services listed on http://www.programmableweb.com
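Whichever HTTP service you end up with, the lookup step tends to look the same: extract the acronyms, then build one GET URL per term. Here is a rough Python sketch of that step. The regular expression is only a guess at what counts as an acronym, and both the base URL and the `q` parameter name are placeholders to replace with whatever the chosen API documents.

```python
import re
from typing import List
from urllib.parse import urlencode

# Two or more capital letters standing alone as a word (an assumed definition of "acronym").
ACRONYM = re.compile(r"\b[A-Z]{2,}\b")


def extract_acronyms(text: str) -> List[str]:
    """Return the distinct acronyms found in the text, sorted alphabetically."""
    return sorted(set(ACRONYM.findall(text)))


def lookup_url(base_url: str, term: str) -> str:
    """Build a GET URL for one term; the 'q' parameter name is a placeholder."""
    return f"{base_url}?{urlencode({'q': term})}"


for acronym in extract_acronyms("The HTTP and SOAP APIs use XML"):
    # dictionary.example is a placeholder host, not a real service.
    print(lookup_url("https://dictionary.example/define", acronym))
```

Because these are plain HTTP GET requests, they should pass through a proxy that blocks dict://.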

Fetching remote database info from a client application

What would be the preferred method of pulling content from a remote database?
I don't think that I would want to pull directly from the database for a number of reasons.
(Such as easily being able to change where it is fetching the info from and a current lack of access from outside the server.)
I've been thinking of using HTTP as a proxy to the database: basically, using some PHP to output raw text from the database, then grabbing the page and dumping it into a string for display.
I'm not exactly sure how I would go about doing that though. (Sockets?)
Right now I am building it around a blog/news type system. Though the content would expand in the future.
I've got a similar problem at the moment, and the approach I'm taking is to communicate from the client app with a database via a SOAP web service.
The beauty of this approach is that on the client side the networking involved consists of a standard HTTP request. Most platforms these days include an API to perform basic HTTP client functions. You'll then also need an XML parser to parse the returned SOAP data (or a JSON parser if your service returns JSON instead), but those are also readily available.
As a concrete example, a little about my particular project: It's an iPhone app communicating with an Oracle database. I use a web service to read data from the database and send the data to the app formatted in XML using SOAP. The app can use Apple's NSURLConnection API to perform the necessary HTTP request. The XML is then parsed using the NSXMLParser API.
While the above are pretty iPhone-specific (and are Objective-C based), I think the general message still applies: there are tools out there that will do most of the work for you. I can't think of an example of an HTTP API offhand, but for the XML parsing part of the equation there's Xerces, TinyXML, Expat...
HTH!
You might look at using AJAX (I recommend JSON instead of XML though). This is the technology underlying Google Maps.
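To make the "grab the page and dump it to a string" idea concrete, here is a minimal client-side sketch in Python using JSON. It assumes the server-side script prints a JSON array of objects with `title` and `content` fields; those field names and the helper names `fetch_text` and `parse_posts` are invented for this example. No raw sockets are needed, since the standard HTTP client does the work.

```python
import json
from typing import Dict, List
from urllib.request import urlopen


def fetch_text(url: str, timeout: float = 10.0) -> str:
    """Download the page body as a string using the standard HTTP client."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


def parse_posts(body: str) -> List[Dict[str, str]]:
    """Turn the JSON text emitted by the server-side script into a list of posts."""
    posts = json.loads(body)
    # Keep only the fields the client displays; extra columns are ignored.
    return [{"title": post["title"], "content": post["content"]} for post in posts]


# Example with a canned response instead of a live server:
sample = '[{"id": 1, "title": "Hello", "content": "First post"}]'
print(parse_posts(sample))
```

Since the client only knows a URL, the data source behind it can change later without touching the client.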

Calling REST web services from a classic asp page

I'd like to start moving our application business layers into a collection of REST web services. However, most of our Intranet has been built using Classic ASP and most of the developers where I work keep programming in Classic ASP. Ideally, then, for them to benefit from the advantages of a unique set of web APIs, it would have to be called from Classic ASP pages.
I haven't the slightest idea how to do that.
You could use a combination of jQuery and JSON calls to consume REST services from the client
or
if you need to interact with the REST services from the ASP layer you can use
MSXML2.ServerXMLHTTP
like:
Set HttpReq = Server.CreateObject("MSXML2.ServerXMLHTTP")
HttpReq.open "GET", "Rest_URI", False ' False = synchronous call
HttpReq.send
Response.Write HttpReq.responseText ' body returned by the REST service
@KP
You should actually use MSXML2.ServerXMLHTTP from ASP/server side applications. XMLHTTP should only be used client side because it uses WinInet which is not supported for use in server/service apps.
See http://support.microsoft.com/kb/290761, questions 3, 4 & 5 and
http://support.microsoft.com/kb/238425/.
This is quite important, otherwise you'll experience your web app hanging and all sorts of strange nonsense going on.
Here are a few articles describing how to call a web service from a Classic ASP page:
Integrating ASP.NET XML Web Services with 'Classic' ASP Applications
Consuming XML Web Services in Classic ASP
Consuming a WSDL Webservice from ASP
A number of the answers presented here appear to cover how ClassicASP can be used to consume web-services & REST calls.
In my opinion a tidier solution may be for your ClassicASP to just serve data in REST formats. Let your browser-based client code handle the 'mashup' if possible. You should be able to do this without incorporating any other ASP components.
So, here's how I would mock up shiny new REST support in ClassicASP:
Provide a single ASP web page that acts as a landing pad
The landing pad will handle two parameters: verb and URL, plus a set of form contents
Use some kind of switch block to inspect the URL and direct the verb (and form contents) to a relevant handler
The handler will then process the verb (PUT/POST/GET/DELETE) together with the form contents, returning a success/failure code plus data as appropriate.
Your landing pad will inspect the success/failure code and return the respective HTTP status plus any returned data
You would benefit from a support class that decodes/encodes the form data from/to JSON, since that will ease your client-side implementation (and potentially reduce the volume of data passed). See the conversation at "Any good libraries for parsing JSON in Classic ASP?"
Lastly, on the client side, provide a method that takes a verb, URL and data payload. In the short term the method will collate the parameters and forward them to your landing pad. In the longer term (once you switch away from Classic ASP) your method can send the data to the 'real' URL.
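The routing logic of the landing pad can be sketched in a few lines. The version below is in Python purely to show the shape of the idea; it is not Classic ASP, and the `/users` route, the handler and the chosen status codes are invented for illustration. In ASP the same structure would be a Select Case over the URL that calls one function per resource.

```python
from typing import Callable, Dict, Tuple

# A handler takes (verb, form) and returns (success flag, data).
Handler = Callable[[str, dict], Tuple[bool, dict]]


def users_handler(verb: str, form: dict) -> Tuple[bool, dict]:
    """Hypothetical handler for a /users resource."""
    if verb == "GET":
        return True, {"users": []}
    if verb == "POST" and "name" in form:
        return True, {"created": form["name"]}
    return False, {"error": "unsupported request"}


# The "switch block": maps a URL to the handler responsible for it.
ROUTES: Dict[str, Handler] = {"/users": users_handler}


def landing_pad(verb: str, url: str, form: dict) -> Tuple[int, dict]:
    """Dispatch to a handler and translate its success flag into an HTTP status."""
    handler = ROUTES.get(url)
    if handler is None:
        return 404, {"error": "unknown resource"}
    ok, data = handler(verb.upper(), form)
    return (200 if ok else 400), data


print(landing_pad("POST", "/users", {"name": "ann"}))
```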
Good luck...
Another possible solution is to write a .NET DLL that makes the calls and returns the results (maybe wrap something like RestSharp and give it a simple API customized to your needs). Then you register the DLL as a COM DLL and use it in your ASP code via the CreateObject method.
I've done this for things like creating signed JWTs and salting and hashing passwords. It works nicely (while you work like crazy to rewrite the ASP).
Another possibility is to use the WinHttp COM object (see "Using the WinHttpRequest COM Object").
WinHttp was designed to be used from server code.
All you need is an HTTP client. In .Net, WebRequest works well. For classic ASP, you will need a specific component like this one.