GATE Embedded runtime - web-services

I want to use "GATE" through web. Then I decide to create a SOAP web service in java with help of GATE Embedded.
But for the same document and saved Pipeline, I have a different run-time duration, when GATE Embedded runs as a java web service.
The same code has a constant run-time when it runs as a Java Application project.
In the web service, the run-time will be increasing after each execution until I get a Timeout error.
Does any one have this kind of experience?
This is my Code:
@WebService(serviceName = "GateWS")
public class GateWS {
    @WebMethod(operationName = "gateengineapi")
    public String gateengineapi(@WebParam(name = "PipelineNumber") String PipelineNumber,
                                @WebParam(name = "Documents") String Docs) throws Exception {
        try {
            System.setProperty("gate.home", "C:\\GATE\\");
            System.setProperty("shell.path", "C:\\cygwin2\\bin\\sh.exe");
            Gate.init();
            File GateHome = Gate.getGateHome();
            File FrenchGapp = new File(GateHome, PipelineNumber);
            CorpusController FrenchController;
            FrenchController = (CorpusController) PersistenceManager.loadObjectFromFile(FrenchGapp);
            Corpus corpus = Factory.newCorpus("BatchProcessApp Corpus");
            FrenchController.setCorpus(corpus);
            File docFile = new File(GateHome, Docs);
            Document doc = Factory.newDocument(docFile.toURI().toURL(), "utf-8");
            corpus.add(doc);
            FrenchController.execute();
            String docXMLString = doc.toXml();
            String outputFileName = doc.getName() + ".out.xml";
            File outputFile = new File(docFile.getParentFile(), outputFileName);
            FileOutputStream fos = new FileOutputStream(outputFile);
            BufferedOutputStream bos = new BufferedOutputStream(fos);
            OutputStreamWriter out = new OutputStreamWriter(bos, "utf-8");
            out.write(docXMLString);
            out.close();
            gate.Factory.deleteResource(doc);
            return outputFileName;
        } catch (Exception ex) {
            return "ERROR: -> " + ex.getMessage();
        }
    }
}
I really appreciate any help you can provide.

The problem is that you're loading a new instance of the pipeline for every request but never freeing it at the end of the request. GATE maintains an internal list of every PR/LR/controller that is loaded, so anything you load with Factory.createResource or PersistenceManager.loadObjectFrom... must be freed with Factory.deleteResource once it is no longer needed, typically using a try-finally:
FrenchController = (CorpusController) PersistenceManager.loadObjectFromFile(FrenchGapp);
try {
    // ...
} finally {
    Factory.deleteResource(FrenchController);
}
But...
Rather than loading a new instance of the pipeline every time, I would strongly recommend a more efficient approach: load a small number of pipeline instances up front and keep them in memory to serve multiple requests. There is a fully worked example of this technique in the training materials on the GATE wiki, in particular module number 8 (track 2 Thursday).
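As a rough illustration of that idea (this is a minimal sketch, not the code from the training materials; the class name, pool size, and method names are hypothetical), a small pool of controllers can be built once at startup and borrowed per request:

import java.io.File;
import java.net.URL;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import gate.*;
import gate.util.persistence.PersistenceManager;

public class PipelinePool {
    private final BlockingQueue<CorpusController> pool = new LinkedBlockingQueue<>();

    // Build once at startup, after Gate.init()
    public PipelinePool(File gappFile, int size) throws Exception {
        CorpusController template =
            (CorpusController) PersistenceManager.loadObjectFromFile(gappFile);
        pool.add(template);
        for (int i = 1; i < size; i++) {
            // Factory.duplicate creates an independent copy of the loaded pipeline
            pool.add((CorpusController) Factory.duplicate(template));
        }
    }

    public String process(URL docUrl) throws Exception {
        CorpusController controller = pool.take(); // blocks until a pipeline is free
        Corpus corpus = Factory.newCorpus("pool corpus");
        Document doc = Factory.newDocument(docUrl, "utf-8");
        try {
            corpus.add(doc);
            controller.setCorpus(corpus);
            controller.execute();
            return doc.toXml();
        } finally {
            // Free the per-request resources, but NOT the controller itself
            corpus.clear();
            Factory.deleteResource(doc);
            Factory.deleteResource(corpus);
            controller.setCorpus(null);
            pool.put(controller); // hand the pipeline back for the next request
        }
    }
}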

Textract Form Analysis, Java SDK 1.x

I'm looking to extract form data using Textract. I've tested with a PDF in the demo and the results are great. Results using the SDK, however, are far from optimal; in fact, they are completely inaccurate. If I use StartDocumentAnalysisRequest/StartDocumentAnalysisResult (asynchronous), I only get one block back, of type PAGE, and never KEY_VALUE_SET. If I convert my PDF to an image and use the synchronous methods, I do get KEY_VALUE_SET back, but the results are completely inaccurate.
Does anyone know how I can use the asynchronous analysis functionality to retrieve form values as the documentation indicates?
Sample code below:
StartDocumentAnalysisResult startDocumentAnalysisResult = amazonTextract.startDocumentAnalysis(req);
String startJobId = startDocumentAnalysisResult.getJobId();
GetDocumentAnalysisResult documentAnalysisResult = null;
String jobStatus = "IN_PROGRESS";
while (jobStatus.equals("IN_PROGRESS")) {
    try {
        TimeUnit.SECONDS.sleep(10);
        GetDocumentAnalysisRequest documentAnalysisRequest = new GetDocumentAnalysisRequest()
                .withJobId(startJobId)
                .withMaxResults(1);
        documentAnalysisResult = amazonTextract.getDocumentAnalysis(documentAnalysisRequest);
        jobStatus = documentAnalysisResult.getJobStatus();
    } catch (Exception e) {
        logger.error(e);
    }
}
if (!jobStatus.equals("IN_PROGRESS")) {
    List<Block> blocks = documentAnalysisResult.getBlocks();
    logger.error("block list size " + blocks.size());
    Map<String, Map<String, Block>> keyValueBlockMap = new HashMap<>();
    Map<String, Block> keyMap = new HashMap<>();
    Map<String, Block> valueMap = new HashMap<>();
    Map<String, Block> blockMap = new HashMap<>();
    for (Block block : blocks) {
        logger.error("Block Type:" + block.getBlockType());
        String blockId = block.getId();
        blockMap.put(blockId, block);
        if (block.getBlockType().equals("KEY_VALUE_SET")) {
            if (block.getEntityTypes().contains("KEY")) {
                keyMap.put(blockId, block);
            } else {
                valueMap.put(blockId, block);
            }
        }
    }
    keyValueBlockMap.put("keyMap", keyMap);
    keyValueBlockMap.put("valueMap", valueMap);
    keyValueBlockMap.put("blockMap", blockMap);
    Map<String, String> keyValueRelationShip = getKeyValueRelationShip(keyValueBlockMap);
    for (String key : keyValueRelationShip.keySet()) {
        logger.error("Key: " + key);
        logger.error("Value: " + keyValueRelationShip.get(key));
    }
}
The synchronous path, which produces completely inaccurate results:
AnalyzeDocumentRequest request = new AnalyzeDocumentRequest()
        .withFeatureTypes(FeatureType.FORMS)
        .withDocument(new Document()
                .withS3Object(new com.amazonaws.services.textract.model.S3Object()
                        .withName(objectName)
                        .withBucket(awsHelper.getS3BucketName())));
AnalyzeDocumentResult result = amazonTextract.analyzeDocument(request);
You are not using the recommended version of the AWS SDK for Java; you are using an old one.
I have tested the AWS SDK for Java v2 and I am able to get lines and text that line up with the AWS Management Console.
You can find Textract v2 examples in the repo linked above.
I am able to get the lines and the corresponding text by using software.amazon.awssdk.services.textract.TextractClient.
For example, when I debug through the code using the same PNG as I used in the console, I get the proper result.
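For reference, a minimal sketch of a v2 forms call (the region, bucket name, and object key here are placeholders, not from the original post):

import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.textract.TextractClient;
import software.amazon.awssdk.services.textract.model.*;

public class TextractV2Forms {
    public static void main(String[] args) {
        TextractClient textract = TextractClient.builder()
                .region(Region.US_EAST_1) // placeholder: use your region
                .build();

        AnalyzeDocumentRequest request = AnalyzeDocumentRequest.builder()
                .featureTypes(FeatureType.FORMS)
                .document(Document.builder()
                        .s3Object(S3Object.builder()
                                .bucket("my-bucket")   // placeholder
                                .name("my-form.png")   // placeholder
                                .build())
                        .build())
                .build();

        AnalyzeDocumentResponse response = textract.analyzeDocument(request);
        for (Block block : response.blocks()) {
            // KEY_VALUE_SET blocks carry the form's key/value structure
            if (block.blockType() == BlockType.KEY_VALUE_SET) {
                System.out.println(block.entityTypes() + " -> " + block.id());
            }
        }
        textract.close();
    }
}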

Call Python code from WCF

I need to make some Python code available as a WCF service for another application to access. The Python code was built by the data science team and I have no ability to change it. I tried running the script as a process, but it throws a System.InvalidOperationException.
When I create the same program as a C# console application it works fine. My questions are:
a. Is this the right way to go about making Python code available to another application (a REST API is not an option)?
b. What is the issue with my code?
public string ClassifyText(string value)
{
    string textoutput = "";
    string exeFileName = HttpContext.Current.Server.MapPath("~/python.exe");
    string argName = HttpContext.Current.Server.MapPath("~/predictionscript.py");
    ProcessStartInfo start = new ProcessStartInfo();
    start.FileName = exeFileName;
    start.Arguments = argName;
    start.UseShellExecute = false;
    start.RedirectStandardOutput = true;
    using (Process process = Process.Start(start))
    {
        using (StreamReader reader = process.StandardOutput)
        {
            string result = reader.ReadToEnd();
            textoutput = result;
        }
    }
    return textoutput;
}

What's the efficient way to make an HTTP request and read the InputStream in a Spark map task?

Please see the code sample below:
JavaRDD<String> mapRDD = filteredRecords
        .map(new Function<String, String>() {
            @Override
            public String call(String url) throws Exception {
                BufferedReader in = null;
                URL formatURL = new URL((url.replaceAll("\"", "")).trim());
                try {
                    HttpURLConnection con = (HttpURLConnection) formatURL.openConnection();
                    in = new BufferedReader(new InputStreamReader(con.getInputStream()));
                    return in.readLine();
                } finally {
                    if (in != null) {
                        in.close();
                    }
                }
            }
        });
Here the url is an HTTP GET request. Example:
http://ip:port/cyb/test?event=movie&id=604568837&name=SID&timestamp_secs=1460494800&timestamp_millis=1461729600000&back_up_id=676700166
This piece of code is very slow. The IP and port are random and the load is distributed, so an IP can take 20 different values with different ports, so I don't see a bottleneck.
When I comment out
in = new BufferedReader(new InputStreamReader(con.getInputStream()));
return in.readLine();
the code runs very fast.
NOTE: The input data to process is 10 GB, read from S3 using Spark.
Is there anything wrong with how I am using BufferedReader or InputStreamReader, or is there an alternative?
I can't use foreach in Spark since I have to get the response back from the server and need to save the JavaRDD as a text file on HDFS.
If we use mapPartitions, the code looks something like this:
JavaRDD<String> mapRDD = filteredRecords.mapPartitions(new FlatMapFunction<Iterator<String>, String>() {
    @Override
    public Iterable<String> call(Iterator<String> tuple) throws Exception {
        final List<String> rddList = new ArrayList<String>();
        Iterable<String> iterable = new Iterable<String>() {
            @Override
            public Iterator<String> iterator() {
                return rddList.iterator();
            }
        };
        while (tuple.hasNext()) {
            URL formatURL = new URL((tuple.next().replaceAll("\"", "")).trim());
            HttpURLConnection con = (HttpURLConnection) formatURL.openConnection();
            try (BufferedReader br = new BufferedReader(new InputStreamReader(con.getInputStream()))) {
                rddList.add(br.readLine());
            } catch (IOException ex) {
                return rddList;
            }
        }
        return iterable;
    }
});
But here too we are making the same call for each record, aren't we?
Currently you are using the map function, which creates a URL request for each row in the partition. You can use mapPartitions instead, which will make the code run faster because it lets you set up the connection to the server only once, that is, one connection per partition.
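A minimal sketch of that idea, assuming Apache HttpClient (which is not in the original code) for its connection pooling and keep-alive: one client is built per partition and reused for every URL in that partition.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;

JavaRDD<String> mapRDD = filteredRecords.mapPartitions(new FlatMapFunction<Iterator<String>, String>() {
    @Override
    public Iterable<String> call(Iterator<String> urls) throws Exception {
        List<String> results = new ArrayList<String>();
        // One pooled client per partition; keep-alive reuses TCP connections across requests
        CloseableHttpClient client = HttpClients.createDefault();
        try {
            while (urls.hasNext()) {
                String url = urls.next().replaceAll("\"", "").trim();
                CloseableHttpResponse resp = client.execute(new HttpGet(url));
                try (BufferedReader br = new BufferedReader(
                        new InputStreamReader(resp.getEntity().getContent()))) {
                    results.add(br.readLine());
                } finally {
                    resp.close();
                }
            }
        } finally {
            client.close();
        }
        return results;
    }
});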
A big cost here is setting up TCP/HTTPS connections. This is exacerbated by the fact that, even if you only read the first (short) line of a large file, modern HTTP clients try to read() to the end of the file in an attempt to re-use HTTP/1.1 connections, thereby avoiding aborting the connection. This is a good strategy for small files, but not for those in the MB range.
There is a solution: set the content length on the read so that only a smaller block is read in, reducing the cost of the close(); the connection recycling then reduces HTTPS setup costs. This is what the latest Hadoop/Spark S3A client does if you set fadvise=random on the connection: it requests blocks rather than the entire multi-GB file. Be aware, though: that design is actually really bad if you are going byte-by-byte through a file.
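One way to apply that idea directly with HttpURLConnection (a sketch, assuming the server honours HTTP range requests) is to ask only for the first few kilobytes, so closing the stream never has to drain a large body:

HttpURLConnection con = (HttpURLConnection) formatURL.openConnection();
// Request only the first 4 KB; close() then has little or nothing left to drain
con.setRequestProperty("Range", "bytes=0-4095");
try (BufferedReader br = new BufferedReader(new InputStreamReader(con.getInputStream()))) {
    return br.readLine();
}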

How to get the value of 'CARBON_HOME' in Java code

I am trying to implement an Axis2 service that receives user requests and publishes them as events to a CEP using Carbon databridge Thrift (via org.wso2.carbon.databridge.agent.thrift.DataPublisher).
I followed the code sample provided in wso2cep-3.1.0/samples/producers/activity-monitor.
Please see the following code snippet:
public class GatewayServiceSkeleton {
    private static Logger logger = Logger.getLogger(GatewayServiceSkeleton.class);

    public RequestResponse request(Request request) throws AgentException,
            MalformedStreamDefinitionException, StreamDefinitionException,
            DifferentStreamDefinitionAlreadyDefinedException,
            MalformedURLException, AuthenticationException, DataBridgeException,
            NoStreamDefinitionExistException, TransportException, SocketException,
            org.wso2.carbon.databridge.commons.exception.AuthenticationException {
        final String GATEWAY_SERVICE_STREAM = "gateway.cep";
        final String VERSION = "1.0.0";
        final String PROTOCOL = "tcp://";
        final String CEPHOST = "cep.gubnoi.com";
        final String CEPPORT = "7611";
        final String CEPUSERNAME = "admin";
        final String CEPPASSWORD = "admin";
        Object[] metadata = { request.getDeviceID(), request.getViewID() };
        Object[] correlationdata = { request.getSessionID() };
        Object[] payloaddata = { request.getBucket() };
        KeyStoreUtil.setTrustStoreParams();
        KeyStoreUtil.setKeyStoreParams();
        DataPublisher dataPublisher = new DataPublisher(PROTOCOL + CEPHOST + ":" + CEPPORT, CEPUSERNAME, CEPPASSWORD);
        // create the event
        Event event = new Event(GATEWAY_SERVICE_STREAM + ":" + VERSION, System.currentTimeMillis(), metadata, correlationdata, payloaddata);
        // publish the event for a valid stream
        dataPublisher.publish(event);
        // stop
        dataPublisher.stop();
        RequestResponse response = new RequestResponse();
        response.setSessionID(request.getSessionID());
        response.setDeviceID(request.getDeviceID());
        response.setViewID(request.getViewID());
        response.setBucket(request.getBucket());
        return response;
    }
}
There is also a utility class that sets the key store parameters, as follows:
public class KeyStoreUtil {
    static File filePath = new File("../../../repository/resources/security");

    public static void setTrustStoreParams() {
        String trustStore = filePath.getAbsolutePath();
        System.setProperty("javax.net.ssl.trustStore", trustStore + "/client-truststore.jks");
        System.setProperty("javax.net.ssl.trustStorePassword", "wso2carbon");
    }

    public static void setKeyStoreParams() {
        String keyStore = filePath.getAbsolutePath();
        System.setProperty("Security.KeyStore.Location", keyStore + "/wso2carbon.jks");
        System.setProperty("Security.KeyStore.Password", "wso2carbon");
    }
}
I deployed the service to wso2as-5.2.1 and called it using SOAPUI.
The request returned the error message "cannot borrow client for TCP".
I debugged and found that the problem probably lies with the class KeyStoreUtil, where the 'filePath' somehow returned a 'null':
static File filePath = new File("../../../repository/resources/security");
which caused the failure on this line:
DataPublisher dataPublisher = new DataPublisher(PROTOCOL + CEPHOST + ":" + CEPPORT, CEPUSERNAME, CEPPASSWORD);
I guess it would be a better idea to use the value of CARBON_HOME to figure out the location of the key store.
So my question is: how can I get the value of CARBON_HOME in Java code?
That said, if you think about it a bit more: the service will be called numerous times, whereas setTrustStoreParams and setKeyStoreParams only need to be executed once, at server/service initialization.
So, are there any even better ways to move setTrustStoreParams and setKeyStoreParams out of the service code, or to implement them as configurable items?
Please advise.
Thanks.
So my question is: how can I get the value of CARBON_HOME in Java code?
You should use the system property carbon.home, as follows, which retrieves the WSO2 product's home directory:
System.getProperty("carbon.home");

How to create a new record from a web service in ADF?

I have created a class and published it as a web service. I created a web method like this:
public void addNewRow(MyObject cob) {
    MyAppModule myAppModule = new MyAppModule();
    try {
        ViewObjectImpl vo = myAppModule.getMyVewObject1();
        // ================> vo is null at this point
        Row r = vo.createRow();
        r.setAttribute("Param1", cob.getParam1());
        r.setAttribute("Param2", cob.getParam2());
        vo.executeQuery();
        getTransaction().commit();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
As noted in the code, myAppModule.getMyVewObject1() returns a null object, and I do not understand why. As far as I know, the application module should initialize the object by itself when I call getMyVewObject1(), but maybe I am wrong, or maybe this is not how it works for web methods. Has anyone faced this issue? Any help would be much appreciated.
You can check this nice tutorial: Building and Using Web Services with JDeveloper.
It gives you a general idea of how to build your web services with ADF.
Another approach, when you need to call an existing application module from some bean that doesn't have the needed environment (a servlet, etc.), is to initialize it like this:
String appModuleName = "org.my.package.name.model.AppModule";
String appModuleConfig = "AppModuleLocal";
ApplicationModule am = Configuration.createRootApplicationModule(appModuleName, appModuleConfig);
Don't forget to release it:
Configuration.releaseRootApplicationModule(am, true);
And why you shouldn't really do it like this.
And even more...
A better approach is to get access to the binding layer and make the call from there.
Here is a nice article.
Per our PM: if you don't use it in the context of an ADF application, then the following code should be used (the sample code is from a project I am involved in). Note the release of the AM at the end of the request:
#WebService(serviceName = "LightViewerSoapService")
public class LightViewerSoapService {
private final String amDef = " oracle.demo.lightbox.model.viewer.soap.services.LightBoxViewerService";
private final String config = "LightBoxViewerServiceLocal";
LightBoxViewerServiceImpl service;
public LightViewerSoapService() {
super();
}
#WebMethod
public List<Presentations> getAllUserPresentations(#WebParam(name = "userId") Long userId){
ArrayList<Presentations> al = new ArrayList<Presentations>();
service = (LightBoxViewerServiceImpl)getApplicationModule(amDef,config);
ViewObject vo = service.findViewObject("UserOwnedPresentations");
VariableValueManager vm = vo.ensureVariableManager();
vm.setVariableValue("userIdVariable", userId.toString());
vo.applyViewCriteria(vo.getViewCriteriaManager().getViewCriteria("byUserIdViewCriteria"));
Row rw = vo.first();
if(rw != null){
Presentations p = createPresentationFromRow(rw);
al.add(p);
while(vo.hasNext()){
rw = vo.next();
p = createPresentationFromRow(rw);
al.add(p);
}
}
releaseAm((ApplicationModule)service);
return al;
}
Have a look here too:
http://www.youtube.com/watch?v=jDBd3JuroMQ