What's the efficient way to make an HTTP request and read the InputStream in a Spark map task? - amazon-web-services

Please see the code sample below:
JavaRDD<String> mapRDD = filteredRecords
    .map(new Function<String, String>() {
        @Override
        public String call(String url) throws Exception {
            BufferedReader in = null;
            URL formatURL = new URL(url.replaceAll("\"", "").trim());
            try {
                HttpURLConnection con = (HttpURLConnection) formatURL.openConnection();
                in = new BufferedReader(new InputStreamReader(con.getInputStream()));
                return in.readLine();
            } finally {
                if (in != null) {
                    in.close();
                }
            }
        }
    });
Here url is an HTTP GET request, for example:
http://ip:port/cyb/test?event=movie&id=604568837&name=SID&timestamp_secs=1460494800&timestamp_millis=1461729600000&back_up_id=676700166
This piece of code is very slow. The IP and port are random and the load is distributed, so the IP can have 20 different values paired with the port, so I don't see a bottleneck there.
When I comment out
in = new BufferedReader(new InputStreamReader(con.getInputStream()));
return in.readLine();
the code runs very fast.
NOTE: The input data to process is 10 GB, read from S3 using Spark.
Is there anything wrong with how I am using BufferedReader or InputStreamReader, or is there an alternative?
I can't use foreach in Spark because I have to get the response back from the server and need to save the JavaRDD as a text file on HDFS.
If we use mapPartitions, the code looks something like the following:
JavaRDD<String> mapRDD = filteredRecords.mapPartitions(new FlatMapFunction<Iterator<String>, String>() {
    @Override
    public Iterable<String> call(Iterator<String> tuple) throws Exception {
        final List<String> rddList = new ArrayList<String>();
        Iterable<String> iterable = new Iterable<String>() {
            @Override
            public Iterator<String> iterator() {
                return rddList.iterator();
            }
        };
        while (tuple.hasNext()) {
            URL formatURL = new URL(tuple.next().replaceAll("\"", "").trim());
            HttpURLConnection con = (HttpURLConnection) formatURL.openConnection();
            try (BufferedReader br = new BufferedReader(new InputStreamReader(con.getInputStream()))) {
                rddList.add(br.readLine());
            } catch (IOException ex) {
                return rddList;
            }
        }
        return iterable;
    }
});
Here too, aren't we doing the same thing for each record?

Currently you are using the
map function
which creates a URL request for each row in the partition.
You can use
mapPartitions
which will make the code run faster, as it lets you set up the connection to the server only once, that is, one connection per partition.
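A minimal sketch of that per-partition setup, assuming Apache HttpClient 4.x (org.apache.http) is available on the executor classpath and the same Spark 1.x FlatMapFunction signature as in the question; the single CloseableHttpClient, and the keep-alive connections it pools, are reused for every URL in the partition instead of being created per record:
JavaRDD<String> mapRDD = filteredRecords.mapPartitions(
    new FlatMapFunction<Iterator<String>, String>() {
        @Override
        public Iterable<String> call(Iterator<String> urls) throws Exception {
            List<String> responses = new ArrayList<String>();
            // One client (and its connection pool) per partition, not per record.
            CloseableHttpClient client = HttpClients.createDefault();
            try {
                while (urls.hasNext()) {
                    String url = urls.next().replaceAll("\"", "").trim();
                    CloseableHttpResponse response = client.execute(new HttpGet(url));
                    try {
                        // Consuming the entity fully lets the connection go back to the pool.
                        responses.add(EntityUtils.toString(response.getEntity()));
                    } finally {
                        response.close();
                    }
                }
            } finally {
                client.close();
            }
            return responses;
        }
    });
Whether this actually helps depends on how many of the 20 ip:port endpoints each partition hits; connections are only reused for requests to the same host.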

A big cost here is setting up TCP/HTTPS connections. This is exacerbated by the fact that, even if you only read the first (short) line of a large file, modern HTTP clients try to read() to the end of the file in an attempt to re-use HTTP/1.1 connections better, so avoiding aborting the connection. This is a good strategy for small files, but not for those in the MB range.
There is a solution there: set the content-length on the read, so that only a smaller block is read in, reducing the cost of the close(); the connection recycling then reduces HTTPS setup costs. This is what the latest Hadoop/Spark S3A client does if you set fadvise=random on the connection: it requests blocks rather than the entire multi-GB file. Be aware, though: that design is actually really bad if you are going byte-by-byte through a file...
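For reference, a minimal way to turn on that random/block-read mode, assuming a Hadoop 2.8+ S3A client where the property is named fs.s3a.experimental.input.fadvise (check the S3A documentation for your exact version):
SparkConf conf = new SparkConf()
    .setAppName("s3a-random-reads")
    // Property name assumes Hadoop 2.8+; S3A then issues ranged GETs
    // for blocks instead of reading whole objects.
    .set("spark.hadoop.fs.s3a.experimental.input.fadvise", "random");
JavaSparkContext sc = new JavaSparkContext(conf);

// Equivalent setting directly on the Hadoop configuration:
sc.hadoopConfiguration().set("fs.s3a.experimental.input.fadvise", "random");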

Related

Error 414 When sending invoice to Amazon MWS with _UPLOAD_VAT_INVOICE_

I'm trying to send invoices to Amazon MWS through _UPLOAD_VAT_INVOICE_, following the Java example in this guide:
Link
The PDF file is a simple invoice of 85 KB.
The error is status code 414, which is "URI too long".
Debugging the original Amazon class MarketplaceWebServiceClient I see this:
if (request instanceof SubmitFeedRequest) {
    // For SubmitFeed, HTTP body is reserved for the Feed Content and the function parameters
    // are contained within the HTTP header
    SubmitFeedRequest sfr = (SubmitFeedRequest) request;
    method = new HttpPost(config.getServiceURL() + "?" + getSubmitFeedUrlParameters(parameters));
The getSubmitFeedUrlParameters method takes every parameter and adds it to the query string. One of these parameters is contentMD5, from:
String contentMD5 = Base64.encodeBase64String(pdfDocument);
So there is a very large string representing the PDF file passed as a parameter. This causes error 414.
But that class is the original one taken from MaWSJavaClientLibrary-1.1.jar
Can anybody help me please?
Thanks
For the last 2 days I was working on the same problem.
I changed it like this and it works now:
InputStream contentStream = new ByteArrayInputStream(pdfDocument);
String contentMD5 = computeContentMD5Header(new ByteArrayInputStream(pdfDocument));

public static String computeContentMD5Header(InputStream inputStream) {
    // Consume the stream to compute the MD5 as a side effect.
    DigestInputStream s;
    try {
        s = new DigestInputStream(inputStream, MessageDigest.getInstance("MD5"));
        // Drain the buffer, as the digest is computed as a side effect.
        byte[] buffer = new byte[8192];
        while (s.read(buffer) > 0)
            ;
        return new String(
            org.apache.commons.codec.binary.Base64.encodeBase64(s.getMessageDigest().digest()),
            "UTF-8");
    } catch (NoSuchAlgorithmException e) {
        throw new RuntimeException(e);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}
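To tie this back to the 414: the point is that the feed bytes go in the HTTP body and only the short Base64 MD5 string travels as a parameter. A rough, hypothetical sketch of wiring the stream and the computed MD5 into the request (the setter names are my assumption of the usual MWS Feeds client API; verify them against your version of the library):
// Hypothetical wiring; check the exact setter names in your MWS client version.
SubmitFeedRequest request = new SubmitFeedRequest();
request.setFeedType("_UPLOAD_VAT_INVOICE_");
request.setFeedContent(new ByteArrayInputStream(pdfDocument));          // feed bytes go in the body
request.setContentMD5(computeContentMD5Header(new ByteArrayInputStream(pdfDocument))); // short MD5 string, not the Base64 PDF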

S3AbortableInputStream : Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. Warning when reading only ObjectMetadata

I am using <aws.java.sdk>1.11.637</aws.java.sdk> with Spring Boot 2.1.4.RELEASE.
Code which causes the S3 warning (here I have to access only getUserMetadata from the S3Object and not the whole object content):
private Map<String, String> getUserHeaders(String key) throws IOException {
    Map<String, String> userMetadata = new HashMap<>();
    S3Object s3Object = null;
    try {
        s3Object = s3Client.getObject(new GetObjectRequest(bucketName, key));
        userMetadata.putAll(s3Object.getObjectMetadata().getUserMetadata());
    } finally {
        if (s3Object != null) {
            s3Object.close();
        }
    }
    return userMetadata;
}
Output: Whenever s3Object.close() is invoked, I see a warning on the console with the below message:
{"msg":"Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.","logger":"com.amazonaws.services.s3.internal.S3AbortableInputStream","level":"WARN","component":"demo-app"}
My investigation into the cause of the error:
I further checked https://github.com/aws/aws-sdk-java/blob/c788ca832287484c327c8b32c0e2b0090a74c23d/aws-java-sdk-s3/src/main/java/com/amazonaws/services/s3/internal/S3AbortableInputStream.java#L173-L187 and it says that if _readAllBytes() is not true (in my case, where I am using the S3Object just for getUserMetadata and not the whole stream content), there will always be a warning.
Questions:
a) How does S3Object.close lead to invoking S3AbortableInputStream.close? As far as I can see, the code inside S3Object.close https://github.com/aws/aws-sdk-java/blob/c788ca832287484c327c8b32c0e2b0090a74c23d/aws-java-sdk-s3/src/main/java/com/amazonaws/services/s3/model/S3Object.java#L222 only invokes SdkFilterInputStream.close via is.close();
b) How should I get rid of these warnings when I want to use the S3Object only to read the metadata and not the whole object content?
Maybe try using this API function designed to retrieve only the metadata for an S3 object:
ObjectMetadata getObjectMetadata(GetObjectMetadataRequest getObjectMetadataRequest)
    throws SdkClientException, AmazonServiceException
So change your code to:
ObjectMetadata s3ObjectMeta = null;
s3ObjectMeta = s3Client.getObjectMetadata(new GetObjectMetadataRequest(bucketName, key));
userMetadata.putAll(s3ObjectMeta.getUserMetadata());
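Put together, a minimal sketch of a metadata-only version of the original getUserHeaders method (reusing the question's s3Client, bucketName and key); because getObjectMetadata issues a HEAD request, no content stream is opened and the S3AbortableInputStream warning goes away:
private Map<String, String> getUserHeaders(String key) {
    Map<String, String> userMetadata = new HashMap<>();
    // HEAD request: only headers/metadata are fetched, nothing to drain or abort.
    ObjectMetadata s3ObjectMeta =
            s3Client.getObjectMetadata(new GetObjectMetadataRequest(bucketName, key));
    userMetadata.putAll(s3ObjectMeta.getUserMetadata());
    return userMetadata;
}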

Updating a file in Amazon S3 bucket

I am trying to append a string to the end of a text file stored in S3.
Currently I just read the contents of the file into a String, append my new text and resave the file back to S3.
Is there a better way to do this? I am thinking that when the file is >>> 10 MB, reading the entire file would not be a good idea, so how should I do this correctly?
Current code:
private void saveNoteToFile(String p_note) throws IOException, ServletException
{
    String str_infoFileName = "myfile.json";
    String existingNotes = s3Helper.getfileContentFromS3(str_infoFileName);
    existingNotes += p_note;
    writeStringToS3(str_infoFileName, existingNotes);
}
public void writeStringToS3(String p_fileName, String p_data) throws IOException
{
    ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(p_data.getBytes());
    try {
        streamFileToS3bucket(p_fileName, byteArrayInputStream, p_data.getBytes().length);
    }
    catch (AmazonServiceException e)
    {
        e.printStackTrace();
    }
    catch (AmazonClientException e)
    {
        e.printStackTrace();
    }
}
public void streamFileToS3bucket(String p_fileName, InputStream input, long size)
{
    // Create sub folders if there are any in the file name.
    p_fileName = p_fileName.replace("\\", "/");
    if (p_fileName.charAt(0) == '/')
    {
        p_fileName = p_fileName.substring(1, p_fileName.length());
    }
    String folder = getFolderName(p_fileName);
    if (folder.length() > 0)
    {
        if (!doesFolderExist(folder))
        {
            createFolder(folder);
        }
    }
    ObjectMetadata metadata = new ObjectMetadata();
    metadata.setContentLength(size);
    AccessControlList acl = new AccessControlList();
    acl.grantPermission(GroupGrantee.AllUsers, Permission.Read);
    s3Client.putObject(new PutObjectRequest(bucket, p_fileName, input, metadata).withAccessControlList(acl));
}
It's not possible to append to an existing file on AWS S3. When you upload an object it creates a new version if it already exists:
If you upload an object with a key name that already exists in the
bucket, Amazon S3 creates another version of the object instead of
replacing the existing object
Source: http://docs.aws.amazon.com/AmazonS3/latest/UG/ObjectOperations.html
The objects are immutable.
It's also mentioned in these AWS Forum threads:
https://forums.aws.amazon.com/message.jspa?messageID=179375
https://forums.aws.amazon.com/message.jspa?messageID=540395
It's not possible to append to an existing file on AWS S3.
You can delete the existing file and upload a new file with the same name.
Configuration
private string bucketName = "my-bucket-name-123";
private static string awsAccessKey = "AKI............";
private static string awsSecretKey = "+8Bo..................................";
IAmazonS3 client = new AmazonS3Client(awsAccessKey, awsSecretKey,
    RegionEndpoint.APSoutheast2);
string awsFile = "my-folder/sub-folder/textFile.txt";
string localFilePath = "my-folder/sub-folder/textFile.txt";
To Delete
public void DeleteRefreshTokenFile()
{
    try
    {
        var deleteFileRequest = new DeleteObjectRequest
        {
            BucketName = bucketName,
            Key = awsFile
        };
        DeleteObjectResponse fileDeleteResponse = client.DeleteObject(deleteFileRequest);
    }
    catch (Exception ex)
    {
        throw new Exception(ex.Message);
    }
}
To Upload
public void UploadRefreshTokenFile()
{
    FileInfo file = new FileInfo(localFilePath);
    try
    {
        PutObjectRequest request = new PutObjectRequest()
        {
            InputStream = file.OpenRead(),
            BucketName = bucketName,
            Key = awsFile
        };
        PutObjectResponse response = client.PutObject(request);
    }
    catch (Exception ex)
    {
        throw new Exception(ex.Message);
    }
}
One option is to write the new lines/information to a new version of the file. This would create a LARGE number of versions. But, essentially, whatever program you are using the file for could read ALL the versions and append them back together when reading it (this seems like a really bad idea as I write it out).
Another option would be to write a new object each time with a timestamp appended to the object name, e.g. my-log-file-date-time. Then whatever program is reading from it could concatenate them all after downloading my-log-file-*.
You would want to delete objects older than a certain time just like log rotation.
Depending on how busy your events are this might work. If you have thousands per second, I don't think this would work. But if you just have a few events per minute it may be reasonable.
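A minimal sketch of that timestamped-object approach with the AWS SDK for Java v1, reusing the question's s3Client and p_note (the "my-log-file-" prefix and logBucket name are placeholders for illustration):
// Write each batch of new lines as its own object, keyed by timestamp.
String key = "my-log-file-" + System.currentTimeMillis();
byte[] bytes = p_note.getBytes();

ObjectMetadata metadata = new ObjectMetadata();
metadata.setContentLength(bytes.length);
s3Client.putObject(new PutObjectRequest("logBucket", key,
        new ByteArrayInputStream(bytes), metadata));

// A reader later lists my-log-file-* and concatenates the pieces in key order:
for (S3ObjectSummary summary :
        s3Client.listObjects("logBucket", "my-log-file-").getObjectSummaries()) {
    // download summary.getKey() and append its content to the combined view
}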
You can do it with s3api put-object.
First download the version you want and use the command below. It will upload it as the latest version.
ᐅ aws s3api put-object --bucket $BUCKET --key $FOLDER/$FILE --body $YOUR_LOCAL_DOWNLOADED_VERSION_FILE

GATE Embedded runtime

I want to use "GATE" through the web, so I decided to create a SOAP web service in Java with the help of GATE Embedded.
But for the same document and saved pipeline, I get a different run-time duration when GATE Embedded runs as a Java web service.
The same code has a constant run-time when it runs as a Java Application project.
In the web service, the run-time keeps increasing after each execution until I get a timeout error.
Does anyone have experience with this kind of problem?
This is my code:
@WebService(serviceName = "GateWS")
public class GateWS {

    @WebMethod(operationName = "gateengineapi")
    public String gateengineapi(@WebParam(name = "PipelineNumber") String PipelineNumber,
                                @WebParam(name = "Documents") String Docs) throws Exception {
        try {
            System.setProperty("gate.home", "C:\\GATE\\");
            System.setProperty("shell.path", "C:\\cygwin2\\bin\\sh.exe");
            Gate.init();
            File GateHome = Gate.getGateHome();
            File FrenchGapp = new File(GateHome, PipelineNumber);
            CorpusController FrenchController;
            FrenchController = (CorpusController) PersistenceManager.loadObjectFromFile(FrenchGapp);
            Corpus corpus = Factory.newCorpus("BatchProcessApp Corpus");
            FrenchController.setCorpus(corpus);
            File docFile = new File(GateHome, Docs);
            Document doc = Factory.newDocument(docFile.toURL(), "utf-8");
            corpus.add(doc);
            FrenchController.execute();
            String docXMLString = null;
            docXMLString = doc.toXml();
            String outputFileName = doc.getName() + ".out.xml";
            File outputFile = new File(docFile.getParentFile(), outputFileName);
            FileOutputStream fos = new FileOutputStream(outputFile);
            BufferedOutputStream bos = new BufferedOutputStream(fos);
            OutputStreamWriter out;
            out = new OutputStreamWriter(bos, "utf-8");
            out.write(docXMLString);
            out.close();
            gate.Factory.deleteResource(doc);
            return outputFileName;
        } catch (Exception ex) {
            return "ERROR: -> " + ex.getMessage();
        }
    }
}
I really appreciate any help you can provide.
The problem is that you're loading a new instance of the pipeline for every request, but then not freeing it again at the end of the request. GATE maintains a list internally of every PR/LR/controller that is loaded, so anything you load with Factory.createResource or PersistenceManager.loadObjectFrom... must be freed using Factory.deleteResource once it is no longer needed, typically using a try-finally:
FrenchController = (CorpusController) PersistenceManager.loadObjectFromFile(FrenchGapp);
try {
    // ...
} finally {
    Factory.deleteResource(FrenchController);
}
But...
Rather than loading a new instance of the pipeline every time, I would strongly recommend you explore a more efficient approach to load a smaller number of instances of the pipeline but keep them in memory to serve multiple requests. There is a fully worked-through example of this technique in the training materials on the GATE wiki, in particular module number 8 (track 2 Thursday).
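A minimal sketch of that pooling idea, assuming a fixed number of pipeline copies loaded once at startup and handed out per request via a java.util.concurrent.BlockingQueue (the pool size and gapp path are placeholders; the GATE training materials linked above show a fully worked version):
// Load N copies of the pipeline once, reuse them across requests.
private static final BlockingQueue<CorpusController> pool = new ArrayBlockingQueue<>(4);

static {
    try {
        Gate.init();
        File gapp = new File(Gate.getGateHome(), "pipeline.gapp"); // placeholder path
        for (int i = 0; i < 4; i++) {
            pool.add((CorpusController) PersistenceManager.loadObjectFromFile(gapp));
        }
    } catch (Exception e) {
        throw new ExceptionInInitializerError(e);
    }
}

public String process(File docFile) throws Exception {
    CorpusController controller = pool.take(); // blocks until a pipeline is free
    Corpus corpus = Factory.newCorpus("request corpus");
    Document doc = Factory.newDocument(docFile.toURL(), "utf-8");
    try {
        corpus.add(doc);
        controller.setCorpus(corpus);
        controller.execute();
        return doc.toXml();
    } finally {
        controller.setCorpus(null);
        Factory.deleteResource(doc);     // free the per-request resources...
        Factory.deleteResource(corpus);
        pool.put(controller);            // ...but keep the pipeline for the next request
    }
}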

Decoded response in Java ME (Nokia Asha)

I am implementing a small Java ME app. This app gets some data from a 3rd party resource and needs to be authenticated first. I make a first call to get the cookies (that was easy), and then a second call with these cookies to get the data. I googled a little how to do it and found the following solution - Deal with cookie with J2ME
I have changed this code to the following for my purposes:
public void getData(String url, String cookie) {
    HttpConnection hpc = null;
    InputStream is = null;
    try {
        hpc = (HttpConnection) Connector.open(url);
        hpc.setRequestProperty("cookie", cookie);
        hpc.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        hpc.setRequestProperty("Accept-Encoding", "gzip, deflate");
        hpc.setRequestProperty("Accept-Language", "en-US,en;q=0.5");
        is = hpc.openInputStream();
        int length = (int) hpc.getLength();
        byte[] response = new byte[length];
        is.read(response);
        String strResponse = new String(response);
    } catch (Exception e) {
        System.out.println(e.getMessage() + " " + e.toString());
    } finally {
        try {
            if (is != null)
                is.close();
            if (hpc != null)
                hpc.close();
        } catch (Exception e) {}
    }
}
I get something like the following:
??ÑÁNÃ0à;O±(²§M}A-?#
.?PYS¨Ôe¥Í#\üìde??XÊo}Vâ]hk?­6ëµóA|µvÞz'Íà?wAúêmw4í0?ÐÆ?ÚMW=?òêz CÛUa:6Ö7¼T?<oF?nh6[_0?l4?äê&)?çó³?ÅÕúf¨ä(.? ªDÙ??§?ÊP+??(:?Á,Si¾ïA¥ã-jJÅÄ8ÊbBçL)gs.S.þG5ÌÀÆéX}CÁíÑ-þ?BDK`²?\¶?ó3I÷ô±e]°6¬c?q?Ó?¼?Y.¯??Y?%?ÏP1è?ìw;?È Ò??e
|ôh0?
How can I decode this?
Stupid me. I didn't take into consideration this line: hpc.setRequestProperty("Accept-Encoding", "gzip, deflate"); I get a gzip-encoded response, and all I need to do is decode it.
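Two hedged ways to handle that, depending on the platform: either drop the Accept-Encoding header so the server sends plain text, or decompress the body. Note that CLDC-based Java ME does not ship java.util.zip, so the GZIPInputStream variant below assumes a Java SE runtime or a third-party gzip library on the device:
// Option 1 (Java ME friendly): don't advertise gzip support at all.
// Simply omit this line and the server should return an uncompressed body:
// hpc.setRequestProperty("Accept-Encoding", "gzip, deflate");

// Option 2 (Java SE, or a gzip library mirroring java.util.zip):
// in getData, replace the raw read with something like this.
InputStream raw = hpc.openInputStream();
InputStream in = "gzip".equals(hpc.getHeaderField("Content-Encoding"))
        ? new GZIPInputStream(raw)
        : raw;
ByteArrayOutputStream out = new ByteArrayOutputStream();
byte[] buf = new byte[4096];
int n;
while ((n = in.read(buf)) > 0) {
    out.write(buf, 0, n);
}
String strResponse = new String(out.toByteArray(), "UTF-8");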