Large file upload with Spark framework - Jetty

I'm trying to upload large files to a web application using the Spark framework, but I'm running into out-of-memory errors. It appears that Spark is caching the request body in memory. I'd like either to cache file uploads on disk or to read the request as a stream.
I've tried using the streaming support of Apache Commons FileUpload, but it appears that calling request.raw().getInputStream() causes Spark to read the entire body into memory and return an InputStream view of that chunk of memory, as done by this code. Based on the comment in the file, this is so that getInputStream can be called multiple times. Is there any way to change this behavior?

I recently had the same problem and I figured out that you could bypass the caching. I do so with the following function:
public ServletInputStream getInputStream(Request request) throws IOException {
    final HttpServletRequest raw = request.raw();
    if (raw instanceof ServletRequestWrapper) {
        return ((ServletRequestWrapper) raw).getRequest().getInputStream();
    }
    return raw.getInputStream();
}
This has been tested with Spark 2.4.
I'm not familiar with the inner workings of Spark, so one potential minor downside of this function is that you don't know whether you get the cached InputStream or not; the cached version is reusable, while the non-cached one is not.
To get around this downside I suppose you could implement a function similar to the following:
public boolean hasCachedInputStream(Request request) {
    return !(request.raw() instanceof ServletRequestWrapper);
}

The short answer is: not that I can see.
SparkServerFactory builds the JettyHandler, which has a private static class HttpRequestWrapper that reads the InputStream into memory.
All that static stuff means no extending available.

Related

Stage level data is not coming for bigquery running jobs through java bigquery libraries

I am using the com.google.cloud.bigquery library for fetching job-level details. We have the following code snippet:
Job job = getBigQuery(projectId, location)
        .getJob(JobId.newBuilder()
                .setJob("myJobId")
                .setLocation(location)
                .setProject(projectId)
                .build());
private BigQuery getBigQuery(String projectId, String location) throws IOException {
    // path to your credentials file
    String credentialsPath = "my private key credentials file";
    BigQuery bigQuery = BigQueryOptions.newBuilder()
            .setProjectId(projectId)
            .setLocation(location)
            .setCredentials(GoogleCredentials.fromStream(new FileInputStream(credentialsPath)))
            .build()
            .getService();
    return bigQuery;
}
My Dependency
<dependency>
    <groupId>com.google.cloud</groupId>
    <artifactId>google-cloud-bigquery</artifactId>
    <version>2.10.0</version>
</dependency>
Now for completed jobs I have no issue, but for some jobs that are in a running state (for example, with a duration of more than 1 minute), we get incomplete query plan data, which ultimately causes a NullPointerException.
If we observe the picture, for the job there is a jobStatistics part, and it shows a warning that it will throw java.lang.NullPointerException.
The main issue is that in our processing, when we check the queryPlan field, it is not null and reports some non-zero size, but when I try to process it in any loop, iterator, or stream, it throws the NullPointerException.
When I try to fetch the data for the same running job using the API directly, it gives complete details.
Ultimately the question is: why does BigQuery give different results for the Java library and the API, and why is the data incomplete on the Java library side (I have also tried updating the dependency version)? What is the solution for me, and how can I prevent my code from hitting the NullPointerException?
Ultimately the library uses the same API, but somehow in its internal processing the query plan data is not generated properly while the job is in a running state.
I was able to test the behaviour of the code as well as the API. When the query is running, most of the API response fields under queryPlan are 0, and therefore not complete. Only when the query has completed its execution does the queryPlan field show the complete information.
Also, as per this client library documentation, the queryPlan is available only once the query has completed its execution. So, the NullPointerException is the expected behaviour when the query is still running (tested this as well).
To prevent the NullPointerException, you might have to access the queryPlan when the state of the query is DONE.
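For illustration, here is a minimal sketch of that check, building on the getBigQuery helper and job lookup from the question (the variable names and the waitFor suggestion are my own additions):
Job job = getBigQuery(projectId, location)
        .getJob(JobId.newBuilder().setJob("myJobId").setLocation(location).setProject(projectId).build());

if (JobStatus.State.DONE.equals(job.getStatus().getState())) {
    // queryPlan is only populated once the query has finished executing
    JobStatistics.QueryStatistics stats = job.getStatistics();
    List<QueryStage> plan = stats.getQueryPlan();
    plan.forEach(stage -> System.out.println(stage.getName()));
} else {
    // either wait for completion (e.g. job.waitFor()) or skip the query plan for running jobs
}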

Apache Beam Kafka IO for Json messages - org.apache.kafka.common.errors.SerializationException

I am trying to get familiar with Apache Beam Kafka IO and am getting the following error:
Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: org.apache.kafka.common.errors.SerializationException: Size of data received by LongDeserializer is not 8
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:348)
at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:318)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:213)
at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:67)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:317)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:303)
at com.andrewjones.KafkaConsumerExample.main(KafkaConsumerExample.java:58)
Caused by: org.apache.kafka.common.errors.SerializationException: Size of data received by LongDeserializer is not 8
The following is the piece of code which reads messages from a Kafka topic. I would appreciate it if someone could provide some pointers.
public class KafkaConsumerJsonExample {
    static final String TOKENIZER_PATTERN = "[^\\p{L}]+";

    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.create();
        // Create the Pipeline object with the options we defined above.
        Pipeline p = Pipeline.create(options);

        p.apply(KafkaIO.<Long, String>read()
                .withBootstrapServers("localhost:9092")
                .withTopic("myTopic2")
                .withKeyDeserializer(LongDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .updateConsumerProperties(ImmutableMap.of("auto.offset.reset", (Object) "earliest"))
                // We're writing to a file, which does not support unbounded data sources.
                // This line makes it bounded to the first 5 records.
                // In reality, we would likely be writing to a data source that supports
                // unbounded data, such as BigQuery.
                .withMaxNumRecords(5)
                .withoutMetadata() // PCollection<KV<Long, String>>
        )
        .apply(Values.<String>create())
        .apply(TextIO.write().to("wordcounts"));

        System.out.println("running data pipeline");
        p.run().waitUntilFinish();
    }
}
The issue is caused by using LongDeserializer for keys that, it seems, were serialised with a serialiser other than Long; it depends on how you produced the records.
So you can either use a proper deserializer or, if the keys don't matter, try using StringDeserializer for the keys as well as a workaround.
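A minimal sketch of that workaround, reusing the pipeline from the question and assuming the keys can safely be read as strings (or ignored), might look like:
p.apply(KafkaIO.<String, String>read()
        .withBootstrapServers("localhost:9092")
        .withTopic("myTopic2")
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .updateConsumerProperties(ImmutableMap.of("auto.offset.reset", (Object) "earliest"))
        .withMaxNumRecords(5)
        .withoutMetadata() // now a PCollection<KV<String, String>>
 )
 .apply(Values.<String>create())
 .apply(TextIO.write().to("wordcounts"));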

OutOfMemoryError when creating AmazonS3Client in Lambda

I have an AWS Lambda function, configured with only 128MB of memory, that is triggered by SNS (which is itself triggered by S3) and downloads the file from S3.
In my function, I have the following:
public class LambdaHandler {

    private final AmazonS3Client s3Client = new AmazonS3Client();

    public void gdeltHandler(SNSEvent event, Context context) {
        System.out.println("Starting");
        System.out.println("Found " + eventFiles.size() + " event files");
    }
}
I've commented out and excluded from this post all of the logic because I am getting an OutOfMemoryError which I have isolated to the creation of the AmazonS3Client object. When I take that object out, I don't get the error. The exact above code results in the OutOfMemoryError.
I assigned 128MB of memory to the function, is that really not enough to simply grab the credentials and instantiate the AmazonS3Client object?
I've tried giving the AmazonS3Client constructor
new EnvironmentVariableCredentialsProvider()
as well as
new InstanceProfileCredentialsProvider()
with similar results.
Does the creation of the AmazonS3Client object simply require more memory?
Below is the stack trace:
Metaspace: java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: Metaspace
    at com.fasterxml.jackson.databind.deser.BeanDeserializerBuilder.build(BeanDeserializerBuilder.java:347)
    at com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.buildBeanDeserializer(BeanDeserializerFactory.java:242)
    at com.fasterxml.jackson.databind.deser.BeanDeserializerFactory.createBeanDeserializer(BeanDeserializerFactory.java:143)
    at com.fasterxml.jackson.databind.deser.DeserializerCache._createDeserializer2(DeserializerCache.java:409)
    at com.fasterxml.jackson.databind.deser.DeserializerCache._createDeserializer(DeserializerCache.java:358)
    at com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCache2(DeserializerCache.java:265)
    at com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCacheValueDeserializer(DeserializerCache.java:245)
    at com.fasterxml.jackson.databind.deser.DeserializerCache.findValueDeserializer(DeserializerCache.java:143)
    at com.fasterxml.jackson.databind.DeserializationContext.findRootValueDeserializer(DeserializationContext.java:439)
    at com.fasterxml.jackson.databind.ObjectReader._prefetchRootDeserializer(ObjectReader.java:1588)
    at com.fasterxml.jackson.databind.ObjectReader.<init>(ObjectReader.java:185)
    at com.fasterxml.jackson.databind.ObjectMapper._newReader(ObjectMapper.java:558)
    at com.fasterxml.jackson.databind.ObjectMapper.reader(ObjectMapper.java:3108)
When I try providing the InstanceProfileCredentialsProvider or EnvironmentVariableCredentialsProvider, I get the following stack trace:
Exception in thread "main" java.lang.Error: java.lang.OutOfMemoryError: Metaspace
    at lambdainternal.AWSLambda.<clinit>(AWSLambda.java:62)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at lambdainternal.LambdaRTEntry.main(LambdaRTEntry.java:94)
Caused by: java.lang.OutOfMemoryError: Metaspace
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at lambdainternal.EventHandlerLoader$PojoMethodRequestHandler.makeRequestHandler(EventHandlerLoader.java:421)
    at lambdainternal.EventHandlerLoader.getTwoLengthHandler(EventHandlerLoader.java:777)
    at lambdainternal.EventHandlerLoader.getHandlerFromOverload(EventHandlerLoader.java:802)
    at lambdainternal.EventHandlerLoader.loadEventPojoHandler(EventHandlerLoader.java:888)
    at lambdainternal.EventHandlerLoader.loadEventHandler(EventHandlerLoader.java:740)
    at lambdainternal.AWSLambda.findUserMethodsImmediate(AWSLambda.java:126)
    at lambdainternal.AWSLambda.findUserMethods(AWSLambda.java:71)
    at lambdainternal.AWSLambda.startRuntime(AWSLambda.java:219)
    at lambdainternal.AWSLambda.<clinit>(AWSLambda.java:60)
    ... 3 more
START RequestId: 58837136-483e-11e6-9ed3-39246839616a Version: $LATEST
END RequestId: 58837136-483e-11e6-9ed3-39246839616a
REPORT RequestId: 58837136-483e-11e6-9ed3-39246839616a Duration: 15002.92 ms Billed Duration: 15000 ms Memory Size: 128 MB Max Memory Used: 50 MB
2016-07-12T14:40:28.048Z 58837136-483e-11e6-9ed3-39246839616a Task timed out after 15.00 seconds
EDIT 1: If I increase the memory allocated to the function to even 192MB, it works just fine, though strangely enough it reports using only 59MB of memory in the CloudWatch logs. Am I simply losing the rest of the memory?
I have been observing this when using the AWS Java SDK within the Lambda function.
It would seem that when creating any of the AWS clients (Sync or Async) you may run out of Metaspace.
I believe this is due to things that the Amazon client performs upon instantiation, including AmazonHttpClient creation as well as dynamic loading of request handler chains (part of the AmazonEc2Client#init() private method).
It is possible that the reported memory usage is for the heap itself but does not include Metaspace. There are a few threads on the AWS Forums but no responses from AWS on the matter.
One way to reduce cold starts is to set the memory to 1536 MB and the timeout to 15 min. This gives you a dedicated host to run only your lambda instead of running it on a shared host, and when a new instance has to be started, the code is copied from a cache on the host rather than from S3.
This, though, will be more expensive, and if you don't want to do this, continue reading below.
How can I reduce my cold start times?
Follow the Lambda best practices
https://docs.aws.amazon.com/lambda/latest/dg/best-practices.html
By choosing a larger memory setting for your function
Think of the memory as a "power" setting because it also dictates how much CPU your function will receive.
By reducing the size of your function ZIP
This likely means reducing the number of dependencies you include in your function ZIP.
Java JARs can be further reduced in size using ProGuard
[Java Only] Use the bytestream interface instead of the POJO interface.
The JSON serialization libraries that Lambda uses internally can take some time to start. It will take dev work on your end, but you may be able to improve on this by using the byte stream interface along with a lightweight JSON library. Here are some links that may help:
http://docs.aws.amazon.com/lambda/latest/dg/java-handler-io-type-stream.html
https://github.com/FasterXML/jackson-jr
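As a rough sketch (the handler class name and payload shape below are my own assumptions, not an official example), a Lambda using the byte stream interface together with jackson-jr might look like this:
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestStreamHandler;
import com.fasterxml.jackson.jr.ob.JSON;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Collections;
import java.util.Map;

public class StreamingHandler implements RequestStreamHandler {

    @Override
    public void handleRequest(InputStream input, OutputStream output, Context context) throws IOException {
        // Parse the raw event with the lightweight jackson-jr API instead of the heavier POJO mapping.
        Map<String, Object> event = JSON.std.mapFrom(input);
        // ... handle the event here ...
        JSON.std.write(Collections.singletonMap("status", "ok"), output);
    }
}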
[Java Only] Don't use Java 8 feature that replaces anonymous classes (lambdas, method references, constructor references, etc.)
We've noticed internally that Java 8 Lambda-related bytecode appears to result in sub-optimal startup performance. If your code is using any Java 8 feature that replaces anonymous classes (lambdas, method references, constructor references, etc.) you may get better startup time by moving back to anonymous classes.
By using a different runtime
Different runtimes have different cold start times, and different runtime performance. While NodeJS might be better for heavy IO work, Go might be better for code that does a lot of concurrent work. Customers have done some basic benchmarks to compare language performance on Lambda, and here is a more generic comparison of different programming languages performance. There is no one-size-fits-all answer, use what makes sense for your requirements.
Basic benchmarks: https://read.acloud.guru/comparing-aws-lambda-performance-of-node-js-python-java-c-and-go-29c1163c2581
Generic comparison: https://benchmarksgame-team.pages.debian.net/benchmarksgame/which-programs-are-fast.html
Try increasing the memory allocated to the lambda from 128 to 256 MB.
I use a tactic that helps for Java-based lambdas. Any class resources that only need a single (reusable) instance can be declared as static class members, and initialized inside a static initializer block. When the lambda creates a new instance of the class to handle an execution, those expensive resources are already initialized. Here is a simple example:
package com.mydomain.myapp.lambda.sqs;

import com.amazonaws.services.lambda.runtime.events.SQSEvent;
import com.amazonaws.services.sns.AmazonSNS;
import com.amazonaws.services.sns.AmazonSNSClientBuilder;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.sql.Connection;
import java.sql.SQLException;
import java.util.Objects;

public class MyLambdaFunctionHandler {

    private static final Logger LOGGER = LoggerFactory.getLogger(MyLambdaFunctionHandler.class);

    // These values come from the 'Environment' property for the lambda, defined in template.yaml
    private static final String ENV_NAME = System.getenv("ENV_NAME");

    // Declare these as static properties so they only need to be created once,
    // rather than on each invocation of the lambda handler method that uses them
    private static final ObjectMapper OBJECT_MAPPER;
    private static final AmazonSNS SNS;
    private static final AmazonSQS SQS;

    static {
        LOGGER.info("static initializer | START");
        Objects.requireNonNull(ENV_NAME, "ENV_NAME cannot be null");
        OBJECT_MAPPER = new ObjectMapper();
        SNS = AmazonSNSClientBuilder.defaultClient();
        SQS = AmazonSQSClientBuilder.defaultClient();
        LOGGER.info("static initializer | END");
    }

    public MyLambdaFunctionHandler() {
        LOGGER.info("constructor invoked");
    }

    public void handlerMethod(SQSEvent event) {
        LOGGER.info("Received SQSEvent with {} messages", event.getRecords().size());
        event.getRecords().forEach(message -> handleOneSQSMessage(message));
    }

    private void handleOneSQSMessage(SQSEvent.SQSMessage message) {
        // your SQS message handling code here...
    }
}
The properties I declared as static will stay in memory until the lambda instance is destroyed by AWS.
This isn't how I would normally write Java code. Lambda-based code is treated differently, so I think it is OK to break some traditional patterns here.

invalid characters in JSON message after using await/job in play framework 1.2.5

I am using Play framework 1.2.5 jobs - after await, I send a message to the web UI in JSON format. The same JSON logic works fine when not using jobs - however, after using jobs and await, the JSON message appears to contain invalid characters (client-side JavaScript no longer recognizes it as valid JSON). The browser does not render the garbled/invalid characters - I will try using Wireshark and see if I can add more details. Any ideas on what could be causing this and how best to prevent it? Thanks in advance (I'm reasonably sure it's my code causing the problem since I can't be the first one doing this). I will also try testing with executors/futures instead of Play jobs and see how that goes.
Promise<String> testChk = new TestJobs(testInfo, "validateTest").now(); //TestJobs extends Job<String> & I'm overriding doJobWithResult. Also, constructor for TestJobs takes two fields (type for testInfo & String)
String testChkResp = await(testChk);
renderJSON(new TestMessage("fail", "failure message")); //TestMessage class has two String fields and is serializable
Update: I am using gson & JDK1.6
Update: It seems that there is a problem with encoding whenever I use Play jobs and renderJSON.
TestMessage: (works when not using jobs)
import java.io.Serializable;

public class TestMessage implements Serializable {
    public String status;
    public String response;

    public TestMessage() {
    }

    public TestMessage(String status, String response) {
        this.status = status;
        this.response = response;
    }
}
Update:
Even the following results in the UTF-8 encoding problem when relying on jobs:
renderJSON("test");
Sounds like it could be a bug. It may be related to your template - does it specify the encoding explicitly?
What format is the response? You can determine this by using the inspector in chrome or Web Console in Firefox.
(Though I certainly agree the behaviour should be consistent - it may be worth filing a bug here: http://play.lighthouseapp.com/projects/57987-play-framework/tickets )
It's a workaround: first reset the output stream, then render.
response.reset();
response.contentType = "application/json; charset=utf-8";
renderJSON("play has some bugs");
I was able to use futures & callables with executors, and the same code as mentioned above works (using Play 1.2.5). The only difference is that I was not explicitly using Play jobs (and hence the issue does not appear to be related to gson).
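For reference, a minimal sketch of that executor-based approach (the class, method, and pool-size choices are my own illustration, not the original code) could look like the following, with the controller later calling Future.get() and passing the result to renderJSON as before:
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ValidationService {

    private static final ExecutorService POOL = Executors.newFixedThreadPool(4);

    // Runs the validation off Play's job machinery; an anonymous class is used for JDK 1.6 compatibility.
    public static Future<String> validateAsync(final String testInfo) {
        return POOL.submit(new Callable<String>() {
            public String call() {
                // placeholder for the real validation logic
                return "validated: " + testInfo;
            }
        });
    }
}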

many queries in a task to generate json

So I've got a task to build which is going to archive a ton of data in our DB into JSON.
To give you a better idea of what is happening: X has 100s of Ys, and Y has 100s of Zs, and so on. I'm creating a JSON file for every X, Y, and Z, but every X JSON file has an array of ids for the child Ys of X, and likewise the Ys store an array of child Zs.
It's more complicated than that in many cases, but you should get an idea of the complexity involved from that example, I think.
I was using ColdFusion, but it seems to be a bad choice for this task because it is crashing due to memory errors. It seems to me that if it were removing queries from memory that are no longer referenced while running the task (i.e. garbage collecting), then the task would have enough memory, but as far as I can tell ColdFusion isn't doing any garbage collection at all, and must be doing it only after a request is complete.
So I'm looking either for advice on how to better achieve my task in CF, or for recommendations on other languages to use..
Thanks.
1) If you have debugging enabled, ColdFusion will hold on to your queries until the page is done. Turn it off!
2) You may need to structDelete() the query variable to allow it to be garbage collected, otherwise it may persist as long as the scope that has a reference to it exists, e.g.:
<cfset structDelete(variables,'myQuery') />
3) A cfquery pulls the entire ResultSet into memory. Most of the time this is fine, but for reporting on a large result set, you don't want this. Some JDBC drivers support setting the fetchSize, which, in a forward-only, read-only fashion, will let you get a few results at a time. This way you can deal with thousands and thousands of rows without swamping memory. I just generated a 1GB CSV file in ~80 seconds, using less than 100MB of heap. This requires dropping out to Java, but it kills two birds with one stone: it reduces the amount of data brought in at a time by the JDBC driver, and since you're working directly with the ResultSet, you don't hit the cfloop problem #orangepips mentioned. Granted, it's not for those without some Java chops.
You can do it something like this (you need cfusion.jar in your build path):
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

import au.com.bytecode.opencsv.CSVWriter;
import coldfusion.server.ServiceFactory;
import coldfusion.sql.DataSource;

public class CSVExport {

    public static void export(String dsn, String query, String fileName) {
        Connection conn = null;
        Statement stmt = null;
        ResultSet rs = null;
        FileWriter fw = null;
        BufferedWriter bw = null;
        try {
            DataSource ds = ServiceFactory.getDataSourceService().getDatasource(dsn);
            conn = ds.getConnection();
            // we want a forward-only, read-only result.
            // you may need to use a PreparedStatement instead.
            stmt = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY,
                ResultSet.CONCUR_READ_ONLY
            );
            // we only want to go forward!
            stmt.setFetchDirection(ResultSet.FETCH_FORWARD);
            // how many records to pull back at a time.
            // the hard part is balancing memory usage and round trips to the database.
            // basically sacrificing speed for a lower memory hit.
            stmt.setFetchSize(256);
            rs = stmt.executeQuery(query);
            // do something with the ResultSet, for example write to csv using opencsv.
            // the key is to stream it. you don't want it stored in memory.
            // so excel spreadsheets and pdf files are out, but text formats like
            // csv, json, html, and some binary formats like MDB (via jackcess)
            // that support streaming are in.
            fw = new FileWriter(fileName);
            bw = new BufferedWriter(fw);
            CSVWriter writer = new CSVWriter(bw);
            writer.writeAll(rs, true);
        }
        catch (Exception e) {
            // handle your exception.
            // maybe try ServiceFactory.getLoggingService() if you want to do a cflog.
            e.printStackTrace();
        }
        finally {
            try { rs.close(); } catch (Exception e) {}
            try { stmt.close(); } catch (Exception e) {}
            try { conn.close(); } catch (Exception e) {}
            try { bw.close(); } catch (Exception e) {}
            try { fw.close(); } catch (Exception e) {}
        }
    }
}
Figuring out how to pass parameters, logging, turning this into a background process (hint: extend Thread) etc. are separate issues, but if you grok this code, it shouldn't be too difficult.
4) Perhaps look at Jackson for generating your json. It supports streaming, and combined with the fetchSize, and a BufferedOutputStream, you should be able to keep the memory usage way down.
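As a rough illustration (the class and method names here are my own, not from the original answer), streaming a ResultSet into a JSON array with Jackson's streaming API could look like this:
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonGenerator;

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;

public class JsonExport {

    // Writes each row of the ResultSet as a JSON object inside one array,
    // streaming to disk so the full result set never sits in memory.
    public static void export(ResultSet rs, String fileName) throws Exception {
        JsonFactory factory = new JsonFactory();
        JsonGenerator gen = factory.createGenerator(
                new BufferedOutputStream(new FileOutputStream(fileName)));
        try {
            ResultSetMetaData meta = rs.getMetaData();
            gen.writeStartArray();
            while (rs.next()) {
                gen.writeStartObject();
                for (int i = 1; i <= meta.getColumnCount(); i++) {
                    gen.writeStringField(meta.getColumnLabel(i), rs.getString(i));
                }
                gen.writeEndObject();
            }
            gen.writeEndArray();
        } finally {
            gen.close();
        }
    }
}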
Eric, you are absolutely correct about ColdFusion garbage collection not removing query information from memory until request end, and I've documented it fairly extensively in another SO question. In short, you hit OoM exceptions when you loop over queries. You can prove it using a tool like VisualVM to generate a heap dump while the process is running and then running the resulting dump through the Eclipse Memory Analyzer Tool (MAT). What MAT would show you is a large hierarchy, starting with an object named (I'm not making this up) CFDummyContent that holds, among other things, references to cfquery and cfqueryparam tags. Note, attempting to change it up to stored procs or even doing the database interaction via JDBC does not make a difference.
So. What. To. Do?
This took me a while to figure out, but you've got 3 options in increasing order of complexity:
<cfthread/>
asynchronous CFML gateway
daisy chain http requests
Using cfthread looks like this:
<cfloop ...>
    <cfset threadName = "thread" & createUuid()>
    <cfthread name="#threadName#" input="#value#">
        <!--- do query stuff --->
        <!--- code has access to passed attributes (e.g. #attributes.input#) --->
        <cfset thread.passOutOfThread = somethingGeneratedInTheThread>
    </cfthread>
    <cfthread action="join" name="#threadName#">
    <cfset passedOutOfThread = cfthread["#threadName#"].passOutOfThread>
</cfloop>
Note, this code is not taking advantage of asynchronous processing, thus the immediate join after each thread call, but rather the side effect that cfthread runs in its own request-like scope independent of the page.
I'll not cover ColdFusion gateways here. HTTP daisy chaining means executing an increment of the work, and at the end of the increment launching a request to the same algorithm telling it to execute the next increment.
Basically, all three approaches allow those memory references to be collected mid process.
And yes, for whoever asks, bugs have been raised with Adobe; see the question referenced. Also, I believe this issue is specific to Adobe ColdFusion, but I have not tested Railo or OpenBD.
Finally, I have to rant. I've spent a lot of time tracking this one down, fixing it in my own large code base, and several others listed in the referenced question have as well. AFAIK Adobe has not acknowledged the issue, much less committed to fixing it. And yes, it's a bug, plain and simple.