Spring Batch: process multiple files concurrently

I'm using Spring Batch to process a large XML file (~2 million entities) and update a database. The process is quite time-consuming, so I'm trying to use partitioning to speed it up.
The approach I'm pursuing is to split the large XML file into smaller files (say, 500 entities each) and then use Spring Batch to process each file in parallel.
I'm struggling with the Java configuration needed to process multiple XML files in parallel. These are the relevant beans of my configuration:
@Bean
public Partitioner partitioner() {
    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    Resource[] resources;
    try {
        resources = resourcePatternResolver.getResources("file:/tmp/test/*.xml");
    } catch (IOException e) {
        throw new RuntimeException("I/O problems when resolving the input file pattern.", e);
    }
    partitioner.setResources(resources);
    return partitioner;
}

@Bean
public Step partitionStep() {
    return stepBuilderFactory.get("test-partitionStep")
            .partitioner(personStep())
            .partitioner("personStep", partitioner())
            .taskExecutor(taskExecutor())
            .build();
}

@Bean
public Step personStep() {
    return stepBuilderFactory.get("personStep")
            .<Person, Person>chunk(100)
            .reader(personReader())
            .processor(personProcessor())
            .writer(personWriter)
            .build();
}

@Bean
public TaskExecutor taskExecutor() {
    SimpleAsyncTaskExecutor asyncTaskExecutor = new SimpleAsyncTaskExecutor("spring_batch");
    asyncTaskExecutor.setConcurrencyLimit(10);
    return asyncTaskExecutor;
}
When I execute the job, I get various XML parsing errors (a different one each time). If I remove all but one of the XML files from the folder, the processing works as expected.
I'm not sure I fully understand the concept of Spring Batch partitioning, especially the "slave" step part.
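For reference, my current understanding is that each partition ("slave" step execution) should get its own reader instance, bound to exactly one of the small files via the step execution context. Below is a sketch of what I think that step-scoped reader would look like (assumptions: Person is a JAXB-annotated class, each entity is wrapped in a <person> element, and MultiResourcePartitioner's default "fileName" key is used):

@Bean
@StepScope
public StaxEventItemReader<Person> personReader(
        @Value("#{stepExecutionContext['fileName']}") String fileName) throws MalformedURLException {
    StaxEventItemReader<Person> reader = new StaxEventItemReader<>();
    // Each partition ("slave" execution) reads exactly one of the small XML files.
    reader.setResource(new UrlResource(fileName));
    // Assumption: each entity is wrapped in a <person> element.
    reader.setFragmentRootElementName("person");
    Jaxb2Marshaller marshaller = new Jaxb2Marshaller();
    marshaller.setClassesToBeBound(Person.class);
    reader.setUnmarshaller(marshaller);
    return reader;
}

Is that the missing piece, i.e. does the reader in personStep() have to be step-scoped like this so that every thread gets its own instance and its own file?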
Thanks!

Related

Can I force a Cassandra table flush from the C/C++ driver like nodetool does?

I'm wondering whether I can replicate the forceKeyspaceFlush() function found in the nodetool utility from Cassandra's C/C++ driver.
The nodetool function looks like this:
public class Flush extends NodeToolCmd
{
    @Arguments(usage = "[<keyspace> <tables>...]", description = "The keyspace followed by one or many tables")
    private List<String> args = new ArrayList<>();

    @Override
    public void execute(NodeProbe probe)
    {
        List<String> keyspaces = parseOptionalKeyspace(args, probe);
        String[] tableNames = parseOptionalTables(args);

        for (String keyspace : keyspaces)
        {
            try
            {
                probe.forceKeyspaceFlush(keyspace, tableNames);
            } catch (Exception e)
            {
                throw new RuntimeException("Error occurred during flushing", e);
            }
        }
    }
}
What I would like to replicate in my C++ software is this line:
probe.forceKeyspaceFlush(keyspace, tableNames);
Is it possible?
That's an unusual request, primarily because Cassandra is designed to be distributed, so if you're executing a query, you'd need to perform that blocking flush on each of the (potentially many) replicas. Rather than convince you that you don't really need this, I'll attempt to answer your question - however, you probably don't really need this.
Nodetool is using the JMX interface (on tcp/7199) to force that flush; your C/C++ driver talks over the native protocol (on tcp/9042). At this time, flush is not possible via the native protocol.
To work around the limitation, you'd need to either exec a JMX-capable command-line utility (nodetool or other), implement a JMX client in C++ (it's been done), or extend the native protocol. None of those are particularly pleasant options, but I imagine executing a JMX CLI utility is significantly easier than the other two.
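In case it helps to see what would have to be replicated, this is roughly the JMX call nodetool makes, sketched in Java (host, port, keyspace and table names are placeholders; the MBean and operation names are the ones NodeProbe delegates to, so double-check them against your Cassandra version):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class FlushViaJmx
{
    public static void main(String[] args) throws Exception
    {
        // Connect to the JMX agent of a single node (tcp/7199), not the native protocol port (tcp/9042).
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url, null))
        {
            MBeanServerConnection connection = connector.getMBeanServerConnection();
            ObjectName storageService = new ObjectName("org.apache.cassandra.db:type=StorageService");
            // The equivalent of probe.forceKeyspaceFlush(keyspace, tableNames), on this one node only.
            connection.invoke(storageService,
                              "forceKeyspaceFlush",
                              new Object[] { "my_keyspace", new String[] { "my_table" } },
                              new String[] { String.class.getName(), String[].class.getName() });
        }
    }
}

A C++ client would have to speak this JMX/RMI protocol to do the same thing, which is why exec'ing nodetool (or another JMX-capable CLI) ends up being the pragmatic option.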

Large file upload with Spark framework

I'm trying to upload large files to a web application using the Spark framework, but I'm running into out-of-memory errors. It appears that Spark is caching the request body in memory. I'd like to either cache file uploads on disk or read the request as a stream.
I've tried using the streaming support of Apache Commons FileUpload, but it appears that calling request.raw().getInputStream() causes Spark to read the entire body into memory and return an InputStream view of that chunk of memory, as done by this code. Based on the comment in the file, this is so that getInputStream can be called multiple times. Is there any way to change this behavior?
I recently had the same problem and I figured out that you could bypass the caching. I do so with the following function:
public ServletInputStream getInputStream(Request request) throws IOException {
    final HttpServletRequest raw = request.raw();
    if (raw instanceof ServletRequestWrapper) {
        return ((ServletRequestWrapper) raw).getRequest().getInputStream();
    }
    return raw.getInputStream();
}
This has been tested with Spark 2.4.
I'm not familiar with the inner workings of Spark, so one potential, minor downside of this function is that you don't know whether you get the cached InputStream or not; the cached version is reusable, the non-cached one is not.
To get around this downside I suppose you could implement a function similar to the following:
public boolean hasCachedInputStream(Request request) {
    return !(request.raw() instanceof ServletRequestWrapper);
}
The short answer is no, not that I can see.
SparkServerFactory builds the JettyHandler, which has a private static class HttpRequestWrapper that reads the InputStream into memory.
All that static stuff means there is no way to extend it.

Distributed Cache (Map-side Joins)

I would like to know more about the DistributedCache concept in MapReduce.
In my Mapper class below I wrote logic to read a file that is available in the cache.
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    super.setup(context);
    localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    for (Path myfile : localFiles) {
        String nameofFile = myfile.getName();
        File file = new File(nameofFile);
        // Close the reader once each cache file has been loaded.
        try (BufferedReader br = new BufferedReader(new FileReader(file))) {
            String line = br.readLine();
            while (line != null) {
                String[] arr = line.split("\t");
                myMap.put(arr[0], arr[1]);
                line = br.readLine();
            }
        }
    }
}
Can someone tell me when the above setup(context) method gets called? Is it called only once, or will it run for every map task?
It's called only once per Mapper or Reducer task. So if 10 mappers or reducers are spawned for a job, it will be called once for each of them.
As a general guideline, any work that only needs to be done once per task belongs in this method, e.g. getting the paths of distributed cache files, or reading parameters passed to the mappers and reducers.
The same applies to the cleanup() method.
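As a minimal illustration of that lifecycle (a sketch with made-up class and field names, using the org.apache.hadoop.mapreduce API): setup() runs once when the task starts, map() runs once per input record, and cleanup() runs once after the last record.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> myMap = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Called once per map task, before any call to map():
        // the right place to load the distributed-cache lookup file into myMap.
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Called once per input record; join against the in-memory map here.
        String[] fields = value.toString().split("\t");
        String joined = myMap.get(fields[0]);
        if (joined != null) {
            context.write(new Text(fields[0]), new Text(joined));
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Called once per map task, after the last call to map().
    }
}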

SharePoint 2013 query very slow

We set up a new SharePoint 2013 server to test how it would work as document storage.
The problem is that it is very slow, and I don't know why.
I adapted the following from MSDN:
ClientContext _ctx;

private void btnConnect_Click(object sender, RoutedEventArgs e)
{
    try
    {
        _ctx = new ClientContext("http://testSP1");
        Web web = _ctx.Web;

        Stopwatch w = new Stopwatch();
        w.Start();

        List list = _ctx.Web.Lists.GetByTitle("Test");
        Debug.WriteLine(w.ElapsedMilliseconds); // 24 first time, 0 second time
        w.Restart();

        CamlQuery q = CamlQuery.CreateAllItemsQuery(10);
        ListItemCollection items = list.GetItems(q);
        _ctx.Load(items);
        _ctx.ExecuteQuery();
        Debug.WriteLine(w.ElapsedMilliseconds); // 1800 first time, 900 second time
    }
    catch (Exception)
    {
        throw;
    }
}
There aren't very many documents in the Test list.
Just 3 folders and 1 Word file.
Any suggestions/ideas why it is this slow?
Storing unstructured content (Word docs, PDFs, anything except metadata) in SharePoint's SQL content database is going to result in slower upload and retrieval than if the files are stored on the file system. That's why Microsoft created the Remote BLOB (Binary Large Object) Storage interface to enable files to be managed in SharePoint but live on the file system or in the cloud. The bigger the files, the greater the performance hit.
There are several third-party solutions that leverage this interface, including my company's offering, Metalogix StoragePoint. You can reach out to me at trossi@metalogix.com if you would like to learn more, or visit http://www.metalogix.com/Products/StoragePoint/StoragePoint-BLOB-Offloading.aspx

REST web service synchronisation

I am new to web services. I have written a REST web service which creates and returns a PDF file. My code is as follows:
#Path("/hello")
public class Hello {
#GET
#Path("/createpdf")
#Produces("application/pdf")
public Response getpdf() {
synchronized(this){
try {
OutputStream file = new FileOutputStream(new File("c:/temp/FirstPdf5.pdf"));
Document document = new Document();
PdfWriter.getInstance(document, file);
document.open();
document.add(new Paragraph("Hello Kiran"));
document.add(new Paragraph(new Date().toString()));
document.close();
file.close();
} catch (Exception e) {
e.printStackTrace();
}
File file1 = new File("c:/temp/FirstPdf5.pdf");
ResponseBuilder response = Response.ok((Object) file1);
response.header("Content-Disposition",
"attachment; filename=new-android-book.pdf");
return response.build();
}
}
}
If multiple clients try to call the web service simultaneously, does it impact my code?
I mean, if client A is using the web service and client B tries to use it at the same time, will the PDF file get overwritten?
If my question is not clear, please let me know.
Thanks
As you are writing the file to the hard disk, multiple calls to the service will cause the file to be overwritten, or cause exceptions where the file is already in use.
If the file is the same for all users then you would only need to read the file rather than write it every time.
However, if the file is different for each user, you might try one of the following two options:
You could build the file in memory and then write the binary response directly to the response stream.
Alternatively you could create the file using a unique name, this could be a GUID or a random number this would ensure that you never have a clash between the multiple calls arriving at the server.
I would also ensure that you remove the files once they are no longer needed.
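For the first option, here is a sketch of what an in-memory variant of the getpdf() method from the question could look like, using the same iText and JAX-RS classes (it additionally needs java.io.ByteArrayOutputStream and the iText DocumentException import; the file name in the header is just an example):

@GET
@Path("/createpdf")
@Produces("application/pdf")
public Response getpdf() {
    try {
        // Build the PDF entirely in memory: there is no shared file on disk,
        // so concurrent requests cannot overwrite each other.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Document document = new Document();
        PdfWriter.getInstance(document, out);
        document.open();
        document.add(new Paragraph("Hello Kiran"));
        document.add(new Paragraph(new Date().toString()));
        document.close();

        return Response.ok(out.toByteArray())
                       .header("Content-Disposition", "attachment; filename=new-android-book.pdf")
                       .build();
    } catch (DocumentException e) {
        return Response.serverError().entity(e.getMessage()).build();
    }
}

Since everything lives in a local ByteArrayOutputStream, the synchronized block from the original code is no longer needed either.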