Chunk ItemReader processing multiple files - jberet

I need to process multiple input files through a chunk step. Before I reinvent the wheel, is there a BeanIO ItemReader out there that can do this? Or another approach?

I ended up creating a batchlet that processes the list of files by programmatically creating a chunk job and executing it.

Related

Deleting millions of files from S3

I need to delete 64 million objects from a bucket, leaving about the same number of objects untouched. I have created an inventory of the bucket and used that to create a filtered inventory that has only the objects that need to be deleted.
I created a Lambda function that uses NodeJS to 'async' delete the objects that are fed to it.
I have created smaller inventories (10s, 100s and 1000s of objects) from the filtered one, and used S3 Batch Operation jobs to process these, and those all seem to check out: the expected files were deleted, and all other files remained.
Now, my questions:
Am I doing this right? Is this the preferred method to delete millions of files, or did my Googling misfire?
Is it advisable to just create one big batch job and let that run, or is it better to break it up into chunks of, say, a million objects?
How long will this take (approx. of course)? Will S3 Batch go through the list and do each file sequentially? Or does it automagically scale out and do a whole bunch in parallel?
What am I forgetting?
Any suggestions, thoughts or criticisms are welcome. Thanks!
You might have a look at the Step Functions Distributed Map feature. I don't know your specific use case, but it could help you get the proper scaling.
Here is a short blog entry on how you can achieve it.
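For illustration, here is a minimal Python/boto3 sketch of the kind of per-task handler an S3 Batch Operations job invokes (the asker's handler is NodeJS; the event and response fields below follow the S3 Batch Operations Lambda invocation schema as I understand it, and error handling is reduced to the bare minimum):

import boto3
from urllib.parse import unquote_plus

s3 = boto3.client("s3")

def handler(event, context):
    # Delete each object handed to us by the S3 Batch Operations job
    results = []
    for task in event["tasks"]:
        bucket = task["s3BucketArn"].split(":::")[-1]
        key = unquote_plus(task["s3Key"])
        try:
            s3.delete_object(Bucket=bucket, Key=key)
            code, message = "Succeeded", ""
        except Exception as exc:  # report the failure back in the job's completion report
            code, message = "PermanentFailure", str(exc)
        results.append({"taskId": task["taskId"], "resultCode": code, "resultString": message})
    return {
        "invocationSchemaVersion": event["invocationSchemaVersion"],
        "treatMissingKeysAs": "PermanentFailure",
        "invocationId": event["invocationId"],
        "results": results,
    }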

How to synchronously add files in a queue?

I have a Django server running which utilizes files in a directory /PROCESSING_DOCS/*.json. An API call dynamically adds more files to this folder. Now I need to maintain a queue that is updated as files are added to that folder.
How can I implement this? I don't have any idea.
Here are a few suggestions right off the top of my head:
If you just need to keep a log of what files were added, processing status, etc:
- Since you're doing a lot of I/O, you can add another file (ex: named files_queue) and append the filenames one per line. Later you may add additional details (CSV style) about each file (it would be a bit of a challenge to search through it if this file grows big).
- Related to the first idea, if the number of files is not an issue you may create a file (like a .lock file, for example) for each file processed and maybe store all processing details in it (that way it will be easy to search).
- If your application is connected to a database, create a table (ex: named files_queue) and insert one row per file. Later you may add additional columns to the table to store additional details about each file (a minimal sketch of this follows below).
If you're looking for a queue manager, there are a few solutions just a "python queue" Google search away. I personally have used RabbitMQ.
Hope this helps,
Cheers!
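If you go with the database idea, a minimal Django sketch might look like this (the model and field names are purely illustrative):

# queue/models.py
from django.db import models

class FileQueueEntry(models.Model):
    path = models.CharField(max_length=500, unique=True)
    added_at = models.DateTimeField(auto_now_add=True)
    processed = models.BooleanField(default=False)

# Wherever the API call drops a new file into PROCESSING_DOCS:
#   FileQueueEntry.objects.get_or_create(path=saved_path)
# A worker can then pull items in arrival order:
#   for entry in FileQueueEntry.objects.filter(processed=False).order_by("added_at"):
#       process(entry.path)  # your processing function
#       entry.processed = True
#       entry.save(update_fields=["processed"])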

AWS S3: distributed concatenation of tens of millions of json files in s3 bucket

I have an s3 bucket with tens of millions of relatively small json files, each less than 10 K.
To analyze them, I would like to merge them into a small number of files, each having one JSON document per line (or some other separator) and several thousand such lines.
This would allow me to more easily (and more performantly) use all kinds of big data tools out there.
Now, it is clear to me that this cannot be done with one command or function call; rather, a distributed solution is needed because of the number of files involved.
The question is whether there is something ready and packaged, or whether I must roll my own solution.
I don't know of anything out there that can do this out of the box, but you can pretty easily do it yourself. The solution also depends a lot on how fast you need to get this done.
2 suggestions:
1) List all the files, split the list, download each section, merge and re-upload (sketched below).
2) List all the files, then go through them one at a time, read/download each one and write it to a Kinesis stream. Configure Kinesis to dump the files to S3 via Kinesis Firehose.
In both scenarios the tricky bit is going to be handling failures and ensuring you don't get the data multiple times.
For completeness, if the files were larger (>5 MB) you could also leverage http://docs.aws.amazon.com/AmazonS3/latest/API/mpUploadUploadPartCopy.html, which would allow you to merge files in S3 directly without having to download them.
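For suggestion 1), one worker's slice of the job might look roughly like this in Python/boto3 (bucket, prefix, and output key names are placeholders, and failure handling/retries are left out):

import json
import boto3

s3 = boto3.client("s3")

def list_keys(bucket, prefix):
    # Paginate through the bucket listing and yield every key under the prefix
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]

def merge_shard(bucket, keys, out_key):
    # Download this worker's slice of small JSON objects and re-upload them
    # as a single JSON-lines object
    lines = []
    for key in keys:
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        # Collapse any pretty-printed JSON onto one line
        lines.append(json.dumps(json.loads(body), separators=(",", ":")))
    s3.put_object(Bucket=bucket, Key=out_key, Body="\n".join(lines).encode("utf-8"))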
Assuming each json file is one line only, then I would do:
cat * >> bigfile
This will concatenate all files in the directory into the new file bigfile.
You can now read bigfile one line at a time, json decode the line and do something interesting with it.
If your json files are formatted for readability, then you will first need to combine all the lines in the file into one line.
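In Python, the line-at-a-time read described above would look like this (the file name matches the example):

import json

with open("bigfile") as f:
    for line in f:
        record = json.loads(line)  # one JSON document per line
        # ... do something interesting with record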

Is there a way to limit the number of output files of a process?

An application of our company uses pdfimages (from xpdf) to check whether certain pages in a PDF file, on which we know there is no text, consist of a single image.
For this we run pdfimages on that page and count whether only one, two or more, or zero output files have been created (these could be JPG, PPM, PGM or PBM).
The problem is that for some PDF files, we get millions of 14-byte PPM images, and the process has to be killed manually.
We know that by assigning the process to a job we can restrict how long the process will run. But it would probably be better if we could ensure that the process creates at most two new files during its execution.
Do you have any clue for doing that?
Thank you.
One approach is to monitor the directory for file creations: http://msdn.microsoft.com/en-us/library/aa365261(v=vs.85).aspx - the monitoring app could then terminate the PDF image extraction process.
Another would be to use a simple ramdisk which limited the number of files that could be created: you might modify something like http://support.microsoft.com/kb/257405.
If you can set up a FAT16 filesystem, I think there's a limit of 128 files in the root directory, 512 in other dirs? - with such small files that would be reached quickly.
Also, aside from my 'joke' comment, you might want to check out _setmaxstdio and see if that helps ( http://msdn.microsoft.com/en-us/library/6e3b887c(VS.71).aspx ).
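As a sketch of the directory-monitoring idea from the first answer, here is a simple polling version in Python rather than the Win32 change-notification API (paths, the page range, and the file limit are illustrative):

import os
import subprocess
import time

OUT_DIR = r"C:\pdfimages_out"  # directory that receives the extracted images
MAX_FILES = 2                  # abort once more than two output files appear
proc = subprocess.Popen(["pdfimages", "-f", "3", "-l", "3", "input.pdf",
                         os.path.join(OUT_DIR, "img")])
while proc.poll() is None:     # still running?
    if len(os.listdir(OUT_DIR)) > MAX_FILES:
        proc.kill()            # too many output files: stop the extraction
        break
    time.sleep(0.1)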

Django package multiple files in a single download

In Django I'm looking for a way to serve several different files at once. I can't use static archives (.zip, .tar, etc.) because I don't have enough storage to cache these files and it will take far too long to generate them on the fly (each could be in the 100s of megabytes).
Is there a way I can indicate to the browser that several files are coming its way? Perhaps there is a container format that I can indicate before streaming files to the user?
Edit: There could be hundreds of files in each package so asking the user to download each one is very time consuming.
Ah, the .tar file format can be streamed. I'll experiment with this for now.
http://docs.python.org/library/tarfile.html
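A minimal sketch of streaming a tar archive out of a Django view without building it on disk first (the file paths and view wiring are illustrative, and note that each member is buffered in memory while it is being added):

import os
import tarfile
from django.http import StreamingHttpResponse

class TarBuffer:
    # Write-only buffer that hands finished chunks to the HTTP response
    def __init__(self):
        self._chunks = []
    def write(self, data):
        self._chunks.append(data)
        return len(data)
    def pop(self):
        chunks, self._chunks = self._chunks, []
        return b"".join(chunks)

def stream_tar(paths):
    buf = TarBuffer()
    # "w|" opens the archive in stream mode, so no seeking is required
    with tarfile.open(mode="w|", fileobj=buf) as tar:
        for path in paths:
            tar.add(path, arcname=os.path.basename(path))
            yield buf.pop()
    yield buf.pop()  # flush the end-of-archive blocks

def download_view(request):
    paths = ["/data/file1.bin", "/data/file2.bin"]  # build this list from your own models
    response = StreamingHttpResponse(stream_tar(paths), content_type="application/x-tar")
    response["Content-Disposition"] = 'attachment; filename="package.tar"'
    return response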