Jersey multipart streaming without disk buffering at the receiving server - web-services

I'm trying to stream (large) files over HTTP into a database. I'm using Tomcat and Jersey as Webframework. I noticed, that if I POST a file to my resource the file is first buffered on disk (in temp\MIME*.tmp} before it is handled in my doPOST method.
This is really an undesired behaviour since it doubles disk I/O and also leads to a somewhat bad UX, because if the browser is already done with uploading the user needs to wait a few minutes (depending on file size of course) until he gets the HTTP response.
I know that it's probably not the best implementation of a large file upload (since you don't even have any resume capabilities) but so are the requirements. :/
So my questions is, if there is any way to disable (disk) buffering for MULTIPART POSTs. Mem buffering is obviously too expensive, but I don't really see the need for disk buffering anyway? (Explain please) How do large sites like YouTube handle this situation? Or is there at least a chance to give the user immediate feedback if the file is sent? (Should be bad, since there could be still something like SQLException)

In case anybody is still interested, I solved the same issue by using the Apache Commons Streaming api
The code example on that page worked just fine for me.

Ok, so after days of reading and trying different stuff I stumbled upon HTTPServletRequest. At first I didn't even want to try since it takes away all the convenience methods from #FormDataParam but since i didn't know what else to do...
Turns out it helped. When I'm using #Context HTTPServletRequest request and request.getInputStream() i don't get disk buffering at all.
Now I just have to figure out how to get to the FormDataContentDisposition without #FormDataParam
Edit:
Ok. MultiPartFormData probably has to buffer on disk to parse the InputStream of the Request. So it seems I have to manually parse it myself, if I want to prevent any buffering :(

Your best bet is to take full control and write your own servlet that just grabs request.getInputStream (or request.getWriter if you are consuming text) and does the streaming itself. Most frameworks make your life "easy" by handling all the upload, temporary storage, etc. for you and often make it difficult to do things like streaming. It's quite easy to grab the stream yourself and do whatever you want.

I'm pretty sure Jersey is writing the files to disk to ensure memory is not flooded. Since you know exactly what you need to do with the incoming data -> stream into the database you probably have to write your own MessageBodyReader and get Jersey to use that to process your incoming multipart data.

Related

How to prevent QNetworkAccessManager from buffering POST contents from QIODevice?

I am not sure if the functionality I need isn't something that Qt is just not capable of but it seems that since there are multiple valid use cases maybe I am doing something wrong and someone has already dealt with this.
What I want do to is pretty simple - I want to issue a POST request with a given content length and provide the actual POST contents on the fly from a QIODevice. QNetworkAccessManager has a method QNetworkReply * QNetworkAccessManager::post(const QNetworkRequest & request, QIODevice * data) which seems to be ideal for this. The content length is large (lets assume 8 GB) and it seems that Qt tries to call QIODevice::readData to obtain all of the data before anything is emitted to the network (at least Wireshark is not showing anything whereas setting a small content length, for example 4, produces the behavior where 4 bytes are read and everything is transmitted). This has lead me to believe that Qt actually wants to buffer all of the POST content. I have explicitly set the QNetworkRequest::DoNotBufferUploadDataAttribute attribute but that does not change this.
It could be that it actually works like this and there is nothing I can do about it, but in that case, how would huge file upload that simply do not fit in memory work? Anyhow, any feedback from people who have experienced the same problem is welcome while I debug to see if Qt is actually buffering the whole thing.

How to calculate time to load a file into a application thorugh C++?

I have written a code in C++ to open a file in its default application like .doc in MS-Word now I want to calculate time to open a file into its application.
For that I need to know percentage of file loaded into that application. But from last 7 days I couldn't find any suitable solution. So can any one help me in solving this problem?
If i am using windows then can windows task manager help me to do this?
What you're trying to do is not only impossible, it doesn't even make sense.
When you play an MP3 in WMP, it doesn't load the whole file into memory. Instead, it maps a little bit of the file at a time into memory so it can decode the MP3 on the fly as it's playing. (I suppose if you play the song all the way through, without stopping or skipping or fast forwarding or rewinding, it will eventually read every byte of the file, probably finishing a few seconds before the song is over, but I doubt that's what you're looking for.)
Likewise, Word doesn't read any entire .doc file into memory (unless it's very small). That's how it's able to edit gigantic files without using huge amounts of memory. (Again, if you page through the whole file, it will probably eventually read every byte—for that matter, it may eventually copy enough of the file into an autosave backup file that it no longer needs to look at the original—but again, I doubt that's what you're looking for.)
If you only care about certain specific applications, and those applications have a COM Automation interface (as both WMP and Word do), they may have methods or events that will tell you when they're done "loading" a file (meaning they've read enough of it to start playing/displaying/etc.), or when they've "finished" with a file (meaning moved on to the next track, or whatever), but there's no generic answer to that; different applications will have different Automation interfaces. (And, as a side note, you really don't want to do COM Automation from C++ unless you really have to; it's much easier from jscript, vbscript, or your favorite .NET language…)
If the third party process does not signal that it has loaded something, e.g., through some output stream, one way will be to view the file handles being opened and closed by the processes. I presume this will be similar to how "task managers" like Process Explorer are able to view file handles of processes. However, if the process does not close the file handle once it is done "loading", then, you will not get an accurate time. Furthermore, you won't be able to get a "live" percentage of how much data has been loaded.

What is your strategy to write logs in your software to deal with possible HUGE amount of log messages?

Thanks for your time and sorry for this long message!
My work environment
Linux C/C++(but I'm new to Linux platform)
My question in brief
In the software I'm working on we write a LOT of log messages to local files which make the file size grow fast and finally use up all the disk space(ouch!). We want these log messages for trouble-shooting purpose, especially after the software is released to the customer site. I believe it's of course unacceptable to take up all the disk space of the customer's computer, but I have no good idea how to handle this. So I'm wondering if somebody has any good idea here. More info goes below.
What I am NOT asking
1). I'm NOT asking for a recommended C++ log library. We wrote a logger ourselves.
2). I'm NOT asking about what details(such as time stamp, thread ID, function name, etc) should be written in a log message. Some suggestions can be found here.
What I have done in my software
I separate the log messages into 3 categories:
SYSTEM: Only log the important steps in my software. Example: an outer invocation to the interface method of my software. The idea behind is from these messages we could see what is generally happening in the software. There aren't many such messages.
ERROR: Only log the error situations, such as an ID is not found. There usually aren't many such messages.
INFO: Log the detailed steps running inside my software. For example, when an interface method is called, a SYSTEM log message is written as mentioned above, and the entire calling routine into the internal modules within the interface method will be recorded with INFO messages. The idea behind is these messages could help us identify the detailed call stack for trouble-shooting or debugging. This is the source of the use-up-disk-space issue: There are always SO MANY INFO messages when the software is running normally.
My tries and thoughts
1). I tried to not record any INFO log messages. This resolves the disk space issue but I also lose a lot of information for debugging. Think about this: My customer is in a different city and it's expensive to go there often. Besides, they use an intranet that is 100% inaccessible from outside. Therefore: we can't always send engineers on-site as soon as they meet problems; we can't start a remote debug session. Thus log files, I think, are the only way we could make use to figure out the root of the trouble.
2). Maybe I could make the logging strategy configurable at run-time(currently it's before the software runs), that is: At normal run-time, the software only records SYSTEM and ERROR logs; when a problem arises, somebody could change the logging configuration so the INFO messages could be logged. But still: Who could change the configuration at run-time? Maybe we should educate the software admin?
3). Maybe I could always turn the INFO message logging on but pack the log files into a compressed package periodically? Hmm...
Finally...
What is your experience in your projects/work? Any thoughts/ideas/comments are welcome!
EDIT
THANKS for all your effort!!! Here is a summary of the key points from all the replies below(and I'll give them a try):
1). Do not use large log files. Use relatively small ones.
2). Deal with the oldest ones periodically(Either delete them or zip and put them to a larger storage).
3). Implement run-time configurable logging strategy.
There are two important things to take note of:
Extremely large files are unwieldy. They are hard to transmit, hard to investigate, ...
Log files are mostly text, and text is compressible
In my experience, a simple way to deal with this is:
Only write small files: start a new file for a new session or when the current file grows past a preset limit (I have found 50 MB to be quite effective). To help locate the file in which the logs have been written, make the date and time of creation part of the file name.
Compress the logs, either offline (once the file is finished) or online (on the fly).
Put up a cleaning routine in place, delete all files older than X days or whenever you reach more than 10, 20 or 50 files, delete the oldest.
If you wish to keep the System and Error logs longer, you might duplicate them in a specific rotating file that only track them.
Put altogether, this gives the following log folder:
Log/
info.120229.081643.log.gz // <-- older file (to be purged soon)
info.120306.080423.log // <-- complete (50 MB) file started at log in
(to be compressed soon)
info.120306.131743.log // <-- current file
mon.120102.080417.log.gz // <-- older mon file
mon.120229.081643.log.gz // <-- older mon file
mon.120306.080423.log // <-- current mon file (System + Error only)
Depending on whether you can schedule (cron) the cleanup task, you may simply spin up a thread for cleanup within your application. Whether you go with a purge date or a number of files limit is a choice you have to make, either is effective.
Note: from experience, a 50MB ends up weighing around 10MB when compressed on the fly and less than 5MB when compressed offline (on the fly is less efficient).
Your (3) is standard practice in the world of UNIX system logging.
When log file reaches a certain age or maximum size, start a new one
Zip or otherwise compress the old one
throw away the nth oldest compressed log
One way to deal with it is to rotate log files.
Start logging into a new file once you reach certain size and keep last couple of log files before you start overwriting the first one.
You will not have all possible info but you will have at least some stuff leading up to the issue.
The logging strategy sounds unusual but you have your reasons.
I would
a) Make the level of detail in the log messages configurable at run time.
b) Create a new log file for each day. You can then get cron to either compress them and/or delete them or perhaps transfer to off-ling storage.
My answer is to write long logs and then tweat out the info you want.
Compress them on a daily basis - but keep them for a week
I like to log a lot. In some programs I've kept the last n lines in memory and written to disk in case of an error or the user requesting support.
In one program it would keep the last 400 lines in memory and save this to a logging database upon an error. A separate service monitored this database and sent a HTTP request containing summary information to a service at our office which added this to a database there.
We had a program on each of our desktop machines that showed a list (updated by F5) of issues, which we could assign to ourselves and mark as processed. But now I'm getting carried away :)
This worked very well to help us support many users at several customers. If an error occurred on a PDA somewhere running our software then within a minute or so we'd get a new item on our screens. We'd often phone a user before they realised they had a problem.
We had a filtering mechanism to automatically process or assign issues that we knew we'd fixed or didn't care much about.
In other programs I've had hourly or daily files which are deleted after n days either by the program itself or by a dedicated log cleaning service.

File corruption detection and error handling

I'm a newbie C++ developer and I'm working on an application which needs to write out a log file every so often, and we've noticed that the log file has been corrupted a few times when running the app. The main scenarios seems to be when the program is shutting down, or crashes, but I'm concerned that this isn't the only time that something may go wrong, as the application was born out of a fairly "quick and dirty" project.
It's not critical to have to the most absolute up-to-date data saved, so one idea that someone mentioned was to alternatively write to two log files, and then if the program crashes at least one will still have proper integrity. But this doesn't smell right to me as I haven't really seen any other application use this method.
Are there any "best practises" or standard "patterns" or frameworks to deal with this problem?
At the moment I'm thinking of doing something like this -
Write data to a temp file
Check the data was written correctly with a hash
Rename the original file, and put the temp file in place.
Delete the original
Then if anything fails I can just roll back by just deleting the temp, and the original be untouched.
You must find the reason why the file gets corrupted. If the app crashes unexpectedly, it can't corrupt the file. The only thing that can happen is that the file is truncated (i.e. the last log messages are missing). But the app can't really jump around in the file and modify something elsewhere (unless you call seek in the logging code which would surprise me).
My guess is that the app is multi threaded and the logging code is being called from several threads which can easily lead to data corrupted before the data is written to the log.
You probably forgot to call fsync() every so often, or the data comes in from different threads without proper synchronization among them. Hard to tell without more information (platform, form of corruption you see).
A workaround would be to use logfile rollover, ie. starting a new file every so often.
I really think that you (and others) are wasting your time when you start adding complexity to log files. The whole point of a log is that it should be simple to use and implement, and should work most of the time. To that end, just write the log to an unbuffered stream (l;ike cerr in a C++ program) and live with any, very occasional in my experience, snafus.
OTOH, if you really need an audit trail of everything your app does, for legal reasons, then you should be using some form of transactional storage such as a SQL database.
Not sure if your app is multi-threaded -- if so, consider using Active Object Pattern (PDF) to put a queue in front of the log and make all writes within a single thread. That thread can commit the log in the background. All logs writes will be asynchronous, and in order, but not necessarily written immediately.
The active object can also batch writes.

Ensuring a file is flushed when file created in external process (Win32)

Windows Win32 C++ question about flushing file activity to disk.
I have an external application (ran using CreateProcess) which does some file creation. i.e., when it returns it will have created a file with some content.
How can I ensure that the file the process created was really flushed to disk, before I proceed?
By this I mean not the C++ buffers but really flushing disk (e.g. FlushFileBuffers).
Remember that I don't have access to any file HANDLE - this is all of course hidden inside the external process.
I guess I could open up a handle of my own to the file and then use FlushFileBuffers, but it's not clear this would work (since my handle doesn't actually contain anything which needs flushing).
Finally, I want this to run in non-admin userspace so I cannot use FlushFileBuffers on a whole volume.
Any ideas?
UPDATE: Why do I think this is a problem?
I'm working on a data backup application. Essentially it has to create some files as described. It then has to update it's internal DB (using SQLite embedded DB).
I recently had a data corruption issue which occurred during a bluescreen (the cause of which was unrelated to my app).
What I'm concerned about is application integrity during a system crash. And yes, I do care about this because this app is a data backup app.
The use case I'm concerned about is this:
A small data file is created using external process. This write is waiting in the OS cache to be written to disk.
I update the DB and commit. This is a disk activity. This write is also waiting in the OS cache.
A system failure occurs.
As I see it, we're now in a potential race condition. If "1" gets flushed and "2" doesn't then we're fine (as the DB transact wasn't then committed). If neither gets flushed or both get flushed then we're also OK.
As I understand it, the writes will be non-deterministic. i.e., I'm not aware that the OS will guarantee to write "1" before "2". (Am I wrong?)
So, if "2" gets flushed, but "1" doesn't then we have a problem.
What I observed was that the DB was correctly updated, but that the file had garbage in: the last 2 thirds of the data was binary "zeroes". Now, I don't know what it looks like when you have a file part flushed at the time of bluescreen, but I wouldn't be surprised if it looked like that.
Can I guarantee this is the cause? No I cannot guarantee this. I'm just speculating. It could just be that the file was "naturally" corrupted due to disk failure or as a result of the blue screen.
With regards to performance, this is something I believe I can deal with.
For example, the default behaviour of SQLite is to do a full file flush (using FlushFileBuffers) every time you commit a transaction. They are quite clear that if you don't do this then at the time of system crash, you might have a corrupted DB.
Also, I believe I can mitigate the performance hit by only flushing at "checkpoints". For example, writing 50 files, flushing the lot and then writing to the DB.
How likely is all this to be a problem? Beats me. But then my app might well be archiving at or around the time of system failure so it might be more likely that you think.
Hope that explains why I wan't to do this.
Why would you want this? The OS will make sure that the data is flushed to the disk in due time. If you access it, it will either return the data from the cache or from disk, so this is transparent for you.
If you need some safety in case of disaster, then you must call FlushFileBuffers, for example by creating a process with admin rights after running the external process. But that can severely impact the performance of the whole machine.
Your only other option is to modify the source of the other process.
[EDIT] The most simple solution is probably to copy the file in your process and then flush the copy (since you have the handle). Save the copy under a name which says "not committed in the database".
Then update the database. Write into the database, "updated from file ...". If this entry already exists next time, don't update the database and skip this step.
Flush the database to disk.
Rename the file to "file has been processed into database". Rename is an atomic operation (so it either happens or not).
If you can't think of a good filename for the different states, then use subfolders and move the file between them.
Well, there are no attractive options here. There is no documented way to retrieve the file handle you need from the process. Although there are undocumented ones, go there (via DuplicateHandle) only with careful consideration.
Yes, calling FlushFileBuffers on a volume handle is the documented way. You can avoid the privilege problem by letting a service make the call. Talk to it from your app with one of the standard process interop mechanisms. A named pipe whose name is prefixed with Global\ is probably the easiest way to get that going.
After your update I think http://sqlite.org/atomiccommit.html gives you the answers you need.
The way SQLite ensures that everything is flushed to disc works. So it works for you as well - take a look at the source.