Is AppFabric capable of caching files, or just database queries?

I have a function that resides on each of our nodes. The function requires access to a folder path for accessing files (not a database). Currently, I synchronize the folders on each of the nodes so that we avoid accessing a single shared drive. Can I avoid this synchronization step and utilize App Fabric caching on a folder? Or is the caching only utilized for formal database queries? Any help would be appreciated.

You can cache any kind of object in AppFabric, as long as it's serialisable (or serializable if you're in the US :-) ) (which I assume is so it can be marshalled between servers correctly). So you could put all the files in a folder into an AppFabric cache, if you cache each file as an array of bytes.
Just because you can, however, doesn't mean that you should. You haven't said whether you're only reading from these files or also writing to them. If you're only reading, then on the first read your code would get the byte array from the cache, deserialise it out to disk, and then read it; but on subsequent reads, why would you bother fetching the cached version again? If you're writing as well, then to update a file you'd again get the cached data, put it back onto disk as a file, update it, and then re-serialise it to update the cache. In a distributed environment like this I'd be concerned about how much time that would take, and about other servers making concurrent updates to the same data. You can get round that by introducing AppFabric's pessimistic concurrency, of course, but you'd have to decide for yourself whether the benefits outweigh the potential impact.
You'd also probably still need the set of files on each node, or the shared folder you're trying to avoid relying on, so that the first node that populates the cache actually has the data to populate it with!
I think I'd look at something like folder replication first for keeping the nodes in sync, ahead of AppFabric caching.

Related

High performance ways to stream local files as they're being written to network

Today a system exists that will write packet-capture files to the local disk as they come in. Dropping these files to local disk as the first step is deemed desirable for fault-tolerance reasons. If a client dies and needs to reconnect or be brought up somewhere else, we enjoy the ability to replay from the disk.
The next step in the data pipeline is trying to get this data that was landed to disk out to remote clients. Assuming sufficient disk space, it strikes me as very convenient to use the local disk (and the page cache on top of it) as a persistent, unbounded FIFO. It is also desirable to use the file system to keep the coupling between the producer and consumer low.
In my research, I have not found a lot of guidance on this type of architecture. More specifically, I have not seen well-established patterns in popular open-source libraries/frameworks for reading the file as it is being written to stream out.
My questions:
Is there a flaw in this architecture that I am not noting or indirectly downplaying?
Are there recommendations for consuming a file as it is being written, and efficiently blocking and/or asynchronously being notified when more data is available in the file?
A goal would be to either explicitly or implicitly have the consumer benefit from page-cache warmth. Are there any recommendations on how to optimize for this?
The file-based solution sounds clunky but could work, similar to how tail -f does it (a sketch follows below):
read the file until EOF, but do not close it
set up an inode watch (with inotify), waiting for more writes
repeat
The difficulty is usually with file rotation and cleanup, i.e. you need to watch for new files and/or truncation.
Having said that, it might be more efficient to connect to the packet-capture interface directly, or to set up a queue to which clients can subscribe.
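For concreteness, here is a minimal sketch of that tail -f-style loop using inotify (Linux-only). The capture-file path and buffer sizes are made up, and rotation handling is only hinted at in comments:

```cpp
// Minimal sketch of the tail -f-style loop above, using inotify (Linux-only).
// The capture-file path is hypothetical; rotation handling is only hinted at.
#include <sys/inotify.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    const char* path = "/var/capture/current.pcap";   // hypothetical path
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    int in_fd = inotify_init();
    if (in_fd < 0) { perror("inotify_init"); return 1; }
    // Watch for appended data and for rotation/removal of the file.
    inotify_add_watch(in_fd, path, IN_MODIFY | IN_MOVE_SELF | IN_DELETE_SELF);

    char data[4096];
    char events[4096];
    for (;;) {
        // Drain everything currently in the file (likely warm in the page cache).
        ssize_t n;
        while ((n = read(fd, data, sizeof data)) > 0) {
            fwrite(data, 1, (size_t)n, stdout);        // "stream" to the consumer
        }
        // Block until the writer appends more data (or rotates the file).
        if (read(in_fd, events, sizeof events) <= 0) break;
        // A real implementation would parse the inotify_event records here and
        // reopen/rewatch the file on IN_MOVE_SELF / IN_DELETE_SELF (rotation).
    }
    close(in_fd);
    close(fd);
    return 0;
}
```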

Databases vs Files (performance)

My C++ program has to read information about 256 images just one time. The information is simple: the path and some floats per image.
I don't need any kind of concurrent access. Also, I don't care about writing, deleting or updating the information and I don't have to do any kind of complex query. This is my pipeline:
Read information about one image.
Store that information in an object.
Do some calculation with the information.
Delete the object.
Next image.
I can use 256 files (one per image, each holding the same kind of information), one file with all the information, or a PostgreSQL database. Which will be faster?
Your question of which will be faster is tricky, as performance depends on many different factors, including the OS, whether the database or file system is on the same machine as your application, the size of the images, etc. I would guess you could find some combination that makes any of your options the fastest if you tried hard enough.
Having said that, if everything is running on the same machine, a file-based approach would intuitively seem faster than a database, simply because a database provides more functionality and hence does more work (not just serving requests, but background tasks as well), so it uses more of your computing power.
Similarly, it seems intuitive that a single file will be more efficient than multiple files, as it saves the open (and, if necessary, close) operations associated with multiple files. But again, giving an absolute answer is hard, as opening and closing many files may be a common use case that certain OSes have optimised, making it as fast as (or even faster than) just using a single file.
If performance is very important for your solution, it is hard to avoid having to do some comparative testing with your target deployment systems.
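If the single-file option is chosen, a minimal sketch of the read loop might look like the following. The file name and record layout (the path followed by its floats, one image per line) are assumptions for illustration only:

```cpp
// Minimal sketch of the single-file option: one image per line,
// "path f0 f1 f2 ..." (this record format is assumed for illustration).
#include <fstream>
#include <sstream>
#include <string>
#include <vector>
#include <iostream>

struct ImageInfo {                 // hypothetical per-image record
    std::string path;
    std::vector<float> values;
};

int main() {
    std::ifstream in("images.txt");            // hypothetical file name
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream ss(line);
        ImageInfo info;
        ss >> info.path;
        for (float f; ss >> f; ) info.values.push_back(f);

        // ... do the per-image calculation with `info` here ...
        std::cout << info.path << ": " << info.values.size() << " floats\n";
    }                                          // `info` is discarded each iteration
    return 0;
}
```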

Cross-platform atomic writes/renames without a transactional FS in C++

I'm working on an app that needs to ensure the consistency of the data it saves to disk. I need to guarantee that the data never gets corrupted when dumped to disk, i.e. a reboot or app shutdown could happen while the data is being saved.
I know the steps that need to be done:
http://blogs.msdn.com/b/adioltean/archive/2005/12/28/507866.aspx
But I was wondering whether there's already an implementation of this, preferably cross-platform? I presume boost::filesystem guarantees atomic rename (on Windows and POSIX), so I'm wondering whether I missed this functionality somewhere in Boost. Thanks
UPD: I had hopes for boost::interprocess::message_queue, but it just hangs on reading the queue if the process is killed in the middle of adding to it, and the memory-mapped file takes up its maximum size on disk (which is expected to be the worst case anyway).
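For reference, a minimal sketch of the write-to-a-temp-file-then-rename step using std::filesystem (C++17). This covers only the rename portion; a fully durable save also needs to flush the file and the containing directory to stable storage, which is platform-specific and omitted here:

```cpp
// Minimal sketch of the write-to-temp-then-rename step with std::filesystem
// (C++17). Only the rename portion is shown; durable writes also require
// flushing the file and the containing directory, which is platform-specific.
#include <filesystem>
#include <fstream>
#include <string>

void save_atomically(const std::filesystem::path& target, const std::string& data) {
    std::filesystem::path tmp = target;
    tmp += ".tmp";                                   // sibling temp file, same directory
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        out.write(data.data(), static_cast<std::streamsize>(data.size()));
        out.flush();                                 // flush the stream buffer to the OS
    }                                                // destructor closes the file
    // POSIX rename() atomically replaces the target; on Windows the standard
    // library maps this to a replace-existing move, but the exact durability
    // guarantees depend on the filesystem.
    std::filesystem::rename(tmp, target);
}
```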
If you use renaming, you can take a performance hit and/or lose all the app's data. Maybe a better way is to store some key information (a record ID and a fingerprint, for example) after each record, and to seek back to the last correct record when the application starts? A sketch of that idea follows.
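A minimal sketch of that append-and-verify idea, assuming a simple [length][payload][fingerprint] record layout with FNV-1a as a stand-in fingerprint. The format is made up, and endianness and partial-write edge cases would need more care in a real implementation:

```cpp
// Minimal sketch of the append-and-verify idea: each record is written as
// [length][payload][fingerprint], and on startup the file is scanned forward
// until the first record whose fingerprint does not match (a partial write).
// FNV-1a is used here only as a stand-in fingerprint; the layout is made up
// and ignores endianness/portability concerns.
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

static uint64_t fnv1a(const char* p, size_t n) {
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < n; ++i) { h ^= (unsigned char)p[i]; h *= 1099511628211ULL; }
    return h;
}

void append_record(std::ofstream& out, const std::string& payload) {
    uint32_t len = (uint32_t)payload.size();
    uint64_t sum = fnv1a(payload.data(), payload.size());
    out.write((const char*)&len, sizeof len);
    out.write(payload.data(), len);
    out.write((const char*)&sum, sizeof sum);
    out.flush();
}

// Returns the payloads of all records that verify; anything after the first
// corrupt record is treated as a torn write and ignored.
std::vector<std::string> load_valid_records(const std::string& path) {
    std::vector<std::string> records;
    std::ifstream in(path, std::ios::binary);
    uint32_t len;
    while (in.read((char*)&len, sizeof len)) {
        if (len > (1u << 20)) break;               // arbitrary sanity limit on record size
        std::string payload(len, '\0');
        uint64_t sum;
        if (!in.read(&payload[0], len) || !in.read((char*)&sum, sizeof sum)) break;
        if (fnv1a(payload.data(), payload.size()) != sum) break;
        records.push_back(std::move(payload));
    }
    return records;
}
```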

How can we use caching to reduce file-system write cycles for SQLite databases?

I would like to implement caching for an SQLite database. My primary objective is to write data to RAM, and when the cache is filled, to flush all the data to the on-disk database. I would like to know whether this is possible at all, and if so, could I have some sample code?
Thanks
SQLite already does its own caching, which is likely to be more efficient than anything you can implement; you can read about the interface to it here. You may also be interested in other optimisations; there is a FAQ here.
You might want to check out the SQLite fine-tuning commands (pragmas).
Since sqlite is transactional, it relies on fsync to ensure a particular set of statements have completed when a transaction is committed. The speed and implementation of fsync varies from platform to platform.
So, by batching several statements within a transaction, you can get a significant increase in speed since several blocks of data will be written before fsync is called.
An older sqlite article here illustrates the difference between doing several INSERTs inside and outside a transaction.
However, if you are writing an application needing concurrent access to data, note that when sqlite starts a write transaction, all reads (select statements) will be blocked. You may want to explore using your in memory cache to retrieve data while a write transaction is taking place.
With that said, it's also possible that sqlite's caching scheme will handle that for you.
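A minimal sketch of the batched-transaction approach described above, using the sqlite3 C API from C++ (the file, table, and column names are made up):

```cpp
// Minimal sketch of batching many INSERTs inside one transaction, so the
// fsync cost is paid once at COMMIT. File, table and column names are made up.
#include <sqlite3.h>

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("example.db", &db) != SQLITE_OK) return 1;

    sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS samples(id INTEGER, value REAL)",
                 nullptr, nullptr, nullptr);

    sqlite3_exec(db, "BEGIN TRANSACTION", nullptr, nullptr, nullptr);

    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db, "INSERT INTO samples(id, value) VALUES(?, ?)",
                       -1, &stmt, nullptr);
    for (int i = 0; i < 10000; ++i) {
        sqlite3_bind_int(stmt, 1, i);
        sqlite3_bind_double(stmt, 2, i * 0.5);
        sqlite3_step(stmt);            // no sync to disk here
        sqlite3_reset(stmt);
    }
    sqlite3_finalize(stmt);

    // One COMMIT (and one fsync barrier) makes all 10,000 rows durable.
    sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr);
    sqlite3_close(db);
    return 0;
}
```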
Why do you want to do this? Are you running into performance issues? Or do you want to prevent other connections from seeing data until you commit it to disk?
Regarding syncing to disk, there is a tradeoff between database integrity and speed. Which you want depends on your situation.
Use transactions. Advantages: High reliability and simple. Disadvantages: once you start a transaction, no one else can write to the database until you COMMIT or ROLLBACK. This is usually the best solution. If you have a lot of work to do at once, begin a transaction, write everything you need, then COMMIT. All your changes will be cached in RAM until you COMMIT, at which time the database will explicitly sync to disk.
Use PRAGMA journal_mode=MEMORY and/or PRAGMA synchronous=OFF. Advantages: High speed and simple. Disadvantages: The database is no longer safe against power loss and program crashes. You can lose your entire database with these options. However, they avoid explicitly syncing to disk as often.
Write your changes to an in-memory database and manually sync when you want. Advantages: High speed and reliable. Disadvantages: Complicated, and another program can write to the database without you knowing about it. By writing to an in-memory database, you never need to sync to disk until you want to. Other programs can write to the database file, and if you're not careful you can overwrite those changes. This option is probably too complicated to be worth it.
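A minimal sketch of that third option, working against an in-memory database and flushing it to an on-disk database on demand via SQLite's online backup API (file and table names are made up):

```cpp
// Minimal sketch of working against an in-memory database and flushing it
// to an on-disk database on demand via SQLite's online backup API.
// File and table names are made up.
#include <sqlite3.h>

// Copies the whole in-memory database over the on-disk file.
int flush_to_disk(sqlite3* mem_db, const char* disk_path) {
    sqlite3* disk_db = nullptr;
    int rc = sqlite3_open(disk_path, &disk_db);
    if (rc == SQLITE_OK) {
        sqlite3_backup* backup = sqlite3_backup_init(disk_db, "main", mem_db, "main");
        if (backup) {
            sqlite3_backup_step(backup, -1);       // copy all pages in one go
            sqlite3_backup_finish(backup);
        }
        rc = sqlite3_errcode(disk_db);
    }
    sqlite3_close(disk_db);
    return rc;
}

int main() {
    sqlite3* mem_db = nullptr;
    sqlite3_open(":memory:", &mem_db);             // all writes go to RAM
    sqlite3_exec(mem_db, "CREATE TABLE cache(k TEXT, v TEXT)", nullptr, nullptr, nullptr);
    sqlite3_exec(mem_db, "INSERT INTO cache VALUES('a', '1')", nullptr, nullptr, nullptr);

    // ... when the in-memory cache is considered full, persist it ...
    flush_to_disk(mem_db, "cache.db");
    sqlite3_close(mem_db);
    return 0;
}
```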

Fastest small datastore on Windows

My app keeps track of the state of about 1000 objects. Those objects are read from and written to a persistent store (serialized) in no particular order.
Right now the app uses the registry to store each object's state. This is nice because:
It is simple
It is very fast
An individual object's state can be read/written without needing to read some larger entity (like pulling a snippet out of a large XML file)
There is a decent editor (RegEdit) which allows easy manipulation of individual items
Having said that, I'm wondering if there is a better way. SQLite seems like a possibility, but you don't get the same multiple-reader/multiple-writer behaviour as with the registry, and there's no simple way to edit existing entries.
Any better suggestions? A bunch of flat files?
If what you mean by 'multiple-reader/multiple-writer' is that you keep a lot of threads writing to the store concurrently, SQLite is threadsafe (you can have concurrent SELECTs, and concurrent writes are handled transparently). See the FAQ [1] and grep for 'threadsafe'.
[1]: http://www.sqlite.org/faq.html
If you do begin to experiment with SQLite, you should know that "out of the box" it might not seem as fast as you would like, but it can quickly be made to be much faster by applying some established optimization tips:
SQLite optimization
Depending on the size of the data and the amount of RAM available, one of the best performance gains will occur by setting sqlite to use an all-in-memory database rather than writing to disk.
For in-memory databases, pass NULL as the filename argument to sqlite3_open and make sure that TEMP_STORE is defined appropriately
On the other hand, if you tell sqlite to use the harddisk, then you will get a similar benefit to your current usage of RegEdit to manipulate the program's data "on the fly."
The way you could simulate your current RegEdit technique with sqlite would be to use the sqlite command-line tool to connect to the on-disk database. You can run UPDATE statements on the sql data from the command-line while your main program is running (and/or while it is paused in break mode).
I doubt any sane person would go this route these days; however, some of what you describe could be done with Windows' Structured Storage (compound files). I only mention this since you're asking about Windows, and this is/was an official Windows way to do it.
This is how DOC files were put together (but not the new DOCX format). From MSDN it'll appear really complicated, but I've used it, and it isn't the worst API in Win32.
it is not simple
it is fast; I would guess it's faster than the registry.
Individual object's state can be read/written without needing to read some larger entity.
There is no decent editor; however, there are some really basic tools. (VC++ 6.0 had the "DocFile Viewer" under Tools; yeah, that's what that thing did.) I found a few more online.
You get a file instead of registry keys.
You gain some old-school Windows developer geek-cred.
Other random thoughts:
I think XML is the way to go (despite the random-access issue). Heck, INI files may work. The registry gives you very fine-grained security if you need it; people seem to forget this when they claim files are better. An embedded DB seems like overkill if I'm understanding what you're doing.
Do you need to persist the objects on each change event, or can you keep them in memory and store them on shutdown? If the latter, just load them up at startup and serialize them at the end. Assuming your app runs for a long time (and you don't share that state with another program), in-memory is going to be the winner.
If you've got fixed-size structures, then you could consider just using a memory-mapped file and allocating storage from that.
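A minimal sketch of that memory-mapped-file idea on Windows, treating the file as an array of fixed-size records (the record layout and file name are made up for illustration):

```cpp
// Minimal sketch of a fixed-size-record store backed by a memory-mapped file
// on Windows. The record layout and file name are made up for illustration.
#include <windows.h>
#include <cstdio>

struct ObjectState {                // hypothetical fixed-size record
    int   id;
    float value;
};

int main() {
    const DWORD kCount = 1000;
    const DWORD kBytes = kCount * sizeof(ObjectState);

    HANDLE file = CreateFileA("objects.dat", GENERIC_READ | GENERIC_WRITE, 0,
                              nullptr, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    // Creating the mapping larger than the file extends the file to kBytes.
    HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READWRITE, 0, kBytes, nullptr);
    if (!mapping) { CloseHandle(file); return 1; }

    ObjectState* objects = static_cast<ObjectState*>(
        MapViewOfFile(mapping, FILE_MAP_ALL_ACCESS, 0, 0, kBytes));
    if (!objects) { CloseHandle(mapping); CloseHandle(file); return 1; }

    // Reads and writes are plain array accesses; the OS pages them to disk.
    objects[42].id = 42;
    objects[42].value = 3.14f;
    std::printf("object 42 -> %f\n", objects[42].value);

    UnmapViewOfFile(objects);
    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}
```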
If the only thing you do is serialize/deserialize individual objects (no fancy queries), then use a btree database, for example Berkeley DB. It is very fast at storing and retrieving chunks of data by key (I assume your objects have some id that can be used as a key) and access by multiple processes is supported.
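A minimal sketch of the Berkeley DB route with its classic C API, usable from C++. The key scheme and file name are made up, and the DB->open signature shown (with a transaction argument) is the 4.x-and-later form, so treat that as an assumption to check against your version:

```cpp
// Minimal sketch: store and fetch a serialized object by key with
// Berkeley DB's B-tree access method. Key scheme and file name are
// made up; error handling is reduced to early returns.
#include <db.h>       // Berkeley DB C API
#include <cstring>
#include <cstdio>

int main() {
    DB* dbp = nullptr;
    if (db_create(&dbp, nullptr, 0) != 0) return 1;
    // DB->open(db, txn, file, database, type, flags, mode)
    if (dbp->open(dbp, nullptr, "objects.db", nullptr, DB_BTREE, DB_CREATE, 0664) != 0)
        return 1;

    const char* key_str = "object:42";            // hypothetical key
    const char* payload = "serialized state...";  // hypothetical payload

    DBT key, data;
    std::memset(&key, 0, sizeof key);
    std::memset(&data, 0, sizeof data);
    key.data = (void*)key_str;   key.size = (u_int32_t)std::strlen(key_str);
    data.data = (void*)payload;  data.size = (u_int32_t)std::strlen(payload) + 1;

    dbp->put(dbp, nullptr, &key, &data, 0);       // write (or overwrite) the record

    DBT result;
    std::memset(&result, 0, sizeof result);
    if (dbp->get(dbp, nullptr, &key, &result, 0) == 0)
        std::printf("fetched: %s\n", (const char*)result.data);

    dbp->close(dbp, 0);
    return 0;
}
```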