Fastest/Best way to serialize and deserialize data from database [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
In a few months I will start to write my bachelor-thesis. Although we only discussed the topic of my thesis very roughly, the main problem will be something like this:
A program written in C++ (more or less an HTTP server, but I guess it doesn't matter here) has to be executed to fulfill its task. There are several instances of this program running at the same time, and a load balancer takes care of distributing HTTP requests equally between all instances. Every time the program's code is changed to enhance it or to get rid of bugs, all instances have to be restarted. This can take up to 40 minutes for one instance. As there are more than ten instances running, the restart process can take up to one work day. This is way too slow.
The presumed bottleneck is the access to the database during startup to load all necessary data (I guess it will be a MySQL database). The team leader's idea for decreasing the startup time is to serialize the content of the database to a file, and read from this file instead of reading from the database. That would be my task. Of course, the problem is to check whether there is new data in the database that is not in the file. I guess writes are still applied to the database, not to the serialized file. My first idea is to use Apache Thrift for serialization and deserialization, as I have already worked with it and it is fast, as far as I know (maybe I will write a small Python program to take care of this). However, I have some basic questions regarding this problem:
Is it a good solution to read from a file instead of reading from the database? Is there any chance this will save time?
Would Thrift work well in this scenario, or is there some faster way to serialize/deserialize?
As I am only reading, not writing, I don't have to take care of consistency, right?
Can you recommend some books or online literature worth reading on this topic?
If I'm missing Information, just ask. Thanks in advance. I just want to be well informed and prepared before I start with the thesis, this is why I ask.
Kind regards
Michael

Cache is king
As a general recommendation: Cache is king, but don't use files.
Cache? What cache?
The cache I'm talking about is of course an external cache. There are plenty of systems available, and many of them can form a cache cluster with cached items spread across multiple machines' RAM. If you do it cleverly, the cost of serializing/deserializing into memory will make your algorithms shine, compared to the cost of grinding the database. And on top of that, you get nice features like TTL for cached data, a cache that persists even if your business logic crashes, and much more.
What about consistency?
As I am only reading, not writing, I don't have to take care of consistency, right?
Wrong. The issue is not who writes to the database. It is about whether or not someone writes to the database, how often this happens, and how up-to-date your data needs to be.
Even if you cache your data in a file as planned in your question, you have to be aware that this produces a redundant duplicate of the data, disconnected from the original data source. So the real question you have to answer (I can't do this for you) is what the optimum update frequency should be. Do you need updates in near real time? Is a certain time lag acceptable?
This is exactly the purpose of the TTL (time to live) value that you can put onto your cached data. If you need more frequent updates, set a short TTL. If you are ok with updates in a slower frequency, set the TTL accordingly or have a scheduled task/thread/process running that does the update.
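To make the TTL idea concrete, here is a minimal in-process sketch. It is not Redis or Memcached; the class TtlCache, its loader callback and the explicit `now` parameter are all hypothetical names chosen for illustration. The loader stands in for the expensive database read, and an entry is reloaded only once its TTL has expired.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical in-process stand-in for an external cache with per-entry TTL.
class TtlCache {
public:
    using Clock = std::chrono::steady_clock;
    using Loader = std::function<std::string(const std::string&)>;

    TtlCache(std::chrono::seconds ttl, Loader loader)
        : ttl_(ttl), loader_(std::move(loader)) {
        assert(loader_ && "a loader callback is required");
    }

    // Returns the cached value while it is fresh; otherwise calls the loader
    // (the "database") and caches the result for another TTL period.
    // `now` is a parameter so expiry can be exercised without sleeping.
    std::string get(const std::string& key,
                    Clock::time_point now = Clock::now()) {
        auto it = entries_.find(key);
        if (it == entries_.end() || now >= it->second.expires_at) {
            Entry fresh{loader_(key), now + ttl_};
            it = entries_.insert_or_assign(key, std::move(fresh)).first;
        }
        return it->second.value;
    }

private:
    struct Entry {
        std::string value;
        Clock::time_point expires_at;
    };

    std::chrono::seconds ttl_;
    Loader loader_;
    std::unordered_map<std::string, Entry> entries_;
};
```

A short TTL trades more database reads for fresher data; a long TTL does the opposite. A real external cache would additionally survive a crash of the business logic, which this in-process sketch does not.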
Ok, understood. Now what?
Check out Redis, or the "oldtimer" Memcached. You didn't say much about your platform, but there are Linux and Windows versions available for both (and especially on Windows you will have a lot more fun with Redis).
PS: Oh yes, Thrift serialization can be used for the serialization part.

Related

Fast, binary database alternative [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 6 years ago.
I want to implement a fast database alternative that only needs to handle binary data.
To specify, I want something close to a database that will be securely stored even in case of a forced termination (task manager) during execution, whilst also being accessed directly from memory in C++. Like a vector of structs that is mirrored to the hard disk. It should be able to handle hundreds of thousands of read accesses and at least 1000 write accesses per second. In case of a forced termination, at most the last command can be lost. It does not need to support multithreading and the database file will only be accessed by a single instance of the program. Only needs to run on Windows. These are the solutions I've thought of so far:
SQL Databases
Advantages
Easy to implement, since lots of libraries are available
Disadvantages
Server is in a different process, therefore possibly slow inter-process communication
Necessity of parsing SQL queries
Built for multithreaded environments, so lots of unnecessary synchronization
Rows can't be directly accessed using pointers but need to be copied at least twice per change
Unnecessary delays on the UPDATE query, since the whole table needs to be searched and the WHERE clause checked
These were just a few from the top of my head, there might be a lot more
Memory Mapped Files
Advantages
Direct memory mapping, so direct pointer access possible
Very fast compared to databases
Disadvantages
Forceful termination could lead to a whole page not being written
Lots of code (I don't actually mind that)
No forced synchronization possible
Increasing file size might take a lot of time
C++ vector*
Advantages
Direct pointer access possible, however, needs to manually notify of changes
Very fast compared to databases
Total programming freedom
Disadvantages
Possibly slow because of many calls to WriteFile
Lots of code (I don't actually mind that)
C++ vector with complete write every few seconds
Advantages
Direct pointer access possible
Very fast compared to databases
Total programming freedom
Disadvantages
Lots of unchanged data being rewritten to file, alternatively lots of RAM wasted on preventing unnecessary writes
Inaccessibility during writes, or lots of RAM wasted on a copy
Could lose multiple seconds worth of data
Multiple threads and therefore synchronization needed
*Basically, a wrapper class that only exposes per row read/write functionality of a vector OR allows direct write to memory, but relies on the caller to notify of changes, all reads are done from a copy in memory, all writes are done to a copy in memory and the file itself on a per-command basis
Also, is it possible to write to different parts of a file without flushing, and then flushing all changes at once with a guarantee that the file will be written either completely or not at all even in case of a forced termination during write? All I can think of is the following workflow:
Duplicate target file on startup, then for every set of data:
Write all changes to duplicate -> Flush by replacing original with duplicate
However, I feel like this would be a horrible waste of hard disk space for big files.
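The "write to a duplicate, then replace the original" workflow can be sketched as follows. This is only a sketch under assumptions: the function names and the ".tmp" suffix are hypothetical, and whether the final rename is truly atomic depends on the platform and filesystem (POSIX rename is atomic; on Windows this should be verified for the standard library in use). The stream flush only hands data to the operating system, so surviving a power loss would additionally need an OS-level flush such as FlushFileBuffers or fsync.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

namespace fs = std::filesystem;

// Writes `contents` next to `target` and then renames it over `target`,
// so a reader sees either the complete old file or the complete new one.
void replace_file_atomically(const fs::path& target, const std::string& contents) {
    assert(!target.empty());
    fs::path tmp = target;
    tmp += ".tmp";  // hypothetical naming scheme for the duplicate

    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        if (!out) throw std::runtime_error("cannot open temporary file");
        out.write(contents.data(), static_cast<std::streamsize>(contents.size()));
        out.flush();
        if (!out) throw std::runtime_error("write to temporary file failed");
    }  // stream is closed here; data is with the OS, not necessarily on disk

    // A forced termination before this line leaves the original untouched.
    fs::rename(tmp, target);
}

// Small helper that reads a whole file into a string.
std::string read_file(const fs::path& p) {
    std::ifstream in(p, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

The disk-space concern in the question is real: this approach needs room for a second full copy while writing. An append-only change log that is compacted occasionally is the usual alternative when files are large.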
Thanks in advance for any input!

CPU utilization degradation over time [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I have a multi-threaded process. Each thread is CPU bound (performs calculations) and also uses a lot of memory. The process starts with 100% cpu utilization according to resource monitor, but after several hours, cpu utilization starts to degrade, slowly. After 24 hours, it's on 90-95% and falling.
The question is - what should I look for, and what best-known-methods can I use to debug this?
Additional info:
I have enough RAM - most of it is unused at any given moment.
According to perfmon - memory doesn't grow (so I don't think it's leaking).
The code is a mix of .Net and native c++, with some data marshaling back and forth.
I saw this on several different machines (servers with 24 logical cores).
One thing I saw in perfmon - Modified Page List Bytes indicator increases over time as CPU utilization degrades.
Edit 1
One of the third party libraries that is used is openfst. Looks like it's very related to some mis-usage of that library.
Specifically, I noticed that I have the following warnings:
warning LNK4087: CONSTANT keyword is obsolete; use DATA
Edit 2
Since the question is closed, and wasn't reopened, I will write my findings and how the issue was solved in the body of the question (sorry) for future users.
Turns out there is an openfst.def file that defines all the openfst FLAGS_* symbols to be used by consuming applications/dlls. I had to fix those to use the keyword "DATA" instead of "CONSTANT" (CONSTANT is obsolete because it's risky - more info: https://msdn.microsoft.com/en-us/library/aa271769(v=vs.60).aspx).
After that - no more degradation in CPU utilization was observed. No more rise in "modified page list bytes" indicator. I suspect that it was related to the default values of the FLAGS (specifically the garbage collection flags - FLAGS_fst_default_cache_gc) which were non deterministic because of the misusage of CONSTANT keyword in openfst.def file.
Conclusion: Understand your warnings! Eliminate as many of them as you can!
Thanks.

For a non-obvious issue like this, you should also use a profiler that actually samples the underlying hardware counters in the CPU. Most profilers that I’m familiar with use kernel supplied statistics and not the underlying HW counters. This is especially true in Windows. (The reason is in part legacy, and in part that Windows wants its kernel statistics to be independent of hardware. PAPI APIs attempt to address this but are still relatively new.)
One of the best profilers is Intel’s VTune. Yes, I work for Intel but the internal HPC people use VTune as well. Unfortunately, it costs. If you’re a student, there are discounts. If not, there is a trial period.
You can find a lot of optimization and performance issue diagnosis information at software.intel.com. Here are pointers for optimization and for profiling. Even if you are not using an x86 architecture, the techniques are still valid.
As to what might be the issue, a degradation that slow is strange.
How often do you use new memory or access old? At what rate? If the rate is very slow, you might still be running into a situation where you are slowly using up a resource, e.g. pages.
What are your memory access patterns? Does it change over time? How rapidly? Perhaps your memory access patterns over time are spreading, resulting in more cache misses.
Perhaps your partitioning of the problem space is such that you have entered a new computational domain and there is no real pathology.
Look at whether there are periodic maintenance activities that take place over a longer interval, though this would result in a periodic degradation, say every 24 hours. This doesn’t sound like your situation, since what you are experiencing is a gradual degradation.
If you are using an x86 architecture, consider submitting a question in an Intel forum (e.g. "Intel® Clusters and HPC Technology" and "Software Tuning, Performance Optimization & Platform Monitoring").
Let us know what you ultimately find out.

Wants to create an application storing data in memory. But i dont want the data to be lost even if my app crashes [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I want to create an application that stores data in memory, but I don't want the data to be lost even if my app crashes.
What concept should I use?
Should I use shared memory, or is there some other concept that suits my requirement better?

You are asking for persistence (or even orthogonal persistence) and/or for application checkpointing.
This is not possible (at least thru portable C++ code) in the general case for some arbitrary existing C++ code, e.g. because of ASLR, because of pointers on -or to- the local call stack, because of multi-threading, and because of external resources (sockets, opened files, ...), because the current continuation cannot be accessed, restored and handled in standard C++.
However, you might design your application with persistence in mind. This is a strong architectural requirement. You could for instance have every class contain some dumping method and its load factory function. Beware of shared pointers, and take into account that you could have cyclic references. Study garbage collection algorithms (e.g. in the Gc HandBook) which are similar to those needed for persistence (a copying GC is quite similar to a checkpointing algorithm).
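As a rough illustration of "every class has a dumping method and a load factory function", here is a minimal sketch. The class Account and its text format are hypothetical. It deliberately ignores the hard parts mentioned above (shared pointers, cyclic references), which would need object identifiers and a table of already-dumped objects.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical persistable class: one dump method plus one static load factory.
class Account {
public:
    Account(std::string owner, std::int64_t balance)
        : owner_(std::move(owner)), balance_(balance) {}

    // Writes the object as "<owner length> <owner bytes> <balance>\n".
    // The length prefix lets the owner contain spaces.
    void dump(std::ostream& out) const {
        out << owner_.size() << ' ' << owner_ << ' ' << balance_ << '\n';
    }

    // Rebuilds an object from the format written by dump().
    static Account load(std::istream& in) {
        std::size_t len = 0;
        if (!(in >> len)) throw std::runtime_error("missing owner length");
        in.get();  // skip the single separator space
        std::string owner(len, '\0');
        in.read(owner.data(), static_cast<std::streamsize>(len));
        std::int64_t balance = 0;
        if (!(in >> balance)) throw std::runtime_error("missing balance");
        return Account(std::move(owner), balance);
    }

    const std::string& owner() const { return owner_; }
    std::int64_t balance() const { return balance_; }

private:
    std::string owner_;
    std::int64_t balance_;
};
```

Dumping several objects one after another into the same stream and loading them back in the same order works because each record is self-delimiting.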
Look also into serialization libraries (like libs11n). You might also consider persisting into a textual format (e.g. JSON), perhaps inside some SQLite database (or some real database like PostgreSQL or MongoDB). I am doing this (in C) in my monimelt software.
You might also consider checkpointing libraries like BLCR
The important thing is to think about persistence & checkpointing very early at design time. Thinking of your application as some specialized bytecode interpreter or VM might help (notably if you want to persist continuations, or some form of "call stack").
You could fork your process (assuming you are on Linux or Posix) before persistence. Hence, persistence time does not matter that much (e.g. if you persist every hour or every ten minutes).
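The fork idea can be sketched like this for POSIX systems. The function names, the vector-of-int state and the one-value-per-line file format are all assumptions made for the example. The child sees a copy-on-write snapshot of the state as it was at the time of the fork, so the parent can keep mutating its own copy while the child writes. Note that fork in a multi-threaded process needs extra care, which this sketch does not address.

```cpp
#include <cassert>
#include <fstream>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

// Forks; the child writes its snapshot of `state` to `path` and exits,
// while the parent returns immediately with the child's pid (-1 on failure).
pid_t snapshot_in_child(const std::vector<int>& state, const std::string& path) {
    assert(!path.empty());
    pid_t pid = ::fork();
    if (pid == 0) {
        std::ofstream out(path, std::ios::trunc);
        for (int v : state) out << v << '\n';
        out.close();
        // _exit avoids running the parent's atexit handlers in the child.
        ::_exit(out ? 0 : 1);
    }
    return pid;
}

// Waits for the snapshot child and reports whether it exited successfully.
bool wait_for_snapshot(pid_t pid) {
    int status = 0;
    if (pid <= 0 || ::waitpid(pid, &status, 0) != pid) return false;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

In a long-running program the parent would typically reap the child later (or from a SIGCHLD handler) rather than block on it right away.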
Some language implementations are able to persist their entire state (notably their heap), e.g. SBCL (a good Common Lisp implementation) with its save-lisp-and-die, or Poly/ML -an ML dialect- with its SaveState, or Squeak (a Smalltalk implementation).
See also this answer & that one. J.Pitrat's blog has a related entry: CAIA as a sleeping beauty.
Persistency of data with code (e.g. vtables of objects, function pointers) might be technically difficult. dladdr(3) -with dlsym- might help (and, if you are able to code machine-specific things, consider the old getcontext(3), but I don't recommend that). Avoid name mangling (for dlsym) by declaring extern "C" all code related to persistence. If you want to persist some data and be able to restart from it with a slightly modified program (e.g. a small bugfix) things are much more complex.
More pragmatically, you could have a class representing your entire persistable state, and implement methods to persist (and reload it). You would then persist only at certain steps of your algorithm (e.g. if you have a main loop or an event loop, at start of that loop). You probably don't want to persist too often (e.g. because of the time and disk space required to persist), e.g. perhaps every ten minutes. You might perhaps consider some transaction log if it fits in the overall picture of your application.
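A small sketch of "persist only at the start of the loop, and not too often": the helper below is hypothetical (the name Checkpointer, the callback and the injected clock values are assumptions), and the callback stands in for whatever method dumps your persistable state.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <utility>

// Hypothetical helper that runs a persist callback at most once per interval.
class Checkpointer {
public:
    using Clock = std::chrono::steady_clock;

    Checkpointer(std::chrono::seconds interval, std::function<void()> persist,
                 Clock::time_point start = Clock::now())
        : interval_(interval), persist_(std::move(persist)), last_(start) {
        assert(persist_ && "a persist callback is required");
    }

    // Call at the start of every loop iteration. Persists and returns true
    // only when at least `interval` has passed since the last checkpoint.
    bool tick(Clock::time_point now = Clock::now()) {
        if (now - last_ < interval_) return false;
        persist_();
        last_ = now;
        return true;
    }

private:
    std::chrono::seconds interval_;
    std::function<void()> persist_;
    Clock::time_point last_;
};
```

In an event loop this would be a single `checkpointer.tick();` call at the top of each iteration, with an interval such as `std::chrono::minutes(10)`.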

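Use memory-mapped files: mmap (https://en.wikipedia.org/wiki/Mmap). Allocate all your structures inside the mapped memory region. With a shared, file-backed mapping, the operating system still writes the modified pages back to the file if your app crashes; an operating system crash or a power loss can still lose changes that have not been synced (e.g. with msync).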
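A minimal POSIX sketch of this approach, assuming a single fixed-layout struct as the whole state (the names Counters and MappedCounters are hypothetical). The mapping is MAP_SHARED so that stores go to the file's pages, and sync() calls msync for the cases where the data must reach the disk at a known point.

```cpp
#include <cassert>
#include <cstdint>
#include <fcntl.h>
#include <stdexcept>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical fixed-layout state kept directly in the mapped file.
struct Counters {
    std::uint64_t requests;
    std::uint64_t errors;
};

class MappedCounters {
public:
    explicit MappedCounters(const char* path) {
        fd_ = ::open(path, O_RDWR | O_CREAT, 0600);
        if (fd_ < 0) throw std::runtime_error("open failed");
        // A new file is extended with zero bytes; an existing one keeps its data.
        if (::ftruncate(fd_, static_cast<off_t>(sizeof(Counters))) != 0) {
            ::close(fd_);
            throw std::runtime_error("ftruncate failed");
        }
        void* p = ::mmap(nullptr, sizeof(Counters), PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd_, 0);
        if (p == MAP_FAILED) {
            ::close(fd_);
            throw std::runtime_error("mmap failed");
        }
        data_ = static_cast<Counters*>(p);
    }

    ~MappedCounters() {
        ::munmap(data_, sizeof(Counters));
        ::close(fd_);
    }

    MappedCounters(const MappedCounters&) = delete;
    MappedCounters& operator=(const MappedCounters&) = delete;

    Counters& get() {
        assert(data_ != nullptr);
        return *data_;
    }

    // Blocks until the modified pages have been written to the file.
    void sync() {
        if (::msync(data_, sizeof(Counters), MS_SYNC) != 0)
            throw std::runtime_error("msync failed");
    }

private:
    int fd_ = -1;
    Counters* data_ = nullptr;
};
```

Structures containing pointers or standard containers cannot simply be placed in the mapping like this, since their internal pointers would refer to ordinary heap memory.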

Losing Power While Writing to a File [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
I am working on a project for a device that must constantly write information to a storage device. The device will need to be able to lose power but accurately retain the information that it collects up until the time that power is lost.
I've been looking for answers for what would happen if power was lost on a system like this. Are there any issues with losing power and not closing the file? Is data corruption a possibility?
Thank you

The whole subject of "safely storing data when power may be cut" is quite hard to solve in a generic way - the exact solution will depend on the exact type of data, rate data is stored, etc.
To retain information "while power is off", the data needs to be stored in non-volatile memory (either flash, eeprom or battery backed RAM). Again, this is a hardware solution.
Can you "lose data written to file"? Yes, it's entirely possible that the file may not be correctly written if the power to the file-storage device is lost when the system is in the middle of writing.
The answer to this really depends on how much freedom you have to build/customise the hardware to cope with this situation. Systems that are designed for high reliability will have a way to detect power-cuts and still run for several seconds (sometimes a lot more) after a power-cut, and when the power-cut happens, they go into "save all data, and shut down nicely" mode. Typically, this is done by using an uninterruptible power supply (UPS), which has an alarm mechanism that signals that the external power is gone; when the system receives this signal, it starts an emergency shutdown.
If you don't have any way to connect a UPS and shut down in an orderly fashion, then there are other features, such as a journaling filesystem, that can give you a good set of data, but it's not guaranteed to give you complete data. You also need to design your file format such that "cut off data" doesn't completely ruin the file. The classic example is a zip file, which stores the "directory" (list of contents) at the very end of the file, so you can have 99.9% of the file complete, but the missing 0.1% is what you need to decode all the content.

Yes, data corruption is definitely a possibility.
However there are a few guidelines to minimize it in a purely software way:
Use a journalling filesystem and put it in its maximum journal mode (eg. for ext3/ext4 use data=journal, no less).
Avoid software buffers. If you don't have a choice, flush them ASAP.
Synchronize the filesystem ASAP (either through the sync/syncfs/fsync system calls, or using the sync mount option).
Never overwrite existing data, just append new data to existing files.
Be prepared to deal with incomplete data records.
This way, even if you lose data it will only be the last few bytes written, and the filesystem in general won't be corrupt.
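Several of these guidelines (append only, sync early, cope with incomplete records) can be combined in a small POSIX sketch. The record layout, the function names, the size limit and the toy checksum are all assumptions made for the example; a real format would use a proper CRC and a fixed byte order instead of the host's.

```cpp
#include <cassert>
#include <cstdint>
#include <fcntl.h>
#include <fstream>
#include <stdexcept>
#include <string>
#include <sys/types.h>
#include <unistd.h>
#include <utility>
#include <vector>

constexpr std::uint32_t kMaxRecordSize = 1u << 20;  // hypothetical sanity limit

// Toy checksum; a stand-in for a real CRC.
std::uint32_t checksum(const std::string& s) {
    std::uint32_t sum = 0;
    for (unsigned char c : s) sum = sum * 31u + c;
    return sum;
}

// Appends one record as [length][checksum][payload] and syncs it to disk.
void append_record(const char* path, const std::string& payload) {
    assert(payload.size() <= kMaxRecordSize);
    int fd = ::open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (fd < 0) throw std::runtime_error("open failed");
    std::uint32_t header[2] = {static_cast<std::uint32_t>(payload.size()),
                               checksum(payload)};
    std::string buf(reinterpret_cast<const char*>(header), sizeof header);
    buf += payload;
    bool ok = ::write(fd, buf.data(), buf.size()) ==
                  static_cast<ssize_t>(buf.size()) &&
              ::fsync(fd) == 0;
    ::close(fd);
    if (!ok) throw std::runtime_error("append failed");
}

// Returns all complete records and stops at a truncated or corrupt tail.
std::vector<std::string> read_records(const char* path) {
    std::ifstream in(path, std::ios::binary);
    std::vector<std::string> records;
    std::uint32_t header[2];
    while (in.read(reinterpret_cast<char*>(header), sizeof header)) {
        if (header[0] > kMaxRecordSize) break;                 // implausible length
        std::string payload(header[0], '\0');
        if (!in.read(payload.data(), header[0])) break;        // cut-off payload
        if (checksum(payload) != header[1]) break;             // damaged record
        records.push_back(std::move(payload));
    }
    return records;
}
```

Because existing bytes are never overwritten, a power cut during a write can at worst leave one incomplete record at the end, which the reader skips.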
You'll notice that I assumed a Unix-y OS. As far as I know, Windows doesn't give you enough control to enforce that kind of constraints on the filesystem.

seeking recommendations for simple in-memory DB server (no persistence needed) [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 8 years ago.
Should support multiple connections, preferably via ODBC. The clients will be running as separate processes on the same machine. No need for persistence, as the clients will handle the persistence elsewhere. The clients are written in C++ if it matters.
The data is quite simple, it is a set of unrelated bi-directional maps. The access is either directly by a value or by a range (between X and Y), no updates. We don't actually need SQL here, so non-SQL solutions can also be considered.
The client application is multi-process and can run on several machines. Each machine should have a local copy of such DB, which is updated against the central store by its local clients.
Multiple Edits:
the platform is Linux
RAM disk is not an option for security reasons - we don't want that anyone with access to the machine would be able to view the data
the data may be persisted only in encrypted form, so the solution should either not persist the data at all or allow a user-defined filter/plugin for persistence.

Just because of my familiarity with it, I would go with MySQL. To use it as an in-memory database, use MEMORY as the table type (storage engine). Redis is an in-memory NoSQL database that would probably be a perfect fit for this (it runs in memory, with disk writes only for persistence, which can be disabled).

Any particular reason for not using SQLite with a RAM db opened?

RAM disk is not an option for security reasons - we don't want that anyone with access to the machine would be able to view the data
You're out of luck. Anyone with access to the machine can view the data out of /proc/$PID/mem anyway.
If you're talking about non-root access, then use the /tmp/$directory/ method with chmod 700.

Here is a trick you can use under Linux, it is called "Lazy unmount".
Mount a tmpfs somewhere
Start up some process(es) to use it, which chdir() into that directory. You can use a mysql instance; mysql always does a chdir to its data directory.
After the process has started up successfully, umount the tmpfs with the -l (lazy) option.
Now the tmpfs still exists and will continue to exist as long as process(es) are accessing it, but it cannot be accessed by unrelated processes any more as it is no longer present in its mount-point.
Note that this in no way stops root from obtaining the data in the tmpfs, just makes it a little harder.
Also note that it may be swapped, so you should disable swap (or use encrypted swap) if you absolutely need it to be non-persistent.

Try Boost.MultiIndex. Not an obvious choice but it's based on relational DB concepts.
The concept of multi-indexing over the same collection of elements is borrowed from relational database terminology and allows for the specification of complex data structures in the spirit of multiply indexed relational tables where simple sets and maps are not enough. A wide selection of indices is provided, modeled after analogous STL containers like std::set, std::list and hashed sets.
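To show what "several indices over the same elements" buys you for the bi-directional maps in the question, here is a sketch that uses only the standard library. It is not Boost.MultiIndex; the class BiMap and its int/string element types are hypothetical, and it simply keeps two ordered std::map indices in step. Boost.MultiIndex provides the same idea without storing each element twice.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stdlib-only stand-in for a container with two ordered unique indices.
class BiMap {
public:
    // Inserts the pair only if neither side is already present.
    bool insert(int id, const std::string& name) {
        if (by_id_.count(id) || by_name_.count(name)) return false;
        by_id_.emplace(id, name);
        by_name_.emplace(name, id);
        return true;
    }

    // Direct lookups in either direction; nullptr when the key is absent.
    const std::string* find_by_id(int id) const {
        auto it = by_id_.find(id);
        return it == by_id_.end() ? nullptr : &it->second;
    }
    const int* find_by_name(const std::string& name) const {
        auto it = by_name_.find(name);
        return it == by_name_.end() ? nullptr : &it->second;
    }

    // Returns all pairs with lo <= id <= hi, in ascending id order.
    std::vector<std::pair<int, std::string>> range_by_id(int lo, int hi) const {
        assert(lo <= hi);
        std::vector<std::pair<int, std::string>> out;
        for (auto it = by_id_.lower_bound(lo);
             it != by_id_.end() && it->first <= hi; ++it) {
            out.emplace_back(it->first, it->second);
        }
        return out;
    }

private:
    std::map<int, std::string> by_id_;
    std::map<std::string, int> by_name_;
};
```

Since the question states there are no updates, keeping the two indices consistent only has to be handled on insert.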

Use mysql with datadir= a tmpfs you've mounted for the purpose. You'll need to cook up some startup script which installs the database (using mysql_install_db or something) at boot-time, of course, as you'll lose all the data.

memcached can be a viable solution. It is a key-value store that can be set to hold values for a certain amount of time, is scalable and is simple to set up and start using. It also runs in all sorts of environments. Here's the wiki site for more information, especially since the main page isn't nearly as helpful.
