Fast, binary database alternative [closed] - c++

I want to implement a fast database alternative that only needs to handle binary data.
To specify, I want something close to a database that will be securely stored even in case of a forced termination (task manager) during execution, whilst also being accessed directly from memory in C++. Like a vector of structs that is mirrored to the hard disk. It should be able to handle hundreds of thousands of read accesses and at least 1000 write accesses per second. In case of a forced termination, at most the last command can be lost. It does not need to support multithreading and the database file will only be accessed by a single instance of the program. Only needs to run on Windows. These are the solutions I've thought of so far:
SQL Databases
Advantages
Easy to implement, since lots of libraries are available
Disadvantages
Server runs in a different process, therefore possibly slow inter-process communication
Necessity of parsing SQL queries
Built for multithreaded environments, so lots of unnecessary synchronization
Rows can't be directly accessed using pointers but need to be copied at least twice per change
Unnecessary delays on the UPDATE query, since the whole table needs to be searched and the WHERE case checked
These were just a few from the top of my head, there might be a lot more
Memory Mapped Files
Advantages
Direct memory mapping, so direct pointer access possible
Very fast compared to databases
Disadvantages
Forceful termination could lead to a whole page not being written
Lots of code (I don't actually mind that)
No forced synchronization possible
Increasing file size might take a lot of time
C++ vector*
Advantages
Direct pointer access possible, however, needs to manually notify of changes
Very fast compared to databases
Total programming freedom
Disadvantages
Possibly slow because of many calls to WriteFile
Lots of code (I don't actually mind that)
C++ vector with complete write every few seconds
Advantages
Direct pointer access possible
Very fast compared to databases
Total programming freedom
Disadvantages
Lots of unchanged data being rewritten to file, alternatively lots of RAM wasted on preventing unnecessary writes
Inaccessibility during writes, or lots of RAM wasted on a copy
Could lose multiple seconds worth of data
Multiple threads and therefore synchronization needed
*Basically, a wrapper class that only exposes per row read/write functionality of a vector OR allows direct write to memory, but relies on the caller to notify of changes, all reads are done from a copy in memory, all writes are done to a copy in memory and the file itself on a per-command basis
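A minimal sketch of the wrapper described in the footnote above, assuming fixed-size, trivially copyable rows and a fixed row count. `MirroredVector` and `Account` are hypothetical names, and standard file streams stand in for direct `WriteFile` calls. Reads come from the copy in memory; every write updates that copy and then rewrites only the affected row in the file.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical row type used only for illustration.
struct Account {
    int id;
    double balance;
};

// Hypothetical wrapper: all rows live in RAM, each write is mirrored to disk.
template <typename Row>
class MirroredVector {
    static_assert(std::is_trivially_copyable<Row>::value,
                  "rows are copied to disk byte for byte");

public:
    MirroredVector(const std::string& path, std::size_t rowCount)
        : rows_(rowCount) {
        const auto bytes =
            static_cast<std::streamsize>(rowCount * sizeof(Row));
        std::ifstream in(path, std::ios::binary);
        if (in) {
            // Load whatever an earlier run left behind.
            in.read(reinterpret_cast<char*>(rows_.data()), bytes);
            in.close();
        } else {
            // First run: create the file with zero-initialized rows.
            std::ofstream out(path, std::ios::binary);
            out.write(reinterpret_cast<const char*>(rows_.data()), bytes);
        }
        file_.open(path, std::ios::binary | std::ios::in | std::ios::out);
    }

    // Reads never touch the file.
    const Row& get(std::size_t i) const { return rows_.at(i); }

    // Update the in-memory copy, then rewrite just this row on disk.
    void set(std::size_t i, const Row& value) {
        assert(i < rows_.size());
        rows_[i] = value;
        file_.seekp(static_cast<std::streamoff>(i * sizeof(Row)));
        file_.write(reinterpret_cast<const char*>(&value), sizeof(Row));
        file_.flush();  // hands the bytes to the OS, not necessarily the disk
    }

private:
    std::vector<Row> rows_;
    std::fstream file_;
};
```

Once `flush()` has handed the bytes to the operating system, killing the process should not lose them, which matches the "at most the last command" requirement. Surviving a power failure would additionally need something like `FlushFileBuffers` on Windows, which this sketch does not do.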
Also, is it possible to write to different parts of a file without flushing, and then flushing all changes at once with a guarantee that the file will be written either completely or not at all even in case of a forced termination during write? All I can think of is the following workflow:
Duplicate target file on startup, then for every set of data:
Write all changes to duplicate -> Flush by replacing original with duplicate
However, I feel like this would be a horrible waste of hard disk space for big files.
Thanks in advance for any input!
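The duplicate-then-replace workflow above can be sketched with C++17's `std::filesystem`. This is only an illustration under stated assumptions: `replaceFileAtomically` is a hypothetical helper name, the temporary file lives next to the target, and whether the final rename is truly atomic depends on the operating system and file system. It also does not force the data onto the physical disk, and it does not address the disk-space concern.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Writes `data` to a temporary file, then swaps it in place of `target`.
// A reader should see either the complete old file or the complete new one.
bool replaceFileAtomically(const fs::path& target,
                           const std::vector<char>& data) {
    assert(!target.empty());
    fs::path temp = target;
    temp += ".tmp";  // assumption: no other file uses this name
    {
        std::ofstream out(temp, std::ios::binary | std::ios::trunc);
        if (!out) return false;
        out.write(data.data(), static_cast<std::streamsize>(data.size()));
        out.flush();
        if (!out) return false;
    }  // the stream is closed here, before the rename

    std::error_code ec;
    fs::rename(temp, target, ec);  // replaces an existing target file
    return !ec;
}
```

If the process is killed before the rename, the original file is untouched and only the `.tmp` file is left behind to clean up on the next start.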

Related

How to overwrite all the free disk space with 0x00? [closed]

How can I overwrite all free disk space with zeros, like the cipher command in Windows? For example:
cipher /wc:\
This will overwrite the free disk space in three passes. How can I do this in C or C++? (I want to do this in one pass and as fast as possible.)
You can create a set of files and write random bytes to them until the available disk space is filled. These files should be removed before exiting the program.
The files must be created on the device you wish to clean.
Multiple files may be required on some file systems, due to file size limitations.
It is important to use different non repeating random sequences in these files to avoid file system compression and deduplicating strategies that may reduce the amount of disk space actually written.
Note also that the OS may have quota systems that will prevent you from filling available disk space and may also show erratic behavior when disk space runs out for other processes.
Removing the files may cause the OS to skip the cache flushing mechanism, causing some blocks to not be written to disk. A sync() system call or equivalent might be required. Further syncing at the hardware level might be delayed, so waiting for some time before removing the files may be necessary.
Repeating this process with a different random seed lowers the odds of hardware recovery through surface analysis with advanced forensic tools. These tools are not perfect, especially when recovery would be a life saver for a lost Bitcoin wallet owner, but may prove effective in other more problematic circumstances.
Using random bytes has a double purpose:
prevent some file systems from optimizing the blocks by compressing or sharing them instead of writing them to the media, since only blocks that are actually written overwrite the existing data.
increase the difficulty in recovering previously written data with advanced hardware recovery tools, just like these security envelopes that have random patterns printed on the inside to prevent exposing the contents of the letter by simply scanning the envelope over a strong light.
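A rough sketch of the fill step described above, assuming a single file and a caller-supplied byte cap so that it can be tried safely. `writeRandomFile` is a hypothetical name. A real wipe would pass a very large cap, let the loop stop when the write fails because the disk is full, and use several files where the file system limits file size. `std::mt19937_64` is not a cryptographic generator; it is used here only to produce non-repeating, incompressible data.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <fstream>
#include <random>
#include <string>
#include <vector>

// Writes up to `maxBytes` of pseudo-random data to `path` in 64 KiB blocks.
// Returns the number of bytes the stream accepted before the cap was reached
// or a write failed (for example because the disk is full).
std::uint64_t writeRandomFile(const std::string& path,
                              std::uint64_t maxBytes,
                              std::uint64_t seed) {
    assert(!path.empty());
    std::mt19937_64 rng(seed);
    std::vector<std::uint64_t> block(64 * 1024 / sizeof(std::uint64_t));
    const std::uint64_t blockBytes = block.size() * sizeof(std::uint64_t);

    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    std::uint64_t written = 0;
    while (out && written < maxBytes) {
        for (auto& word : block) word = rng();  // fresh data for every block
        const std::uint64_t chunk = std::min(maxBytes - written, blockBytes);
        out.write(reinterpret_cast<const char*>(block.data()),
                  static_cast<std::streamsize>(chunk));
        out.flush();
        if (out) written += chunk;
    }
    return written;
}
```

The caller would delete the file afterwards, ideally after asking the OS to sync as discussed above.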

What to do without generating instances with "new" [closed]

In some video games, I find that every time a new character is created, a factory method is used to create it, like this:
class CharacterEngine
{
public:
    static Character* CreateCharacter(string Name, Weapons InitialWeapons)
    {
        return new Character(Name, InitialWeapons);
    }
};
//...
Now, if I have 100000000 characters (very many, e.g. like simulated particles), heap allocation like this may fail on computers with little RAM. What is your solution to this problem?
Edit
What other methods or designs do you know can change or replace the factory method/class?
Do you actually have 100K characters? And are you actually in an environment in which you are memory constrained and allocation fails? Even if Character is a whopping 1KB in size, you'd be looking at 100MB consumed, which isn't that much, even for feature phones.
But perhaps you're worried that you might actually have memory to spare, but fragmentation is so high you can't use it. That's a fairer concern, and one usually relevant to games. Perhaps take a look at the object pool pattern. Also, taking into consideration the large number of characters you're speaking of, perhaps the flyweight pattern might also help!
Finally, running out of memory isn't like other program errors such as losing a TCP connection or facing a disk error. If you need to allocate the 100,001st character and there's no more memory for it, you can't simply not allocate it, show an error to the user, or try again later. You can't go on without it, as it were. So don't: just bail out of the program and perhaps do whatever cleaning up is required to not lose too much game state etc. Have a read of "malloc never fails" as well.
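To make the object pool suggestion concrete, here is a minimal sketch. `Character` and `CharacterPool` are hypothetical stand-ins, not the asker's classes. All slots are allocated once up front in one contiguous block, and a free list of indices hands them out, so creating a character never calls `new` per object and cannot fragment the heap.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the asker's Character class.
struct Character {
    std::string name;
    int health = 100;
};

// Fixed-capacity pool: storage is reserved once, slots are reused.
class CharacterPool {
public:
    explicit CharacterPool(std::size_t capacity) : slots_(capacity) {
        free_.reserve(capacity);
        // Push indices in reverse so slot 0 is handed out first.
        for (std::size_t i = capacity; i > 0; --i) free_.push_back(i - 1);
    }

    // Returns nullptr when every slot is in use.
    Character* acquire(std::string name) {
        if (free_.empty()) return nullptr;
        const std::size_t i = free_.back();
        free_.pop_back();
        slots_[i].name = std::move(name);
        slots_[i].health = 100;
        return &slots_[i];
    }

    // Makes the slot available again; the pointer must come from acquire().
    void release(Character* c) {
        assert(c >= slots_.data() && c < slots_.data() + slots_.size());
        free_.push_back(static_cast<std::size_t>(c - slots_.data()));
    }

    std::size_t available() const { return free_.size(); }

private:
    std::vector<Character> slots_;   // never resized, so pointers stay valid
    std::vector<std::size_t> free_;  // indices of unused slots
};
```

Because the capacity is fixed, exhaustion shows up as a `nullptr` from `acquire` at a point the game can plan for, instead of as an allocation failure somewhere deep in the code.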
The heap memory is obviously limited, but the limit is in practice not that small (at least gigabytes on current PCs).
And memory consumption is not the biggest problem in a game. If you have many characters, you might need to deal with interactions between them, and that could be more difficult (e.g. determining the set of characters close to a given one could be more challenging).
You should read more about memory management, virtual address space, smart pointers, reference counting, RAII, circular references, weak references, hash consing.
Notice that the heap is global to your program & process (it is not the property of some particular class or code chunk, but of your entire program).
The heap allocation routines (related to new & delete) are generally implemented on top of some operating system primitives (often system calls) that grow the virtual address space. On Linux, see mmap(2). The operating system could provide some means to query your virtual address space (on Linux, see proc(5) and, for a process of pid 1234, the /proc/1234/maps pseudo-file).
I recommend reading a good book on garbage collection, such as the GC handbook. It teaches you concepts and terminology which are relevant for C++ programming (notably in games). In some sense, you may want to implement your own GC for your game.
C++ has some allocator concept and standard containers know about that.
Read also some Introduction to Algorithms.
heap allocation like this may fail to work on computers with small RAM.
Then either improve your program to use less memory, or get a bigger computer. Perhaps consider some distributed computing approach (e.g. cloud computing), like in MMORPGs.
What other methods or designs do you know can change or replace the factory method/class?
They won't change the consumed memory much, because in your design every character is represented by its own unique C++ object. So that does not matter much.
Assuming you have enough local disk space to store the information of all characters, you can mmap one or more files that store all the character data, and create a character object from data in the file(s) only when needed.
If you have neither enough memory nor disk space to store data of all characters locally, then it becomes a much more difficult problem -- you might need to assign to every character a URI and load it from the network...
EDIT: Of course, after updating the character data, you'd need to write it back to the corresponding file. And for performance sake, you might want to implement some caching mechanism so frequently used characters don't need to be read & written back every time they are used.

Wants to create an application storing data in memory. But i dont want the data to be lost even if my app crashes [closed]

I want to create an application that stores data in memory, but I don't want the data to be lost even if my app crashes.
What concept should I use?
Should I use shared memory, or is there some other concept that suits my requirement better?
You are asking for persistence (or even orthogonal persistence) and/or for application checkpointing.
This is not possible (at least through portable C++ code) in the general case for some arbitrary existing C++ code, e.g. because of ASLR, because of pointers on -or to- the local call stack, because of multi-threading, because of external resources (sockets, opened files, ...), and because the current continuation cannot be accessed, restored and handled in standard C++.
However, you might design your application with persistence in mind. This is a strong architectural requirement. You could for instance have every class contain some dumping method and its load factory function. Beware of shared pointers, and take into account that you could have cyclic references. Study garbage collection algorithms (e.g. in the Gc HandBook) which are similar to those needed for persistence (a copying GC is quite similar to a checkpointing algorithm).
Look also into serialization libraries (like libs11n). You might also consider persisting into a textual format (e.g. JSON), perhaps inside some SQLite database (or some real database like PostgreSQL or MongoDB...). I am doing this (in C) in my monimelt software.
You might also consider checkpointing libraries like BLCR
The important thing is to think about persistence & checkpointing very early at design time. Thinking of your application as some specialized bytecode interpreter or VM might help (notably if you want to persist continuations, or some form of "call stack").
You could fork your process (assuming you are on Linux or Posix) before persistence. Hence, persistence time does not matter that much (e.g. if you persist every hour or every ten minutes).
Some language implementations are able to persist their entire state (notably their heap), e.g. SBCL (a good Common Lisp implementation) with its save-lisp-and-die, or Poly/ML -an ML dialect- with its SaveState, or Squeak (a Smalltalk implementation).
See also this answer & that one. J.Pitrat's blog has a related entry: CAIA as a sleeping beauty.
Persistency of data with code (e.g. vtables of objects, function pointers) might be technically difficult. dladdr(3) -with dlsym- might help (and, if you are able to code machine-specific things, consider the old getcontext(3), but I don't recommend that). Avoid name mangling (for dlsym) by declaring extern "C" all code related to persistence. If you want to persist some data and be able to restart from it with a slightly modified program (e.g. a small bugfix) things are much more complex.
More pragmatically, you could have a class representing your entire persistable state, and implement methods to persist (and reload it). You would then persist only at certain steps of your algorithm (e.g. if you have a main loop or an event loop, at start of that loop). You probably don't want to persist too often (e.g. because of the time and disk space required to persist), e.g. perhaps every ten minutes. You might perhaps consider some transaction log if it fits in the overall picture of your application.
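As an illustration of the pragmatic approach above, here is a small sketch of a class holding the whole persistable state with a save method and a load function. `AppState` and its fields are hypothetical, and the whitespace-separated text format is just one simple choice, not a recommendation over JSON or a serialization library.

```cpp
#include <cassert>
#include <cstddef>
#include <iomanip>
#include <istream>
#include <ostream>
#include <sstream>
#include <utility>
#include <vector>

// Hypothetical "entire persistable state" of an application.
struct AppState {
    int iteration = 0;
    std::vector<double> values;

    // Writes the state as text: iteration, element count, then the elements.
    void save(std::ostream& out) const {
        out << std::setprecision(17);  // enough digits to round-trip a double
        out << iteration << ' ' << values.size();
        for (double v : values) out << ' ' << v;
        out << '\n';
    }

    // Reads a state written by save(). On failure `state` is left untouched.
    static bool load(std::istream& in, AppState& state) {
        AppState loaded;
        std::size_t count = 0;
        if (!(in >> loaded.iteration >> count)) return false;
        loaded.values.resize(count);
        for (double& v : loaded.values) {
            if (!(in >> v)) return false;
        }
        state = std::move(loaded);
        return true;
    }
};
```

The program would call `save` at the start of its main loop every so often (writing to a file stream rather than the string stream used for testing) and call `load` once at startup.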
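Use memory-mapped files: mmap (https://en.wikipedia.org/wiki/Mmap). Allocate all your structures inside the mapped memory region. For a shared file mapping, the operating system writes the modified pages back to the file even when your app crashes, because they live in the kernel's page cache rather than in your process; a power failure or an OS crash can still lose them.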
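A minimal POSIX-only sketch of this suggestion, assuming a single fixed-size structure. `Counters`, `mapCounters` and `unmapCounters` are hypothetical names. On Windows the counterparts would be `CreateFileMapping` and `MapViewOfFile`, which are not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical state kept directly inside the mapped file.
// It must not contain pointers, since addresses differ between runs.
struct Counters {
    long visits;
    long errors;
};

// Maps `path` read/write as one Counters object, creating it if needed.
// Returns nullptr on failure. A new file starts out zero-filled.
Counters* mapCounters(const char* path) {
    assert(path != nullptr);
    const int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return nullptr;
    if (ftruncate(fd, sizeof(Counters)) != 0) {
        close(fd);
        return nullptr;
    }
    void* p = mmap(nullptr, sizeof(Counters), PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    return p == MAP_FAILED ? nullptr : static_cast<Counters*>(p);
}

// Flushes changes to the file and removes the mapping.
void unmapCounters(Counters* c) {
    msync(c, sizeof(Counters), MS_SYNC);
    munmap(c, sizeof(Counters));
}
```

Writes through the returned pointer land in the file without any explicit write call. One caveat: a crash in the middle of a multi-field update can leave the structure half-modified, so the data layout has to tolerate that.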

Fastest/Best way to serialize and deserialize data from database [closed]

In a few months I will start to write my bachelor-thesis. Although we only discussed the topic of my thesis very roughly, the main problem will be something like this:
A program written in C++ (more or less an HTTP server, but I guess it doesn't matter here) has to be executed to fulfill its task. There are several instances of this program running at the same time, and a load balancer takes care of distributing HTTP requests equally between all instances. Every time the program's code is changed to enhance it, or to get rid of bugs, all instances have to be restarted. This can take up to 40 minutes for one instance. As there are more than ten instances running, the restart process can take up to one work day. This is way too slow.
The presumed bottleneck is the access to the database during startup to load all necessary data (I guess it will be a MySQL database). The team leader's idea for decreasing the time needed for the startup process is to serialize the content of the database to a file, and read from this file instead of reading from the database. That would be my task. Of course the problem is to check whether there is new data in the database that is not in the file. I guess write operations are still applied to the database, not to the serialized file. My first idea is to use Apache Thrift for serialization and deserialization, as I have already worked with it and it is fast, as far as I know (maybe I will write a small Python program to take care of this). However, I have some basic questions regarding this problem:
Is it a good solution to read from a file instead of reading from the database? Is there any chance this will save time?
Would Thrift work well in this scenario, or is there some faster way to serialize and deserialize?
As I am only reading, not writing, I don't have to take care of consistency, right?
Can you recommend some books or online literature worth reading on this topic?
If I'm missing Information, just ask. Thanks in advance. I just want to be well informed and prepared before I start with the thesis, this is why I ask.
Kind regards
Michael
Cache is king
As a general recommendation: Cache is king, but don't use files.
Cache? What cache?
The cache I'm talking about is of course an external cache. There are plenty of systems available, and a lot of them are able to form a cache cluster with cached items spread across multiple machines' RAM. If you do it cleverly, the cost of serializing/deserializing into memory will make your algorithms shine, compared to the cost of grinding the database. And on top of that, you get nice features like TTL for cached data, a cache that persists even if your business logic crashes, and much more.
What about consistency?
As I am only reading, not writing, I don't have to take care of consistency, right?
Wrong. The issue is not who writes to the database. It is about whether or not someone writes to the database, how often this happens, and how up-to-date your data needs to be.
Even if you cache your data into a file as planned in your question, you have to be aware that this produces a redundant data duplicate, disconnected from the original data source. So the real question you have to answer (I can't do this for you) is what the optimum update frequency should be. Do you need immediate updates in near real time? Or is a certain time lag acceptable?
This is exactly the purpose of the TTL (time to live) value that you can put onto your cached data. If you need more frequent updates, set a short TTL. If you are ok with updates in a slower frequency, set the TTL accordingly or have a scheduled task/thread/process running that does the update.
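To show what a TTL does, here is a tiny in-process sketch. It is not how Redis or Memcached are used or implemented; `TtlCache` is a hypothetical class, and the optional `now` parameter exists only so the expiry logic can be exercised without waiting.

```cpp
#include <cassert>
#include <chrono>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical string cache whose entries expire after a fixed time to live.
class TtlCache {
public:
    using Clock = std::chrono::steady_clock;

    explicit TtlCache(std::chrono::seconds ttl) : ttl_(ttl) {
        assert(ttl.count() > 0);
    }

    // Stores the value and stamps it with its expiry time.
    void put(const std::string& key, std::string value,
             Clock::time_point now = Clock::now()) {
        entries_[key] = Entry{std::move(value), now + ttl_};
    }

    // Returns the value if present and still fresh, otherwise nothing.
    // An expired entry is dropped, so the caller reloads from the database.
    std::optional<std::string> get(const std::string& key,
                                   Clock::time_point now = Clock::now()) {
        auto it = entries_.find(key);
        if (it == entries_.end()) return std::nullopt;
        if (now >= it->second.expires) {
            entries_.erase(it);
            return std::nullopt;
        }
        return it->second.value;
    }

private:
    struct Entry {
        std::string value;
        Clock::time_point expires;
    };

    std::chrono::seconds ttl_;
    std::unordered_map<std::string, Entry> entries_;
};
```

A short TTL means a miss (and therefore a fresh database read) happens more often; a long TTL trades freshness for fewer database hits.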
Ok, understood. Now what?
Check out Redis, or the "oldtimer" Memcached. You didn't say much about your platform, but there are Linux and Windows versions available for both (and especially on Windows you will have a lot more fun with Redis).
PS: Oh yes, Thrift serialization can be used for the serialization part.

What is cache in C++ programming? [closed]

Firstly I would like to tell that I come from a non-Computer Science background & I have been learning the C++ language.
I am unable to understand what exactly is a cache?
It has different meaning in different contexts.
I would like to know what would be called as a cache in a C++ program?
For example, if I have some int data in a file. If I read it & store in an int array, then would this mean that I have 'cached' the data?
To me it seems like common sense to keep the data this way, since reading from a file is always slower than reading from RAM.
But I am a little confused due to this article.
In a CPU there can be several caches, to speed up instructions in
loops or to store often accessed data. These caches are small but very
fast. Reading data from cache memory is much faster than reading it
from RAM.
It says that reading data from cache is much faster than from RAM.
I thought RAM & cache were the same.
Can somebody please clear my confusion?
EDIT: I am updating the question because previously it was too broad.
My confusion started with this answer. He says
RowData and m_data are specific to my implementation, but they are
simply used to cache information about a row in the file
What does cache in this context mean?
Any modern CPU has several layers of cache that are typically named things like L1, L2, L3 or even L4. This is called a multi-level cache. The lower the number, the faster the cache will be.
It's important to remember that the CPU runs at speeds that are significantly faster than the memory subsystem. It takes the CPU a tiny eternity to wait for something to be fetched from system memory: many, many clock cycles elapse from the time the request is made to when the data is fetched, sent over the system bus, and received by the CPU.
There's no programming construct for dealing with caches, but if your code and data can fit neatly in the L1 cache, then it will be fastest. Next is if it can fit in the L2, and so on. If your code or data cannot fit at all, then you'll be at the mercy of the system memory, which can be orders of magnitude slower.
This is why counter-intuitive things like unrolling loops, which should be faster, might end up being slower because your code becomes too large to fit in cache. It's also why shaving a few bytes off a data structure could pay huge dividends even though the memory footprint barely changes. If it fits neatly in the cache, it will be faster.
The only way to know if you have a performance problem related to caching is to benchmark very carefully. Remember each processor type has varying amounts of cache, so what might work well on your i7 CPU might be relatively terrible on an i5.
It's only in extremely performance sensitive applications that the cache really becomes something you worry about. For example, if you need to maintain a steady 60FPS frame rate in a game, you'll be looking at cache problems constantly. Every millisecond counts here. Likewise, anything that runs the CPU at 100% for extended periods of time, such as rendering video, will want to pay very close attention to how much they could gain from adjusting the code that's emitted.
You do have control over how your code is generated with compiler flags. Some will produce smaller code, some theoretically faster by unrolling loops and other tricks. To find the optimal setting can be a very time-consuming process. Likewise, you'll need to pay very careful attention to your data structures and how they're used.
[Cache] has different meaning in different contexts.
Bingo. Here are some definitions:
Cache
Verb
Definition: To place data in some location from which it can be more efficiently or reliably retrieved than its current location. For instance:
Copying a file to a local hard drive from some remote computer
Copying data into main memory from a file on a local hard drive
Copying a value into a variable when it is stored in some kind of container type in your procedural or object oriented program.
Examples: "I'm going to cache the value in main memory", "You should just cache that, it's expensive to look up"
Noun 1
Definition: A copy of data that is presumably more immediately accessible than the source data.
Examples: "Please keep that in your cache, don't hit our servers so much"
Noun 2
Definition: A fast access memory region that is on the die of a processor, modern CPUs generally have several levels of cache. See cpu cache, note that GPUs and other types of processors will also have their own caches with different implementation details.
Examples: "Consider keeping that data in an array so that accessing it sequentially will be cache friendly"
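The last example can be made concrete with a small sketch. `Grid`, `sumRowMajor` and `sumColumnMajor` are hypothetical names. Both functions compute the same sum over one contiguous array; the first visits memory sequentially, the second jumps `cols` elements on every step. On large grids the sequential version is typically faster because each fetched cache line is fully used, but that is something to confirm by benchmarking on the target machine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A rows x cols grid stored row by row in one contiguous array.
struct Grid {
    std::size_t rows;
    std::size_t cols;
    std::vector<int> cells;

    Grid(std::size_t r, std::size_t c) : rows(r), cols(c), cells(r * c, 1) {}

    int at(std::size_t r, std::size_t c) const {
        assert(r < rows && c < cols);
        return cells[r * cols + c];
    }
};

// Inner loop walks neighbouring elements: sequential memory access.
long long sumRowMajor(const Grid& g) {
    long long sum = 0;
    for (std::size_t r = 0; r < g.rows; ++r)
        for (std::size_t c = 0; c < g.cols; ++c)
            sum += g.at(r, c);
    return sum;
}

// Inner loop strides by a whole row: same result, scattered memory access.
long long sumColumnMajor(const Grid& g) {
    long long sum = 0;
    for (std::size_t c = 0; c < g.cols; ++c)
        for (std::size_t r = 0; r < g.rows; ++r)
            sum += g.at(r, c);
    return sum;
}
```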
My definition of a cache would be something that is limited in size but faster to access, as there is less area to search. If you are talking about caching in any programming language, then it means you are storing some information in memory in the form of a variable (a variable is nothing but a way to locate your data in memory). Here memory means both RAM and the physical cache (CPU cache).
Physical/CPU cache is nothing but memory that is even faster than RAM; it stores copies of some of the data in RAM that the CPU uses very often. There is another level of categorisation after that as well: on-board cache (faster) and off-board cache. You can see this link.
I am updating the question because previously it was too broad. My
confusion started with this answer. He says
RowData and m_data are specific to my implementation,
but they are simply used to cache information about a row in the file
What does cache in this context mean?
This particular use means that RowData is held as a copy in memory, rather than reading (a little bit of) the row from a file every time we need some data from it. Reading from a file is a lot slower [1] than holding on to a copy of the data in our program's memory.
[1] Although in a modern OS, the actual data from the hard-disk is probably held in memory, in file-system cache, to avoid having to read the disk many times to get the same data over and over. However, this still means that the data needs to be copied from the file-system cache to the application using the data.
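A minimal sketch of this kind of caching, assuming a text file with one row per line. `RowCache` is a hypothetical class and is not the implementation from the answer being discussed; only the member name `m_data` is borrowed from it. The file is read the first time a row is requested, and later requests for the same row are served from the copy held in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical cache of rows (lines) read from a text file.
class RowCache {
public:
    explicit RowCache(std::string path) : path_(std::move(path)) {
        assert(!path_.empty());
    }

    // Returns row `index` (0-based). The file is only read on a cache miss.
    // A row that does not exist is cached as an empty string.
    const std::string& row(std::size_t index) {
        auto it = m_data.find(index);
        if (it != m_data.end()) return it->second;  // cache hit

        ++fileReads_;
        std::ifstream in(path_);
        std::string line;
        for (std::size_t i = 0; i <= index; ++i) {
            if (!std::getline(in, line)) {
                line.clear();
                break;
            }
        }
        return m_data.emplace(index, std::move(line)).first->second;
    }

    // How many times the file actually had to be opened and scanned.
    std::size_t fileReads() const { return fileReads_; }

private:
    std::string path_;
    std::unordered_map<std::size_t, std::string> m_data;  // the cache
    std::size_t fileReads_ = 0;
};
```

Asking for the same row twice opens the file only once, which is the whole point of the cache.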