Unit tests in systems programming? - C++

I would like to start learning/using unit tests in C++. However, I'm having a hard time applying the concept of tests to my field of programming.
I'm usually not writing functions which follow a predefined input/output pattern, instead, my programming is usually on a level rather close to the operating system.
Some examples are: find out Windows version, create a system restore point, query registry for installed drives, compress a file, or recursively find all .log files older than X days.
I don't see how I could hard-code "results" into a testing function. Are unit tests even possible in my case?

The "result" doesn't have to be a CONSTANT value, it could be something that the code finds out. For example, if you are compressing a file, the result would be a file that, when uncompressed, gives you the original file. So the test would be to take an existing test-file , compress it, and then uncompress the resulting compressed file, then compare the two files. If the result is "no difference", it's a pass. If the files are not the same, you have a problem of some sort.
The same principle can be applied to any of your other methods. Finding log files would of course require that you prepare a number of files and give them different timestamps (using SetFileTime or some such).
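A test-fixture helper that backdates a log file's last-write time might be as small as this (a sketch using the Win32 SetFileTime call; error handling is minimal):

    #include <windows.h>

    // Backdate a file's last-write time by `days`, so the "older than X days"
    // logic can be exercised against freshly created fixture files.
    bool BackdateFile(const wchar_t* path, int days) {
        HANDLE h = CreateFileW(path, FILE_WRITE_ATTRIBUTES, 0, nullptr,
                               OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
        if (h == INVALID_HANDLE_VALUE) return false;

        FILETIME now;
        GetSystemTimeAsFileTime(&now);

        ULARGE_INTEGER t;
        t.LowPart = now.dwLowDateTime;
        t.HighPart = now.dwHighDateTime;
        t.QuadPart -= static_cast<ULONGLONG>(days) * 24 * 60 * 60 * 10000000ULL;  // 100-ns units

        FILETIME past;
        past.dwLowDateTime = t.LowPart;
        past.dwHighDateTime = t.HighPart;

        BOOL ok = SetFileTime(h, nullptr, nullptr, &past);   // last-write time only
        CloseHandle(h);
        return ok != 0;
    }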
Getting Windows version should give you the version of the Windows you are currently using.
And so on.
Of course, you should also have "negative" tests whenever possible. If you are compressing a file, what happens if you try to compress a file that doesn't exist? What if the disk is full (a virtual hard disk or similar can help here, since filling your real disk is unlikely to end well)? If the specification says the code should behave in a certain way, verify that it gives the correct error message. Otherwise, at least ensure it doesn't crash or fail without an error message of some sort.

You can also use mock objects to fake the OS calls: define a class OS whose methods mimic the system calls, so your algorithm never calls the global OS functions directly. For testing, you then construct a fake OS class that returns hard-coded values.
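A minimal sketch of that idea (the interface, method names, and return values here are invented; shape the abstraction around whatever calls your code actually makes):

    #include <string>

    // Abstraction over the OS calls the production code needs.
    struct OS {
        virtual ~OS() = default;
        virtual std::string GetWindowsVersion() = 0;
        virtual std::string QueryRegistryString(const std::string& key) = 0;
    };

    // Real implementation: forwards to the actual Win32 APIs (defined elsewhere).
    struct RealOS : OS {
        std::string GetWindowsVersion() override;
        std::string QueryRegistryString(const std::string& key) override;
    };

    // Fake used by the unit tests: returns canned values.
    struct FakeOS : OS {
        std::string GetWindowsVersion() override { return "10.0.19045"; }
        std::string QueryRegistryString(const std::string&) override { return "C:\\Drivers\\foo.sys"; }
    };

    // The code under test takes the abstraction instead of calling globals.
    std::string DescribeSystem(OS& os) {
        return "Windows " + os.GetWindowsVersion();
    }

In production you pass in a RealOS; in the tests you pass a FakeOS, and the code under test cannot tell the difference.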

Related

How to use flat buffers when the schema is not fixed?

The current workings of my C++ application are as follows:
1. It involves launching another process and using Windows shared memory to communicate between the two processes.
2. The data is serialized in one process and deserialized in the other. However, the data type can vary based on user input, so the type is serialized as well so that the deserializer can interpret the data correctly.
Now, I am intending to use flat-buffer to serialize and deserialize data (because of its obvious advantages - random access and backward compatibility).
However, to do that I need clarity in some areas and hoping for some help on them.
Based on the data type, I can programmatically generate a schema and feed it to flatc.exe to generate files. However, instead of using flatc.exe, I am thinking of building flatc.dll (from the open-source code) and using that, to keep the interaction simpler. Does that sound wise?
Secondly, what I am more unsure about is the following. I will create a schema and invoke the FlatBuffers compiler while the application is running. It will generate some C++ files. As far as I understand, I would then need to build those files somehow, and the resulting binary would have to be plugged into both the serializer and the deserializer to handle the actual data, all while the application is running. How do I achieve all this? This problem stems from the fact that my application does not have a fixed schema. What is the general approach to using flat buffers when the schema is variable?
I hope I am clear about what I am intending to ask. If not, please let me know. I will be happy to provide more details. Thanks for your answers in advance.
The answer is that you do not want this. While it is feasible, generating C++ at runtime, compiling it into a DLL, and then loading it back into your process is an extremely clumsy way of doing it.
The data structures of your program must be known at compile time (if it is written in C++), so why can't you define a schema for that just once and compile it ahead of time? Does your program allow the user to "design" data structures at runtime?
For extremely dynamic use cases such as where the user can create arbitrary objects, I'd recommend FlexBuffers (https://google.github.io/flatbuffers/flexbuffers.html). They can be used inside a FlatBuffer to store "unknown" data, or even as their own serialization format. With these, you can serialize objects whose structure is only known at runtime, they have most of the same efficiency properties as FlatBuffers, and you won't need to bundle a C++ compiler with your program :)
Best is a combination of the two, where all compile time known data is stored in FlatBuffers, and the remainder in FlexBuffers.
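A minimal FlexBuffers sketch, based on the flexbuffers::Builder API documented at the link above (the keys and values are made up for illustration):

    #include "flatbuffers/flexbuffers.h"
    #include <cstdint>
    #include <vector>

    // Serialize a structure that is only decided at runtime.
    std::vector<uint8_t> SerializeDynamic() {
        flexbuffers::Builder fbb;
        fbb.Map([&]() {
            fbb.Int("user_id", 42);
            fbb.String("name", "example");
            fbb.Vector("samples", [&]() {
                fbb.Double(1.5);
                fbb.Double(2.25);
            });
        });
        fbb.Finish();
        return fbb.GetBuffer();
    }

    // The reader walks the same structure without any generated code.
    void DeserializeDynamic(const std::vector<uint8_t>& buf) {
        auto root = flexbuffers::GetRoot(buf).AsMap();
        int64_t id   = root["user_id"].AsInt64();
        auto samples = root["samples"].AsVector();
        double first = samples[0].AsDouble();
        (void)id; (void)first;
    }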

How to know if files have been changed?

I'm writing a custom C++ program that copies files only if they were changed in the source since the last time they were copied. So I need to know if files in my specific folder were changed.
I was originally thinking about calculating a SHA-1 hash of those files, but that probably means I'd have to do it for the entire folder. Plus, what if the size of those files is 100 GB? That would mean calculating SHA-1 over 100 GB of data, which would take some time.
So I'm curious if there's a better way to do this?
You have at least a couple of possibilities.
One would be to use NTFS change journals to track what files have been modified.
Each file also has an "archive" flag associated with it. This is typically used by backup programs. Any time you write to a file, the flag is set. When you copy/back it up, you clear the flag. When you want to see what files to copy/backup, you just check whether the flag is set or clear. Obvious problem: collisions with other backup programs.
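Checking and clearing the archive bit is just a pair of Win32 attribute calls, roughly like this (sketch; minimal error handling):

    #include <windows.h>

    // True if the archive bit is set (the file was written since the last backup).
    bool IsArchiveBitSet(const wchar_t* path) {
        DWORD attrs = GetFileAttributesW(path);
        return attrs != INVALID_FILE_ATTRIBUTES && (attrs & FILE_ATTRIBUTE_ARCHIVE) != 0;
    }

    // Clear the archive bit after the file has been copied/backed up.
    bool ClearArchiveBit(const wchar_t* path) {
        DWORD attrs = GetFileAttributesW(path);
        if (attrs == INVALID_FILE_ATTRIBUTES) return false;
        return SetFileAttributesW(path, attrs & ~FILE_ATTRIBUTE_ARCHIVE) != 0;
    }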
There is also ReadDirectoryChangesW.[1] This, however, can only detect changes that happen while the code that uses it is running. So, to use it to track changes you need something like a service that runs in the background all the time to keep track of them. Depending on the file and how it gets modified, it's still possible for even this to miss changes that happen during boot (before your service starts executing).
I've listed these in roughly descending order of how well they seem to fit your needs--i.e., change journals are almost certainly the best fit, the archive flag second and ReadDirectoryChangesW (by quite a large margin) the worst fit for your apparent needs.
[1] There's also an older pair, FindFirstChangeNotification/FindNextChangeNotification, but they're less versatile and have all the same shortcomings as ReadDirectoryChangesW. At one time they were useful for code that needed to be compatible with Windows 95/98/SE (since those didn't include ReadDirectoryChangesW), but it's been years since there was a good reason to use them.
In comments for other answers, you've stated that you can't use a file-monitoring API (such as FindFirstChangeNotification) since your code may not be running at the time the change occurs.
I would suggest a multi-pronged approach.
If your application is running, use the file monitoring APIs to detect new changes.
On startup or when a new disk appears, check to see if the file size is the same as before. If it isn't, then you know you have a change.
If the file size is the same, you could use the file's archive flags to determine if it has changed. However, the archive flag is easily altered by users and therefore you probably shouldn't rely on it.
Use the file's last altered timestamp. This can be modified by users, but it's more difficult to do.
Use a hash to determine if the file has changed. The hash you pick depends on how important it is to detect changes. If it isn't critical, something like CRC32 or MD5 would be sufficient. If it needs to be secure, consider SHA-256. Consider breaking large files into chunks so you don't have to hash the whole file before getting a "this changed" result.
This tiered approach lets you skip the expensive hashing whenever you can.
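A sketch of that tiered check using std::filesystem, with a cheap chunked FNV-1a hash standing in for whichever hash (CRC32/MD5/SHA-256) you settle on:

    #include <cstdint>
    #include <filesystem>
    #include <fstream>

    namespace fs = std::filesystem;

    // Cheap chunked FNV-1a hash as a stand-in; swap in CRC32/MD5/SHA-256 as needed.
    uint64_t HashFile(const fs::path& p) {
        std::ifstream in(p, std::ios::binary);
        uint64_t h = 1469598103934665603ULL;              // FNV offset basis
        char buf[64 * 1024];
        while (in) {
            in.read(buf, sizeof(buf));
            for (std::streamsize i = 0, n = in.gcount(); i < n; ++i)
                h = (h ^ static_cast<unsigned char>(buf[i])) * 1099511628211ULL;
        }
        return h;
    }

    // What was recorded about the file the last time it was copied.
    struct FileRecord { uintmax_t size; fs::file_time_type mtime; uint64_t hash; };

    // Tiered check: size first, then timestamp, then (only if needed) the hash.
    bool HasChanged(const fs::path& p, const FileRecord& last) {
        if (fs::file_size(p) != last.size) return true;          // cheap
        if (fs::last_write_time(p) != last.mtime) return true;   // still cheap
        return HashFile(p) != last.hash;                         // expensive fallback
    }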
If you want to do it in "real time", Windows has a native API for that: FindFirstChangeNotification().

I need help developing a polymorphism engine - instruction dependency trees

I am currently trying to write a polymorphism engine in C++ to toy around with a neat anti-hacking stay-alive checking idea I have. However, writing the polymorphism engine is proving rather difficult - I haven't even established how I should go about doing it. The idea is to stream executable code to the user (that is, the application I am protecting) and occasionally send them some code that runs checksums on the memory image and returns the result to the server. The problem is that I don't want someone to simply hijack or programmatically crack the stay-alive check; instead each one would be generated on the server from a simple stub of code and run through a polymorphism engine. Each stay-alive check would return a value dependent on the checksum of the data and a random algorithm sneaked inside of the stay-alive check. If the stub returns incorrectly, I know that the stay-alive check has been tampered with.
What I have to work with:
*The executable image's PDB file.
*An assembler & disassembler engine, for which I have implemented an interface that allows me to relocate code, etc.
Here are the steps I was thinking of taking and how I might do them. I am working with the x86 instruction set on a Windows PE executable.
Steps I plan on taking (my problem is with step 2):
Expand instructions
Find simple instructions like mov or push and replace them with a couple of instructions that achieve the same result, though using more instructions. In this step I'll also add loads of junk code.
I plan on doing this just by using a series of translation tables in a database. This shouldn't be very difficult to do.
Shuffling
This is the part I have the most trouble with. I need to isolate the code into functions. Then I need to establish a series of instruction dependency trees, and then I need to relocate the instructions based upon which ones depend on each other. I can find the functions by parsing the PDB files, but creating instruction dependency trees is the tricky part I am totally lost on.
Compress instructions
Compress instructions and implement a series of uncommon & obscure instructions in the process. And, like the first step, do this using a database of code signatures.
To clarify again, I need help performing step number 2 and am unsure how I should even begin. I have tried making some diagrams but they become very confusing to follow.
Oh, and obviously the protected code is not going to be very optimal - but this is just a security project I wanted to play with for school.
I think what you are after for "instruction dependency trees" is data flow analysis. This is classic compiler technology that determines for each code element (primitive operations in a programming language), what information is delivered to it from other code elements. When you are done, you end up with what amounts to a graph with nodes being code elements (in your case, individual instructions) and directed arcs between them showing what information has to flow so that the later elements can execute on results produced by "earlier" elements in the graph.
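To make that concrete, a toy flow-dependency builder over a straight-line instruction sequence might look like this (the instruction representation and register model are invented for illustration; a real x86 model also needs flags, memory operands, and control flow):

    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    // Toy instruction: which registers it reads and which it writes.
    struct Insn {
        std::string text;
        std::set<std::string> reads;
        std::set<std::string> writes;
    };

    // Edge i -> j means instruction j consumes a value produced by instruction i,
    // so j must stay after i when the code is shuffled.
    std::multimap<size_t, size_t> BuildFlowDependencies(const std::vector<Insn>& code) {
        std::multimap<size_t, size_t> deps;
        std::map<std::string, size_t> lastWriter;     // register -> index of latest def
        for (size_t i = 0; i < code.size(); ++i) {
            for (const auto& r : code[i].reads) {
                auto it = lastWriter.find(r);
                if (it != lastWriter.end())
                    deps.emplace(it->second, i);      // read-after-write dependency
            }
            for (const auto& w : code[i].writes)
                lastWriter[w] = i;
        }
        return deps;
    }

A real engine also has to respect anti- and output dependencies (write-after-read and write-after-write) before it can safely reorder anything, which is where the conservative "may feed" analysis discussed below comes in.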
You can see some examples of such flow analysis at my website (focused on a tool that does program analysis; this tool is probably not appropriate for binary analysis, but the examples should be helpful).
There's a ton of literature in the compiler books on doing data flow analysis. See any compiler text book.
There are a number of issues you need to handle:
Parsing the code to extract the code elements. You sound like you have access to all the instructions already.
Determining the operands required by, and the values produced by, each code element. This is pretty simple for "ADD register, register", but you may find it daunting for a production x86 CPU, which has an astonishingly big and crazy instruction set. You have to collect this for every instruction the CPU might execute, and that means pretty much all of them. Nonetheless, this is just sweat and a lot of time spent looking at the instruction reference manuals.
Loops. Values can flow from an instruction, through other instructions, back to that same instruction, so the dataflows can form cycles (lots of cycles for complex loops). The dataflow literature will tell you how to handle this in terms of computing the dataflow arcs in the graph. What these mean for your protection scheme I don't know.
Conservative Analysis: You can't get the ideal data flow, because in fact you are analyzing arbitrary algorithms (e.g., a Turing machine); pointers aggravate this problem pretty severely and machine code is full of pointers. So what the data flow analysis engines often do when unable to decide if "x feeds y", is to simply assume "x (may) feed y". The dataflow graph turns conceptually from "x (must) feed y" into the pragmatic "x (may) feed y" type arcs; the literature in fact is full of "must" and "may" algorithms because of this.
Again, the literature tells you many ways to do [conservative] flow analysis (mostly having different degrees of conservatism; in fact the most conservative data flow analysis simply says "every x feeds every y"!). What this means in practice for your scheme, I don't know.
There are a lot of people interested in binary code analysis (e.g., the NSA), and they do data flow analysis on machine instructions, complete with pointer analysis. You might find this presentation interesting: http://research.cs.wisc.edu/wisa/presentations/2002/0114/gogul/gogul.1.14.02.pdf
I'm not sure that what you are trying to do helps prevent tampering with a process. If someone attaches a debugger (process) and breaks on the send/receive functions, the checksum of the memory image stays intact, all the shuffling stays as it is, and the client will be seen as valid even if it isn't. The debugger or injected code can also manipulate the answer when you ask which pages are used by your process (so you won't see injected code, since it wouldn't tell you the pages in which it resides).
To your actual question:
Couldn't the shuffling be implemented by relinking the executable? The linker keeps track of all the symbols that a .o file exports and imports. When all the .o files are read, the real addresses of the functions are put into the imported placeholders. If you put every function in a separate .cpp file, compile each to its own .o file, and reorder the .o files in the linker call, all the functions will end up at different addresses and the executable will still run fine.
I tested this with gcc on Windows - and it works. By reordering the .o files when linking, all functions are placed at different addresses.
"I can find functions by parsing the pdb files, but creating instruction dependency trees is the tricky part I am totally lost on."
Impossible. Welcome to the Halting Problem.

Writing raw data "signature" to disk without corrupting filesystem

I am trying to create a program that will write a series of 10-30 letters/numbers to a disk in raw format (not to a file that the OS will read). Perhaps to make my attempt clearer, if you were to open the disk in a hex editor, you would see the 10-30 letters/numbers but a file manager such as Windows Explorer would not see it (because the data is not a file).
My goal is to be able to "sign" a disk with a series of characters and to be able to read and write that "signature" from my program. I understand NTFS marks its partitions with an NTFS flag, as do other file systems, and I have to be careful not to write my signature over any of those critical parts.
Are there any libraries in C++/C that could help me write at a low level to a disk and how will I know a safe sector to start writing my signature to? To narrow this down, it only needs to be able to write to NTFS, FAT, FAT32, FAT16 and exFAT file systems and run on Windows. Any links or references are greatly appreciated!
Edit: After some research, USB drives allow only 1 partition without applying hacking tricks that would unfold further problems for the user. This rules out the "partition idea" unfortunately.
First, as the commenters said, you should look at why you're trying to do this, and see if it's really a good idea. Most apps which try to circumvent the normal UI the user has for using his/her computer are "bad", in various ways.
That said, you could try finding a well-known file which will always be on the system and has some slack in the block size for the disk, and write to the slack. I don't think most filesystems would care about extra data in the slack, and it would probably even be copied if the file happens to be relocated (more efficient to copy the whole block at the disk level).
Just a thought; dunno how feasible it would be, but seems like it could work in theory.
Though I think this is generally a pretty poor idea, the obvious way to do it would be to mark a cluster as "bad", then use it for your own purposes.
Problems with that:
Marking it as bad is non-trivial. On NTFS, bad clusters are recorded in a file named something like $BadClus, but it's not accessible to user code (and I'm not even sure it's accessible to a device driver either).
There are various programs to scan for (and attempt to repair) bad clusters/sectors. Since we don't even believe this one is really bad, almost any of these that works at all will find that it's good and put it back into use.
Most of the reasons people think of doing things like this (like tying a particular software installation to a particular computer) are pretty silly anyway.
You'd have to scan through all the "bad" sectors to see if any of them contained your signature.
This is very dangerous; however, zero-fill programs do the same kind of raw writes, so you can google how to wipe your hard drive with zeros in C++.
The hard part is finding a place you KNOW is unused and won't be used.
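As for the raw access itself, on Windows you can open the physical drive (or a single volume) directly with CreateFile and read/write whole sectors. A read-only sketch follows; note that opening a physical drive needs administrator rights even for reading, writes additionally require locking or dismounting the volume, and a wrong write really will corrupt the filesystem:

    #include <windows.h>
    #include <cstdio>
    #include <vector>

    int main() {
        // "\\.\PhysicalDrive0" is the whole first disk; "\\.\C:" would be one volume.
        HANDLE h = CreateFileW(L"\\\\.\\PhysicalDrive0", GENERIC_READ,
                               FILE_SHARE_READ | FILE_SHARE_WRITE, nullptr,
                               OPEN_EXISTING, 0, nullptr);
        if (h == INVALID_HANDLE_VALUE) {
            std::printf("open failed: %lu\n", GetLastError());
            return 1;
        }

        // Raw device I/O must be done at sector-aligned offsets in whole-sector sizes.
        std::vector<char> sector(512);
        DWORD read = 0;
        if (ReadFile(h, sector.data(), static_cast<DWORD>(sector.size()), &read, nullptr))
            std::printf("read %lu bytes of sector 0\n", read);

        CloseHandle(h);
        return 0;
    }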

Visualization from C/C++ via Gnuplot's pipe interface

I am attempting to use the pipe interface to gnuplot (a standard one, gnuplot_i.{cpp,hpp}) in order to generate a real-time display of values that are continually changing within another program written in C++. This works OK, but I wanted to see if anyone had any suggestions for improvement.
This implementation contains convenience methods to plot a single vector and two vectors as a 2D plot. It achieves this by writing out to a temporary file via a standard library call to the mktemp function and then using that as input to a gnuplot plot call. This generated too many temporary files and didn't appear to work well when the update rate of the plot is high (maybe I/O limited at some point). I have decided to use the '-' pseudo-file in the plot call and just send the vectors directly to the pipe (terminated with a single line containing "e"). This works better but is still not great.
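For reference, the inline-data approach boils down to something like this (a bare-bones sketch using the POSIX popen; the data and plot command are placeholders):

    #include <cmath>
    #include <cstdio>

    int main() {
        // "-persist" keeps the window open; on Windows you'd use _popen and pgnuplot.
        FILE* gp = popen("gnuplot -persist", "w");
        if (!gp) return 1;

        for (int frame = 0; frame < 100; ++frame) {
            std::fprintf(gp, "plot '-' with lines title 'live data'\n");
            for (int i = 0; i < 200; ++i)
                std::fprintf(gp, "%d %f\n", i, std::sin(0.1 * i + 0.05 * frame));
            std::fputs("e\n", gp);   // terminates the inline '-' data block
            std::fflush(gp);         // push this frame to gnuplot immediately
        }
        pclose(gp);
        return 0;
    }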
Is there a slicker way to do what I am attempting to do than to continually regenerate the plot when the values have changed? How often is it safe to update the plot with new information? Alternatively, maybe there's a much simpler way to achieve what I am trying to do?
@Andy Ross
I have no "requirements" per se. What I meant by slick was that maybe there was a more elegant approach to doing what I was attempting while still using gnuplot. Although elegant is subjective, I find the approach I am presently taking particularly inelegant. What I meant by safe was whether anyone knew at what update rate there would be IO problems (e.g., latency, lock-up of display, etc.) with said approach.
I'd like to avoid using a toolkit for the following reasons (my short-list at least).
I have found that they are generally nontrivial to install properly on different architectures especially as non-root (and when they require dependencies that aren't standard across OSes).
They incur an additional compilation dependency for other people using this software.
There doesn't appear to be any real standard that most people use for this purpose, afaik (I, as well as most people I work with, generally just save off log-type files and do post-run analysis in MATLAB).
I know (or am learning) gnuplot syntax; I do not know superPlottingApiXX's syntax.
The feature set of gnuplot is almost ideal for the types of things I'd like to be able to do with this software.
However, if you have any particular suggestions in terms of C/C++ plotting libraries that seem like a good fit given the above list I am always interested in suggestions (warning: I have already looked around a good bit to find them).
gnuplot-cpp is an object-oriented C++ wrapper interface around a POSIX pipe connection with Gnuplot.
The example file compiled right away and the interfacing code looks decent; I'll be trying it in my current project.
There is a C2gnuplot library I wrote a few years ago. It is very simple but may give you some tips. Basically it uses FIFO files to pass data into Gnuplot. It is able to generate animation from the plots. Here is a video created with the app. I hope this will be useful for you.
Slicker? Safe? Can you be more specific about your requirements?
It sounds like you are trying to do an animated visualization with a tool designed for generating static images. If your display is as simple as you say, why not write a quick GUI app (using the toolkit of your choice) to do the drawing instead?