We need to implement a feature in our program that syncs two or more watched folders.
In reality, the folders will reside on different computers on the local network, but to narrow down the problem, let's assume the tool runs on a single computer, and has a list of watched folders that it needs to sync, so any changes to one folder should propagate to all others.
There are several problems I've thought about so far:
Deleting files is a valid change, so if folder A has a file but folder B doesn't, it could mean that the file was created in folder A and needs to propagate to folder B, but it could also mean that the file was deleted in folder B and needs to propagate to folder A.
Files might be changed/deleted simultaneously in several directories, and with conflicting changes, I need to somehow resolve the conflicts.
One or more of the folders might be offline at any time, so changes must be stored and later propagated to it when it comes online.
I am not sure what kind of help, if any, the community can offer here, but I'm thinking along these lines:
If you know of a tool that already does this, please point it out. Our product is closed-source and commercial, however, so its license must be compatible with that for us to be able to use it.
If you know of any existing literature or research on the problem (papers and such), please link to it. I assume that this problem would have been researched already.
Or if you have general advice on the best way to approach this problem: which algorithms to use, how to resolve conflicts or race conditions if they exist, and other gotchas.
The OS is Windows, and I will be using Qt and C++ to implement it, if no tools or libraries exist.
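For what it's worth, here is roughly how I imagine the change-detection side starting out with Qt's QFileSystemWatcher; the folder paths are placeholders and I'm assuming Qt 5-style connects, and the actual diffing and propagation is the part I'm unsure about:

#include <QCoreApplication>
#include <QFileSystemWatcher>
#include <QStringList>
#include <QDebug>

int main(int argc, char *argv[])
{
    QCoreApplication app(argc, argv);

    // Placeholder paths; the real tool would read its list of watched folders from configuration.
    QStringList watched;
    watched << "C:/sync/folderA" << "C:/sync/folderB";

    QFileSystemWatcher watcher;
    watcher.addPaths(watched);

    // directoryChanged fires when a file is added, removed, or renamed inside a watched directory.
    QObject::connect(&watcher, &QFileSystemWatcher::directoryChanged,
                     [](const QString &path) {
        qDebug() << "Directory changed:" << path;
        // Rescan 'path', diff it against the last known snapshot,
        // and queue the resulting changes for propagation to the other folders.
    });

    return app.exec();
}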
It's not exceptionally hard. You just need to compare the relevant change journal records. Of course, in a distributed network you have to assume the clocks are synchronized.
And yes, if a complex file (anything you can't parse) is edited while the network is split, you cannot avoid problems. This is known as the CAP theorem: your system cannot be Consistent, always Available, and also resistant to Partitioning (going offline) all at the same time.
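To make the comparison concrete, here is a rough sketch, not a full design, of a last-writer-wins decision between two change records for the same relative path; the ChangeRecord struct and its fields are invented for the example, and it only works if the clocks really are in sync:

#include <cstdint>
#include <string>

// Hypothetical record distilled from a change journal: what happened to a path, and when.
struct ChangeRecord {
    std::string relativePath;
    std::uint64_t timestamp;   // e.g. 100-ns ticks since an agreed epoch
    bool deleted;              // true if this change removed the file
};

// Decide which side's change wins for the same path: the later timestamp.
// A delete is just another change, which resolves the "missing file" ambiguity:
// if B's delete is newer than A's create/modify, the delete propagates to A.
const ChangeRecord& resolve(const ChangeRecord& a, const ChangeRecord& b)
{
    return (a.timestamp >= b.timestamp) ? a : b;
}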
I would focus on libraries, though it could just as well be a general application installation.
When we install a library (say, a C++ one), a novice user like me probably expects that "installing" it means all of its source code gets copied somewhere, with a few flags and path variables set, so that we can directly use #include-style statements in our own code and start using it.
But by inspection I can say that the exact source files are not actually copied; instead, pre-compiled object forms of the files are copied, except for the so-called *.h header files. (I say this simply because I cannot find the source files anywhere on the hard disk, only the header files.)
My Questions:
What is the behind-the-scenes method when we "install" something? What are all the typical locations that get affected in a Linux environment, and what is the typical importance/use of each of those locations?
What is the difference between "installing" a library and installing a new application into the Linux system via "sudo apt-get" or the like?
Finally, if I have a custom set of source files which are useful as a library and want to send them to another system, how would I "install" my own library there, in the same way as above?
Just to clarify, my primary interest is to learn from your kind answers and literature pointers the bigger picture of a typical installation (of an application or a library), to a level where I can cross-check, learn, and re-do it if I want to.
(Question was removed; it addressed the difference between header and object files.) This is more a question of general programming. A header file is just the declaration of classes/functions/etc.; it does nothing. All a header file does is say "hey, I exist, this is what I look like." That is to say, it's just a declaration of the signatures used later in the actual code. The object code is just the compiled and assembled, but not yet linked, code. This diagram does a good job of explaining the steps of what we generally call the "compilation" process, but which would better be called the "compilation, assembling, and linking" process. Briefly, linking is pulling in all the necessary object files, including those needed from the system, to create a running executable which you can use.
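To make the declaration/definition split concrete, a small made-up example (file names are invented):

// greet.h - the declaration: "hey, I exist, this is what I look like."
#ifndef GREET_H
#define GREET_H
void greet(const char* name);
#endif

// greet.cpp - the definition, compiled into an object file (greet.o).
#include <cstdio>
#include "greet.h"
void greet(const char* name) { std::printf("Hello, %s\n", name); }

// main.cpp - only needs the header to compile; the linker pulls in greet.o.
#include "greet.h"
int main() { greet("world"); return 0; }

// Typical steps (shown as comments):
//   g++ -c greet.cpp -o greet.o     # compile + assemble, no linking
//   g++ -c main.cpp  -o main.o
//   g++ main.o greet.o -o hello     # link the object files into an executable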
(Now question 1) When you think about it, what is installation except the creation and modification of the necessary files with the appropriate content? That's what installing is: placing the new files in the appropriate place, and then modifying configuration files if necessary. As to what "locations" are typically affected, you usually see binaries placed in /bin, /usr/bin and /usr/local/bin; libraries are typically placed in /lib or /usr/lib. Of course this varies, depending on the distribution and the package. I think you'd find this page on Linux system directories to be an educational read. Remember, though, that anything can be placed pretty much anywhere and still work appropriately as long as you tell other things where to find it; these directories are just used because they keep things organized and allow for assumptions about where items, such as binaries, will be located.
(Now question 2) The only difference is that apt-get generally makes it easier by installing the item you need and keeping track of installed items; it also allows for easy removal of installed items. In terms of the actual installation, if you do it correctly manually, it should be the same. A package manager such as apt-get just makes life easier.
(Now question 3) If you want to do that, you could create your own package or, if it's less involved, you could just create a script that moves the files to the appropriate locations on the system. However you want to do it, as long as you get the items where they need to be. If you want to create a package yourself, it'd be a great learning experience, and there are plenty of tutorials online. Just find out what package system your flavor of Linux uses, then look for a tutorial on how to create packages of that type.
So the really big picture, in my opinion, of the installation process is just compilation (if necessary), then the moving of necessary files to their appropriate places on the system, and the modification of existing files on the system if necessary: Put your crap there, let the system know it's there if you need to.
Does the Linux/Ubuntu OS create a table that keeps an entry for every file, with its absolute address on the hard drive?
Just curious to know, because I am planning to make a file searcher program.
I know there are terminal commands like find etc., but as I will be programming in C, I was wondering whether Ubuntu does any such thing, and if so, how can I access that table?
Update:
As some people mentioned, there is no such thing. Then, if I want to make a file searcher program, I would have to search each and every folder of every directory, starting from the root directory. The resulting program will be very sluggish and will perform poorly! So is there a better way, or is my way good enough?
The "thing" you describe is commonly called a file system and as you may know there's a choice of file systems available for Linux: ext3, ext4, btrfs, Reiser, xfs, jffs, and others.
The table you describe would probably map quite well onto the inode-directory combo.
From my point of view, the entire management of where files are physically located on the hard disk is none of the user's business; it's strictly the operating system's domain and not something to mess with unless you have an excellent excuse (like writing a data recovery program) and very deep knowledge of the file system(s) involved. Moreover, in most cases a file's storage will not be contiguous, but spread over several locations on the disk (fragments).
But the more important question here is probably: what exactly do you hope to achieve by finding files this way?
EDIT: based on OP's comment I think there may be a serious misunderstanding here - I can't see the link between absolute file addresses and a file searcher, but that may be due to a fundamental difference between our respective understanding of "absolute address" in the context of a file system.
If you just want to look at all files in the file system you can either
perform a recursive directory read or
use the database prepared by updatedb as suggested by SmartGuyz
As you want to look into the files anyway - and that's where almost all the runtime will be spent - I can't think of any advantage 2) would have over 1), and 2) has the disadvantage of an external dependency: in this case, the file prepared by updatedb must exist and be very fresh.
An SO question about more advanced ways of traversing directories than good old opendir/readdir/closedir: Efficiently Traverse Directory Tree with opendir(), readdir() and closedir()
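For completeness, a minimal sketch of option 1), the recursive directory read, using opendir()/readdir()/closedir(); error handling is kept to a minimum, and d_type is a Linux/glibc extension:

#include <cstdio>
#include <cstring>
#include <string>
#include <dirent.h>

// Recursively list every entry below 'path'.
static void walk(const std::string& path)
{
    DIR* dir = opendir(path.c_str());
    if (!dir)
        return;  // unreadable directory: skip it

    while (struct dirent* entry = readdir(dir)) {
        if (std::strcmp(entry->d_name, ".") == 0 || std::strcmp(entry->d_name, "..") == 0)
            continue;

        std::string full = path + "/" + entry->d_name;
        std::printf("%s\n", full.c_str());

        // d_type may be DT_UNKNOWN on some filesystems; a robust version falls back to stat().
        if (entry->d_type == DT_DIR)
            walk(full);
    }
    closedir(dir);
}

int main()
{
    walk(".");
    return 0;
}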
EDIT2, based on OP's question addendum: yes, traversing directories takes time, but that's life. Consider the next best thing, i.e. locate and friends. It depends on a "database" that is updated regularly (typically once daily), so all files that were added or renamed after the last scheduled update will not be found, and files that were removed after the last scheduled update will still be mentioned in the database although they don't exist anymore. And that assumes locate is even installed on the target machine, something you can't be sure of.
As with most things in programming, it never hurts to look at previous solutions to the same problem, so may I suggest you read the documentation of GNU findutils?
No, there is no single table of block addresses of files; you need to go deeper.
First of all, the file layout depends on the filesystem type (e.g. ext2, ext3, btrfs, reiserfs, jfs, xfs, etc.). This is abstracted by the Linux kernel, which provides drivers for access to files on a lot of filesystems, and a specific partition with its filesystem is abstracted under the single Virtual File System (the single file-directory tree, which contains other devices as its subtrees).
So, basically, no: you need to use the kernel's abstract interfaces (readdir(), /proc/mounts and so on) in order to search for files, or roll your own userspace drivers (e.g. through FUSE) to examine raw block devices (/dev/sda1 etc.) if you really need to examine low-level details (this requires a lot of understanding of kernel/filesystem internals and is highly error-prone).
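For example, the mount table mentioned above is just text the kernel exposes; a minimal sketch of listing the mounted filesystems from it, assuming a Linux system with /proc mounted:

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    // Each line of /proc/mounts looks like:
    //   device mountpoint fstype options dump pass
    std::ifstream mounts("/proc/mounts");
    std::string line;
    while (std::getline(mounts, line)) {
        std::istringstream fields(line);
        std::string device, mountpoint, fstype;
        fields >> device >> mountpoint >> fstype;
        std::cout << mountpoint << " (" << fstype << ")\n";
    }
    return 0;
}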
updatedb -l 0 -o db_file -U source_directory
This will create a database of files; I hope this will help you.
No. The file system is actually structured with directories, each directory containing files and directories.
Within Linux, all of this is managed in the kernel with inodes.
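For instance, a file's inode number is already visible through the ordinary stat() interface; a tiny sketch:

#include <cstdio>
#include <sys/stat.h>

int main(int argc, char* argv[])
{
    if (argc < 2)
        return 1;

    struct stat st;
    if (stat(argv[1], &st) == 0) {
        // st_ino is the inode number; where the blocks live on disk stays the kernel's business.
        std::printf("%s -> inode %lu, size %lld bytes\n",
                    argv[1], (unsigned long)st.st_ino, (long long)st.st_size);
    }
    return 0;
}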
YES.
Conceptually, it does create a table of every file's location on the disc**. There are a lot of details which muddy this picture slightly.
However, you should usually not care. You don't want to work at that level, nor should you. There are many filesystems in Linux which all do it in a slightly (or even significantly) different way.
** Not the actual physical location. A hard drive may map the logical blocks to physical blocks in some way determined by its firmware.
Our company helps migrate client software from other languages to C++. We provide them C++ source code for their application along with header files and compiled libraries for runtime support functions. We charge for both the migration and the runtime.
Recently a potential client asked us to migrate one of a number of systems they have. This system contains 7 programs, and we would like to limit the runtime so that only these 7 programs can access it. We can time-limit the runtime by putting an encrypted expiration date in the object library, but since we have to provide the source code for the converted programs, we are having difficulty coming up with a way to limit access to a specific set of programs. Obviously, anything we put into the source code to identify the program could be copied to any other program, so the only hope seems to be having the runtime library discover some set of characteristics about the programs and then validating them against a set of characteristics embedded in the runtime library. As I understand it, C++ has very little reflection capability (RTTI is all I could find), so I wanted to ask if anyone has faced a similar problem and found a way to solve it. Thanks in advance for any suggestions.
Based on the two answers, a little clarification seems in order. We fully expect the client to modify the source code, and normally we provide them an unrestricted version of the runtime libraries. This particular client requested a version that is limited to a single system and is happy to enter into a license that restricts the use of the runtime library to that system. Therefore a discussion of the legal issues isn't relevant. The issue is a technical one: given a license that is limited to a single system, and given that the client has the source to the calling programs but not the runtime, is there a way to limit access to the runtime to the set of programs comprising that system, thus enforcing the terms of the license?
If they're not supposed to make further changes to the programs, why did you give them the source code? And if they are expected to continue changing the programs (i.e. maintenance), who decides whether a change constitutes a new program that's not allowed to use the library?
There's no technical way to enforce that licensing model.
There's possibly a legal way -- in the code that loads/enables the library, write a comment "This is a copy protection measure". Then DMCA forbids them from including that code into other programs (in the USA). But IANAL, and I don't think DMCA is valid anyway.
Consult a lawyer to find out what rights you have under the contract/bill of sale to restrict their use.
The most obvious answer I could think of is to get the name and/or path of the calling process-- simply compare this name to the 7 "allowed" programs in your support library. Certainly, they could create a new process with the same name, but they might not know to do so.
Another level could be to further compare the executable size against the known size for that application. (You'll likely want to allow a reasonably wide range around the expected size, in case they make changes to the source code, and/or compile with different options.)
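A minimal sketch of the name check from the first suggestion, assuming the runtime is loaded into a Windows process; the allowed names are obviously placeholders, and as noted this is easily fooled by renaming the executable:

#include <windows.h>
#include <cstddef>
#include <string>

// Returns true if the executable that loaded this runtime library is one of
// the programs the license covers.
bool callerIsLicensed()
{
    char path[MAX_PATH] = {0};
    // With a NULL module handle this returns the full path of the host .exe,
    // even when called from inside a DLL or static library.
    GetModuleFileNameA(NULL, path, MAX_PATH);

    std::string full(path);
    std::string::size_type slash = full.find_last_of("\\/");
    std::string exeName = (slash == std::string::npos) ? full : full.substr(slash + 1);

    // Hypothetical list standing in for the 7 licensed programs.
    static const char* const allowed[] = { "billing.exe", "payroll.exe", "reports.exe" };
    for (std::size_t i = 0; i < sizeof(allowed) / sizeof(allowed[0]); ++i) {
        if (lstrcmpiA(exeName.c_str(), allowed[i]) == 0)
            return true;
    }
    return false;
}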
As another thought, you might try adding some seemingly benign strings into the app's resources ("Copyright 2011 ~Your Corporation Name~"). You can then scan the parent executable for the magic strings. If they create a new product, they might not think to create this resource.
Finally, as already noted by Ben, if you are giving them the source code, there are likely no foolproof solutions to this problem. (As he said, at what point does "modified" code become a new application?) The best you will likely be able to do is to add enough small roadblocks that they won't bother trying to use that lib for another product. It likely depends on how determined and/or lucky they are.
Why not just technically limit the use of the runtime to one system? There are many software protection solutions out there, one that comes to my mind is SmartDongle.
Now the runtime could still be used by any other program on that machine, but I think this should be a minor concern, no?
We have common functionality that we need to share among several applications. We already have a few internal libraries into which we put common code with a well-defined interface. Sometimes, though, there are problems with some code (typically a single .cpp file or a few of them): it doesn't fit into an existing library and it is too small to make a new one.
Our current version control system supports file sharing, so usually such files are just shared between the applications that use them. I tend to consider it a bad thing, but actually, it makes it quite clear, as you can see exactly in which applications they are used.
Now we are moving to svn, which does not have "real" file sharing; there is this svn:externals stuff, but will it still be simple to track the places where the files are shared when using it?
We could create a "garbage" library (or folder) and put such files there temporarily, but it's always the same problem: it complicates dependency tracking (which projects use this file?).
Otherwise, are there other good solutions? How does it work in your company?
Why don't you just create a folder in SVN called "Shared" and put your shared files into that? You can include the shared files into your projects from there.
Update:
Seems like you are looking for a 3rd party tool that tracks dependencies.
Subversion and dependencies
You can only find out where a file is used by looking at all repositories.
General question:
For unmanaged C++, what's better for internal code sharing?
Reuse code by sharing the actual source code? OR
Reuse code by sharing the library / dynamic library (+ all the header files)
Whichever it is: what's your strategy for reducing duplicate code (copy-paste syndrome), code bloat?
Specific example:
Here's how we share the code in my organization:
We reuse code by sharing the actual source code.
We develop on Windows using VS2008, though our project actually needs to be cross-platform. We have many projects (.vcproj) committed to the repository; some might have their own repository, some might be part of a larger one. For each deliverable solution (.sln) (e.g. something that we deliver to the customer), it will svn:externals all the necessary projects (.vcproj) from the repository to assemble the "final" product.
This works fine, but I'm quite worried that eventually the code size for each solution could get quite huge (right now our total code size is about 75K SLOC).
Also, one thing to note is that we prevent all transitive dependencies. That is, each project (.vcproj) that is not an actual solution (.sln) is not allowed to svn:externals any other project, even if it depends on it. This is because you could have 2 projects (.vcproj) that might depend on the same library (e.g. Boost) or project (.vcproj), so when you svn:externals both projects into a single solution, svn:externals will do it twice. So we carefully document all dependencies for each project, and it's up to the person who creates the solution (.sln) to ensure that all dependencies (including transitive ones) are pulled in via svn:externals as part of the solution.
If we reuse code by using .lib/.dll files instead, this would obviously reduce the code size for each solution, as well as eliminate the transitive dependency issue mentioned above where applicable (exceptions are, for example, third-party libraries/frameworks that use DLLs, like Intel TBB and the default Qt build).
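To be explicit about what reuse via .lib/.dll would look like, here's a minimal sketch of the kind of boundary I have in mind; the macro and function names are made up, and keeping the surface extern "C" avoids some, though not all, of the ABI issues quoted further below:

// shared_api.h - the only file consumers need besides the .lib/.dll.
#pragma once

#ifdef SHARED_API_EXPORTS            // defined when building the DLL itself
#  define SHARED_API __declspec(dllexport)
#else
#  define SHARED_API __declspec(dllimport)
#endif

extern "C" {
    // Plain functions and C types keep the boundary compiler-agnostic;
    // exporting C++ classes ties every consumer to the same compiler/version.
    SHARED_API int  shared_checksum(const unsigned char* data, unsigned length);
    SHARED_API void shared_version(char* buffer, unsigned bufferSize);
}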
Addendum: (read if you wish)
Another motivation to share source code might be summed up best by Dr. GUI:
On top of that, what C++ makes easy is not creation of reusable binary components; rather, C++ makes it relatively easy to reuse source code. Note that most major C++ libraries are shipped in source form, not compiled form. It's all too often necessary to look at that source in order to inherit correctly from an object—and it's all too easy (and often necessary) to rely on implementation details of the original library when you reuse it. As if that isn't bad enough, it's often tempting (or necessary) to modify the original source and do a private build of the library. (How many private builds of MFC are there? The world will never know . . .)
Maybe this is why, when you look at libraries like the Intel Math Kernel Library, their "lib" folder has "vc7", "vc8", "vc9" subfolders, one for each Visual Studio version. Scary stuff.
Or how about this assertion:
C++ is notoriously non-accommodating when it comes to plugins. C++ is extremely platform-specific and compiler-specific. The C++ standard doesn't specify an Application Binary Interface (ABI), which means that C++ libraries from different compilers or even different versions of the same compiler are incompatible. Add to that the fact that C++ has no concept of dynamic loading and each platform provide its own solution (incompatible with others) and you get the picture.
What are your thoughts on the above assertion? Does something like Java or .NET face these kinds of problems? E.g., if I produce a JAR file from NetBeans, will it work if I import it into IntelliJ, as long as I ensure that both have compatible JRE/JDKs?
People seem to think that C specifies an ABI. It doesn't, and I'm not aware of any standardised compiled language that does. To answer your main question, use of libraries is of course the way to go - I can't imagine doing anything else.
One good reason to share the source code: Templates are one of C++'s best features because they are an elegant way around the rigidity of static typing, but by their nature are a source-level construct. If you focus on binary-level interfaces instead of source-level interfaces, your use of templates will be limited.
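A trivial made-up example of why: the compiler must see the template's full definition at every point of instantiation, so the template effectively has to ship as source in a header:

// clamp.h - must be shipped as source; an object file can't contain
// "clamp for every T a client might ever instantiate".
#pragma once

template <typename T>
T clamp(T value, T low, T high)
{
    if (value < low)  return low;
    if (value > high) return high;
    return value;
}

// A client instantiates clamp<int>, clamp<double>, ... from the source above:
//   int    i = clamp(42, 0, 10);
//   double d = clamp(3.7, 0.0, 1.0);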
We do the same. Trying to use binaries can be a real problem if you need to use shared code on different platforms or build environments, or even if you need different build options such as static vs. dynamic linking to the C runtime, different structure packing settings, etc.
I typically set projects up to build as much from source on-demand as possible, even with third-party code such as zlib and libpng. For those things that must be built separately, e.g. Boost, I typically have to build 4 or 8 different sets of binaries for the various combinations of settings needed (debug/release, VS7.1/VS9, static/dynamic), and manage the binaries along with the debugging information files in source control.
Of course, if everyone sharing your code is using the same tools on the same platform with the same options, then it's a different story.
I never saw shared libraries as a way to reuse code from an old project into a new one. I always thought it was more about sharing a library between different applications that you're developing at about the same time, to minimize bloat.
As far as copy-paste syndrome goes, if I copy and paste it in more than a couple places, it needs to be its own function. That's independent of whether the library is shared or not.
When we reuse code from an old project, we always bring it in as source. There's always something that needs tweaking, and it's usually safer to tweak a project-specific version than to tweak a shared version that can wind up breaking the previous project. Going back and fixing the previous project is out of the question because 1) it worked (and shipped) already, 2) it's no longer funded, and 3) the test hardware needed may no longer be available.
For example, we had a communication library that had an API for sending a "message", a block of data with a message ID, over a socket, pipe, whatever:
void Foo::Send(unsigned messageID, const void* buffer, size_t bufSize);
But in a later project, we needed an optimization: the message needed to consist of several blocks of data in different parts of memory concatenated together, and we couldn't (and didn't want to, anyway) do the pointer math to create the data in its "assembled" form in the first place, and the process of copying the parts together into a unified buffer was taking too long. So we added a new API:
void Foo::SendMultiple(unsigned messageID, const void** buffer, size_t* bufSize);
Which would assemble the buffers into a message and send it. (The base class's method allocated a temporary buffer, copied the parts together, and called Foo::Send(); subclasses could use this as a default or override it with their own, e.g. the class that sent the message on a socket would just call send() for each buffer, eliminating a lot of copies.)
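For illustration, a rough sketch of what that default might look like; the count parameter and the bare-bones class skeleton are reconstructed for the example, since the signature above doesn't show how the number of blocks is passed:

#include <cstddef>
#include <cstring>
#include <vector>

// Bare-bones reconstruction of the class for illustration only.
class Foo {
public:
    virtual ~Foo() {}
    virtual void Send(unsigned messageID, const void* buffer, size_t bufSize) = 0;
    virtual void SendMultiple(unsigned messageID, const void** buffers,
                              const size_t* bufSizes, size_t count);
};

// Default implementation: gather the scattered blocks into one temporary buffer
// and fall back to the single-buffer Send(). Subclasses (e.g. the socket-based
// one) override this to send each block directly and skip the copy.
void Foo::SendMultiple(unsigned messageID, const void** buffers,
                       const size_t* bufSizes, size_t count)
{
    size_t total = 0;
    for (size_t i = 0; i < count; ++i)
        total += bufSizes[i];

    std::vector<unsigned char> assembled(total);
    size_t offset = 0;
    for (size_t i = 0; i < count; ++i) {
        if (bufSizes[i] > 0) {
            std::memcpy(&assembled[offset], buffers[i], bufSizes[i]);
            offset += bufSizes[i];
        }
    }

    Send(messageID, assembled.empty() ? 0 : &assembled[0], assembled.size());
}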
Now, by doing this, we have the option of backporting (copying, really) the changes to the older version, but we're not required to backport. This gives the managers flexibility, based on the time and funding constraints they have.
EDIT: After reading Neil's comment, I thought of something that we do that I need to clarify.
In our code, we do lots of "libraries". LOTS of them. One big program I wrote had something like 50 of them. Because, for us and with our build setup, they're easy.
We use a tool that auto-generates makefiles on the fly, taking care of dependencies and almost everything. If there's anything strange that needs to be done, we write a file with the exceptions, usually just a few lines.
It works like this: the tool finds everything in the directory that looks like a source file, generates dependencies if the file changed, and spits out the needed rules. Then it makes a rule to take everything and ar/ranlib it into a libxxx.a file, named after the directory. All the objects and the library are put in a subdirectory that is named after the target platform (this makes cross-compilation easy to support). This process is then repeated for every subdirectory (except the object file subdirs). Then the top-level directory gets linked with all the subdirs' libraries into the executable, and a symlink is created, again named after the top-level directory.
So directories are libraries. To use a library in a program, make a symbolic link to it. Painless. Ergo, everything's partitioned into libraries from the outset. If you want a shared lib, you put a ".so" suffix on the directory name.
To pull in a library from another project, I just use a Subversion external to fetch the needed directories. The symlinks are relative, so as long as I don't leave something behind it still works. When we ship, we lock the external reference to a specific revision of the parent.
If we need to add functionality to a library, we can do one of several things. We can revise the parent (if it's still an active project and thus testable), tell Subversion to use the newer revision and fix any bugs that pop up. Or we can just clone the code, replacing the external link, if messing with the parent is too risky. Either way, it still looks like a "library" to us, but I'm not sure that it matches the spirit of a library.
We're in the process of moving to Mercurial, which has no "externals" mechanism so we have to either clone the libraries in the first place, use rsync to keep the code synced between the different repositories, or force a common directory structure so you can have hg pull from multiple parents. The last option seems to be working pretty well.