How do you effectively compare 15000 files multiple times? - compare

I am comparing two almost identical folders which include hidden .svn folders which should be ignored and I want to continually quickly compare the folders as some files are patched to compared the difference without checking the unchanged matching files again.
edit:
Because there are so many options I'm interested in a solution that clearly exploits the knowledge from the previous compare because any other solution is not really feasable when doing repeated comparisons.

If you are willing to spend a bit of money, Beyond Compare is a pretty powerful diffing tool that can do folder based diffing.
Beyond Compare

I personally use WinMerge and find it very useful. It has filters that exclude svn file. Under linux i prefer Meld.

One option would be to use rsync. Something like:
rsync -n -r -v -C dir_a dir_b
The -n option does a dry-run so no files will be modified. -r does a recursive comparison. Optionally turn on verbose mode with -v. (You could use -i to itemize the changes instead of -v.) To ignore commonly ignored files such as .svn/ use -C.
This should be faster than a simple diff as I read the rsync manpage:
Rsync finds files that need to be transferred using a "quick check"
algorithm (by default) that looks for files that have changed in size
or in last-modified time. Any changes in the other preserved
attributes (as requested by options) are made on the destination file
directly when the quick check indicates that the file's data does not
need to be updated.
Since the "quick check" algorithm does not look at file contents directly, it might be fooled. In that case, the -c option, which performs a checksum instead, may be needed. It is likely to be faster than an ordinary diff.
In addition, if you plan on syncing the directories at some point, this is a good tool for that job as well.

Not foolproof, but you could just compare the timestamps.

Use total commander ! All the cool developers use it :)

If you are on linux or some variant, you should be able to do:
prompt$ diff -r dir1 dir2 --exclude=.svn
The -r forces recursive lookups. There are a bunch of switches to ignore stuff like whitespace etc.

Related

How to have ag ignore files with certain extensions (e.g. *.dtd) normally looked at for a given option such as --xml?

Ag works fine most of the time. But sometimes, I want to do a search in, say, all XML files by using --xml, but not any .dtd files. Unfortunately, ag looks at .dtd files when asked to examine --xml.
I tried as a wild guess ag --no-dtd. Doesn't work. The --ignore options aren't for this.
I know there's a ~/.agignore file, but being absent minded and often interrupted, I don't want to bother edit that for just one or two searches and have to remember to put it back. I'm looking for a one-off easy way to avoid a certian file type.
Other examples could be wanting to look at C++ files but wanting to see only .c, .cpp, .cxx and such but not .h or .hpp, for just one time.
Two approaches.
Use the file extension (-G) to limit the scope of what ag looks at. For example, if you just want to search xml files: ag -G '\.xml$' search_token
Roll your own new file type.
Download the source and modify the /src/lang.c file to include your own file types, then (re)compile. Instructions to download and build ag are kept here: https://gist.github.com/k-takata/5124445; sounds like a hassle, but it's easy and fast to do.
Note: BTW, .agignore has been renamed to just .ignore.

a program to monitor a directory on Linux

There is a directory where a buddy adds new builds of a product.
The listing looks like this
$ ls path-to-dir/
01
02
03
04
$
where the numbers listed are not files but names of directories containing the builds.
I have to manually go and check every time whether there is a new build or not. I am looking for a way to automate this, so that the program can send an email to some people (including me) whenever path-to-dir/ is updated.
Do we have an already existing utility or a Perl library that does this?
inotify.h does something similar, but it is not supported on my kernel (2.6.9).
I think there can be an easy way in Perl.
Do you think this will work?
Keep running a loop in Perl that does a ls path-to-dir/ after, say, every 5 minutes and stores the results in an array. If it finds that the new results are different from the old results, it sends out an email using Mail or Email.
If you're going for perl, I'm sure the excellent File::ChangeNotify module will be extremely helpful to you. It can use inotify, if available, but also all sorts of other file-watching mechanisms provided by different platforms. Also, as a fallback, it has its own watching implementation, which works on every platform, but is less efficient than the specialized ones.
Checking for different ls output would send a message even when something is deleted or renamed in the directory. You could instead look for files with an mtime newer than the last message sent.
Here's an example in bash, you can run it every 5 minutes:
now=`date +%Y%m%d%H%M.%S`
if [ ! -f "/path/to/cache/file" ] || [ -n "`find /path/to/build/dir -type f -newer /path/to/cache/file`" ]
then
touch /path/to/cache/file -t "$now"
sendmail -t <<< "
To: aaa#bbb.ccc
To: xxx#yyy.zzz
Subject: New files found
Dear friend,
I have found a couple of new files.
"
fi
Can't it be a simple shell script?
while :;do
n = 'ls -al path-to-dir | wc -l'
if n -gt old_n
# your Mail code here; set old_n=n also
fi
sleep 5
done
Yes, a loop in Perl as described would do the trick.
You could keep a track of when the directory was last modified; if it hasn't changed, there isn't a new build. If it has changed, an old build might have been deleted or a new one added. You probably don't want to send alerts when old builds are removed; it is crucial that the email is sent when new builds are added.
However, I think that msw has the right idea; the build should notify when it has completed the copy out to the new directory. It should be a script that can be changed to notify the correct list of people - rather than a hard-wired list of names in the makefile or whatever other build control system you use.
you could use dnotify it is the predecessor of inotify and should be available on your kernel. It is still supported by newer kernels.

ctags best practicies

I'm working on +1M LOC C/C++ project on Solaris (remote, via VNC or SSH). I have a daily updated copy of source code on my local machine too (Windows, just for browsing code).
I use VIM and ctags combo (on both Solaris and Windows) but I'm not happy with results / speed. What settings for ctags would you recommend? There are a lot of options what should be tagged and how. Should I use single tag file per project, per dir or perhaps just one for everything?
Using anything less than one for everything doesn't really make sense to me. Being able to quickly jump around your project is what tags are for in the first place. For instance, our code is divided into 3 main sections, Include/, Processes/, Libraries/. Without being able to jump between these I would be incredibly unproductive.
Personally I use cscope (its C++ parsing isn't great, but its ok, and its VIM integration is better than just ctags), but when I do use ctags I usually just add --c++-kinds=+p.
I use etags:
find src1 src2 src3 | grep -v "\\.svn" | xargs etags --append
In emacs, position cursor on identifier and press M-. ([alt] + [period], or [esc] followed by [period]).
I don't know how it compares to your setup as far as speed goes, or if you're willing to use emacs. I'm just posting in case you want to try some alternatives.

C++ Directory Restructuring

I have a source code of about 500 files in about 10 directories. I need to refactor the directory structure - this includes changing the directory hierarchy or renaming some directories.
I am using svn version control. There are two ways to refactor: one preserving svn history (using svn move command) and the other without preserving. I think refactoring preserving svn history is a lot easier using eclipse CDT and SVN plugin (visual studio does not fit at all for directory restructuring).
But right now since the code is not released, we have the option to not preserve history.
Still there remains the task of changing the include directives of header files wherever they are included. I am thinking of writing a small script using python - receives a map from current filename to new filename, and makes the rename wherever needed (using something like sed). Has anyone done this kind of directory refactoring? Do you know of good related tools?
If you're having to rewrite the #includes to do this, you did it wrong. Change all your #includes to use a very simple directory structure, at mot two levels deep and only using a second level to organize around architecture or OS dependencies (like sys/types.h).
Then change your make files to use -I include paths.
Voila. You'll never have to hack the code again for this, and compiles will blow up instantly if something goes wrong.
As far as the history part, I personally find it easier to make a clean start when doing this sort of thing; archive the old one, make a new repository v2, go from there. The counterargument is when there is a whole lot of history of changes, or lots of open issues against the existing code.
Oh, and you do have good tests, and you're not doing this with a release coming right up, right?
I would preserve the history, even if it takes a small amount of extra time. There's a lot of value in being able to read through commit logs and understand why function X is written in a weird way, or that this really is an off-by-one error because it was written by Oliver, who always gets that wrong.
The argument against preserving the history can be made for the following users:
your code might have embarrassing things, like profanity and fighting among developers
you don't care about the commit history of your code, because it's not going to change or be maintained in the future
I did some directory refactoring like this last year on our code base. If your code is reasonable structured at the beginning, you can do about 75-90% of the work using scripts written in your language of choice (I used Perl). In my case, we were moving from set of files all in one big directory, to a series of nested directories depending on namespaces. So, a file that declared the class protocols::serialization::SerializerBase was located in src/protocols/serialization/SerializerBase. The mapping from the old name to the new name was trivial, so that doing a find and replace on #includes in every source file in the tree was trivial, although it was a big change. There were a couple of weird edge cases that we had to fix by hand, but that seemed a lot better than either having to do everything by hand or having to write our own C++ parser.
Hacking up a shell script to do the svn moves is trivial. In tcsh it's foreach F ( $FILES ) ... end to adjust a set of files. Perl & Python offer better utility.
It really is worth saving the history. Especially when trying to track down some exotic bug. Those who do not learn from history are doomed to repeat it, or some such junk...
As for altering all the files... There was a similar question just the other day over at:
https://stackoverflow.com/questions/573430/
c-include-header-path-change-windows-to-linux/573531#573531

Resetting detection of source file changes

Sometimes I have to work on code that moves the computer clock forward. In this case some .cpp or .h files get their latest modification date set to the future time.
Later on, when my clock is fixed, and I compile my sources, system rebuilds most of the project because some of the latest modification dates are in the future. Each subsequent recompile has the same problem.
Solution that I know are:
a) Find the file that has the future time and re-save it. This method is not ideal because the project is very big and it takes time even for windows advanced search to find the files that are changed.
b) Delete the whole project and re-check it out from svn.
Does anyone know how I can get around this problem?
Is there perhaps a setting in visual studio that will allow me to tell the compiler to use the archive bit instead of the last modification date to detect source file changes?
Or perhaps there is a recursive modification date reset tool that can be used in this situation?
I would recommend using a virtual machine where you can mess with the clock to your heart's content and it won't affect your development machine. Two free ones are Virtual PC from Microsoft and VirtualBox from Sun.
If this was my problem, I'd look for ways to avoid mucking with the system time. Isolating the code under unit tests, or a virtual machine, or something.
However, because I love PowerShell:
Get-ChildItem -r . |
? { $_.LastWriteTime -gt ([DateTime]::Now) } |
Set-ItemProperty -Name "LastWriteTime" -Value ([DateTime]::Now)
I don't know if this works in your situation but how about you don't move your clock forward, but wrap your gettime method (or whatever you're using) and make it return the future time that you need?
Install Unix Utils
touch temp
find . -newer temp -exec touch {} ;
rm temp
Make sure to use the full path when calling find or it will probably use Windows' find.exe instead. This is untested in the Windows shell -- you might need to modify the syntax a bit.
I don't use windows - but surely there is something like awk or grep that you can use to find the "future" timestamped files, and then "touch" them so they have the right time - even a perl script.
1) Use a build system that doesn't use timestamps to detect modifications, like scons
2) Use ccache to speed up your build system that does use timestamps (and rebuild all).
In either case it is using md5sum's to verify that a file has been modified, not timestamps.