Finding duplicate files by content across multiple directories - regex

I have downloaded some files from the internet related to a particular topic. Now I wish to check whether any of the files are duplicates. The issue is that the file names may be different, but the content may match.
Is there any way to write some code that will iterate through the multiple folders and report which of the files are duplicates?

If you are working on Linux/*nix systems, you can use SHA tools such as sha512sum, now that MD5 can be broken.
find /path -type f -print0 | xargs -0 sha512sum | awk '($1 in seen){print "duplicate: "$2" and "seen[$1] }(!($1 in seen)){seen[$1]=$2}'
If you want to work in Python, here is a simple implementation:
import hashlib
import os

def sha(filename):
    '''Return the SHA-512 digest of a file's contents.'''
    d = hashlib.sha512()
    try:
        with open(filename, 'rb') as fh:
            d.update(fh.read())
    except OSError as e:
        print(e)
    else:
        return d.hexdigest()

seen = {}
path = os.path.join("/home", "path1")
for root, dirs, files in os.walk(path):
    for name in files:
        filename = os.path.join(root, name)
        digest = sha(filename)
        if digest not in seen:
            seen[digest] = filename
        else:
            print("Duplicates: %s <==> %s" % (filename, seen[digest]))
If you think that sha512sum is not enough, you can confirm matches with Unix tools such as diff, or with filecmp in Python.
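If you prefer Perl for that confirmation step, the core File::Compare module does the same byte-by-byte check; a minimal sketch with placeholder file names:
use strict;
use warnings;
use File::Compare;

# Compare two candidate duplicates byte by byte; compare() returns 0 when the
# contents are identical. The paths below are placeholders.
my ($file_a, $file_b) = ('downloads/a.pdf', 'downloads/copy_of_a.pdf');
if (compare($file_a, $file_b) == 0) {
    print "$file_a and $file_b have identical content\n";
}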

You can traverse the folders recursively, compute the MD5 of each file, and then look for duplicate MD5 values; that will find files that are duplicates content-wise. Which language do you want to implement this in?
The following Perl program does this:
use strict;
use warnings;
use File::Find;
use Digest::MD5 qw(md5);

my @directories_to_search = ('a', 'e');
my %hash;

find(\&wanted, @directories_to_search);

sub wanted {
    chdir $File::Find::dir;
    if( -f $_) {
        my $con = '';
        open my $fh, "<", $_ or die "Cannot open $_: $!";
        while(my $line = <$fh>) {
            $con .= $line;
        }
        close $fh;
        my $digest = md5($con);
        if($hash{$digest}) {
            print "Dup found: $File::Find::name and $hash{$digest}\n";
        } else {
            $hash{$digest} = $File::Find::name;
        }
    }
}

MD5 is a good way to find two identical files, but it is not sufficient to assume that two files are identical! (In practice the risk is small, but it exists.) So you also need to compare the content.
PS: Also, if you just want to compare the text content, keep in mind that the line-ending character differs between Windows ('\r\n') and Linux ('\n').
EDIT:
Reference: two different files can have the same MD5 checksum (MD5 collision vulnerability, Wikipedia):
However, now that it is easy to generate MD5 collisions, it is possible for the person who created the file to create a second file with the same checksum, so this technique cannot protect against some forms of malicious tampering. Also, in some cases the checksum cannot be trusted (for example, if it was obtained over the same channel as the downloaded file), in which case MD5 can only provide error-checking functionality: it will recognize a corrupt or incomplete download, which becomes more likely when downloading larger files.

Do a recursive search through all the files, sorting them by size; for any byte size shared by two or more files, do an MD5 or SHA-1 hash computation to see if they are in fact identical.
Regex will not help with this problem.
There are plenty of code examples on the net; I don't have time to knock out this code now. (This will probably elicit some downvotes - shrug!)
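For reference, a minimal sketch of that size-then-hash approach in Perl (the SHA-1 choice and the command-line handling are just for illustration):
#!/usr/bin/env perl
use strict;
use warnings;
use File::Find;
use Digest::SHA;

my @dirs = @ARGV ? @ARGV : ('.');   # directories to scan

# Pass 1: group files by size; only files of equal size can be identical.
my %by_size;
find({ no_chdir => 1,
       wanted   => sub { push @{ $by_size{ -s $_ } }, $_ if -f $_ } }, @dirs);

# Pass 2: hash only the groups that contain more than one file.
for my $files (values %by_size) {
    next unless @$files > 1;
    my %by_digest;
    for my $file (@$files) {
        my $digest = Digest::SHA->new(1)->addfile($file, 'b')->hexdigest;   # SHA-1
        if (exists $by_digest{$digest}) {
            print "Duplicate: $file and $by_digest{$digest}\n";
        } else {
            $by_digest{$digest} = $file;
        }
    }
}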

Related

Extract unique lines from files (with a pattern) recursively from directory/subdirectories

I have a huge Java codebase (more than 10,000 Java classes) that makes extensive use of CORBA (though no documentation on its usage is available).
As a first step to figure out the CORBA usage, I decided to scan the entire codebase and extract/print the unique lines which contain the pattern "org.omg.CORBA". These are usually in the import statements (e.g. import org.omg.CORBA.x.y.z).
I am a newbie to Perl and want to know whether there is a way I can extract these details on Windows. I need to be able to scan all folders (and sub-folders) that contain Java classes.
You can use File::Find in a one-liner:
perl -MFile::Find -lwe "
find(sub { if (-f && /\.java$/) { push @ARGV,$File::Find::name } },'.');
while(<>) { /org.omg.CORBA/ && $seen{$_}++; };
print for keys %seen;"
Note that this one-liner is using the double quotes required for Windows.
This will search the current directory recursively for files with extension .java and add them to the @ARGV array. Then we use the diamond operator to open the files and search for the string org.omg.CORBA, and if it is found, that line is added as a key to the %seen hash, which will effectively remove duplicates. The last statement prints out all the unique keys in the hash.
In script form it looks like this:
use strict;
use warnings;
use File::Find;

find(sub { if (-f && /\.java$/) { push @ARGV, $File::Find::name } }, '.');

my %seen;
while(<>) {
    /org.omg.CORBA/ && $seen{$_}++;
}
print "$_\n" for keys %seen;
Just for fun, a Perl one-liner to do this:
perl -lne '/org.omg.CORBA/ and (++$seen{$_}>1 or print)' *
This first checks whether a line matches and then, if the line has not been seen before, prints it out. That is done for all files specified (in this case '*').
I don't mean to be contrarian, but I'm not sure Perl is the best solution here. nhahtdh's suggestion of using Cygwin is a good one; grep or find is really what you want. Using Perl in this instance will involve using File::Find and then opening a filehandle on every file. That's certainly doable, but, if possible, I'd suggest using the right tool for the job.
find . -name "*.java" -type f | xargs grep -l 'org.omg.CORBA' | sort | uniq
If you really must use Perl for this job, we can work up the File::Find code.

Reorganizing large amount of files with regex?

I have a large number of files organized in a hierarchy of folders with particular file name notations and extensions. What I need to do is write a program to walk through the tree of files and basically rename and reorganize them. I also need to generate a report of the changes and information about the transformed organization, along with statistics.
The solution that I can see is to walk through the tree of files just like any other tree data structure and use regular expressions on the path names of the files. This seems very doable and not a huge amount of work. My questions are: are there tools I should be using other than just C# and regex? Perl comes to mind, since I know it was originally designed for report generation, but I have no experience with the language. And also, is using regex viable for this situation, given that I have only used it for file CONTENTS, not file names and organization?
Yes, Perl can do this. Here's something pretty simple:
#! /usr/bin/env perl
use strict;
use warnings;
use File::Find;

my $directory = ".";   # Or whatever directory tree you're looking for...

find (\&wanted, $directory);

sub wanted {
    print "Full File Name = <$File::Find::name>\n";
    print "Directory Name = <$File::Find::dir>\n";
    print "Basename = <$_>\n";

    # Using tests to see various things about the file
    if (-f $File::Find::name) {
        print "File <$File::Find::name> is a file\n";
    }
    if (-d $File::Find::name) {
        print "Directory <$File::Find::name> is a directory\n";
    }

    # Using regular expressions on the file name
    if ($File::Find::name =~ /beans/) {   # Using regular expressions on file names
        print "The file <$File::Find::name> contains the string <beans>\n";
    }
}
The find command takes the directory, and calls the wanted subroutine for each file and directory in the entire directory tree. It is up to that subroutine to figure out what to do with that file.
As you can see, you can do various tests on the file, and use regular expressions to parse the file's name. You can also move, rename, or delete the file to your heart's content.
Perl will do exactly what you want. Now, all you have to do is learn it.
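For example, here is a minimal sketch of a rename driven by a regular expression on the file name (the pattern and the new naming scheme are invented for illustration):
#!/usr/bin/env perl
use strict;
use warnings;
use File::Find;

# Hypothetical example: rename files like "report_2021.txt" to "2021-report.txt"
# anywhere under the current directory. Adjust the regex to your own notation.
find(sub {
    return unless -f $_;
    if ($_ =~ /^report_(\d{4})\.txt$/) {
        my $new_name = "$1-report.txt";
        # File::Find chdirs into each directory, so $_ is the bare file name here.
        if (rename($_, $new_name)) {
            print "Renamed $File::Find::name to $File::Find::dir/$new_name\n";
        } else {
            warn "Could not rename $File::Find::name: $!";
        }
    }
}, ".");
Printing (or logging) each rename as it happens also gives you the raw material for the change report and statistics you mentioned.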
If you can live with glob patterns instead of regular expressions, mmv might be an option.
> ls
a1.txt a2.txt b34.txt
> mmv -v "?*.txt" "#2 - #1.txt"
a1.txt -> 1 - a.txt : done
a2.txt -> 2 - a.txt : done
b34.txt -> 34 - b.txt : done
Directories at any depth can be reorganized, too. Check out the manual. If you run Windows, you can find the tool in Cygwin.

extract audio from certain files in working dir in perl

Basically, what I'm trying to do is extract the audio from a set of downloaded YouTube videos, the names of which are (partially) identified in a file (mus.txt) that was opened with the handle TXTFILELIST. TXTFILELIST contains one 11-character identifier for the video on each line (for example, "dQw4w9WgXcQ") and the downloaded file is of the form [title]-[ID].mp4 (in the previous example, "Rick Astley - Never Gonna Give You Up-dQw4w9WgXcQ.mp4").
#snip...
if ($opt_extract_audio) {
    open(TXTFILELIST, "<", "mus.txt") or die $!;
    my @all_dir_files = `dir /b`;
    my $file_to_convert;
    foreach $file_to_convert (<TXTFILELIST>) {
        my @files = grep("/${file_to_convert}\.mp4$/", @all_dir_files); #the problem line!
        print "files: @files\n";
        foreach $file (@files) {
            system("ffmpeg.exe -i ${file} -vn -y -acodec pcm_s16le -ac 2 ${file}.wav");
        }
    }
#snip...
The rest of the snipped code works (I checked it with several videos, replacing vars, commenting, etc.), is legal (I used the strict and warnings pragmas) and, I believe, is irrelevant, because it has nothing to do with defining any vars (besides $opt_extract_audio) used in this snippet. However, this is the one bit of code that's giving me trouble; I can't seem to extract the files that are identified in TXTFILELIST from #all_dir_files. I got the code for 'the problem line' from other Stack Overflow answerers, but it isn't working for some reason.
TL;DR What I want to do is this: list all files in the current dir (say the directory contains mus.txt, "Rick Astley - Never Gonna Give You Up-dQw4w9WgXcQ.mp4", and blah.mp4), choose only the identified file(s) (the Rick Astley video) using the 11-char ID in TXTFILELIST (dQw4w9WgXcQ) and extract the audio from it. And yes, I am running this script on Windows, so I can't use *nix utilities like ack or find.
Remove the line
my @all_dir_files = `dir /b`;
And use this loop instead:
for my $file (<*${file_to_convert}.mp4>) {
    say $file;    # say needs "use feature 'say';" or "use v5.10;"
    system(...);
}
The <...> above is a glob; it can also be written glob "*${file_to_convert}.mp4". I think it is almost always better to use Perl functions rather than rely on system calls.
As has been pointed out, "/${file...$/" is not a regex, but a string. And since you can use expressions with grep, and a non-empty string is always true, your grep will essentially do nothing and pass all the values into your array.
Get rid of the double quotes around the regular expression in the grep function.
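For illustration, if you keep the @all_dir_files approach, the corrected line might look something like this (a sketch; the chomp calls strip the trailing newlines left by the file read and by `dir /b`, and \Q...\E protects any regex metacharacters in the ID):
chomp $file_to_convert;    # remove the newline read from mus.txt
chomp @all_dir_files;      # remove the newlines from the `dir /b` output
my @files = grep { /\Q$file_to_convert\E\.mp4$/ } @all_dir_files;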

How to regex search/replace with File::Map in a largish text file to avoid an "Out of Memory" error?

UPDATE 2: Solved. See below.
I am in the process of converting a big txt-file from an old DOS-based library program into a more usable format. I just started out in Perl and managed to put together a script such as this one:
BEGIN { undef $/; }
open $in, '<', "orig.txt" or die "Can't read old file: $!";
open $out, '>', "mod.txt" or die "Can't write new file: $!";
while( <$in> )
{
    $C=s/foo/bar/gm;
    print "$C matches replaced.\n";
    # etc...
    print $out $_;
}
close $out;
It is quite fast, but after some time I always get an "Out of Memory" error due to a lack of RAM/swap space (I'm on Win XP with 2 GB of RAM and a 1.5 GB swap file).
After having looked around a bit on how to deal with big files, File::Map seemed to me as a good way to avoid this problem. I'm having trouble implementing it, though.
This is what I have for now:
#!perl -w
use strict;
use warnings;
use File::Map qw(map_file);
my $out = 'output.txt';
map_file my $map, 'input.txt', '<';
$map =~ s/foo/bar/gm;
print $out $map;
However I get the following error: Modification of a read-only value attempted at gott.pl line 8.
Also, I read on the File::Map help page, that on non-Unix systems I need to use binmode. How do I do that?
Basically, what I want to do is to "load" the file via File::Map and then run code like the following:
$C=s/foo/bar/gm;
print "$C matches found and replaced.\n"
$C=s/goo/far/gm;
print "$C matches found and replaced.\n"
while(m/complex_condition/gm)
{
    $C=s/complex/regex/gm;
    $run_counter++;
}
print "$C matches replaced. Script looped $run_counter times.\n";
etc...
I hope that I didn't overlook something too obvious but the example given on the File::Map help page only shows how to read from a mapped file, correct?
EDIT:
In order to better illustrate what I currently can't accomplish due to running out of memory I'll give you an example:
On http://pastebin.com/6Ehnx6xA is a sample of one of our exported library records (txt-format). I'm interested in the +Deskriptoren: part starting on line 46. These are thematic classifiers which are organised in a tree hierarchy.
What I want is to expand each classifier with its complete chain of parent nodes, but only if none of the parent nodes is already present before or after the child node in question. This means turning
+Deskriptoren
-foo
-Cultural Revolution
-bar
into
+Deskriptoren
-foo
-History
-Modern History
-PRC
-Cultural Revolution
-bar
The currently used regex makes use of lookbehind and lookahead in order to avoid duplicates and is thus slightly more complicated than s/foo/bar/g;:
s/(?<=\+Deskriptoren:\n)((?:-(?!\QParent-Node\E).+\n)*)-(Child-Node_1|Child-Node_2|...|Child-Node_11)\n((?:-(?!Parent-Node).+\n)*)/${1}-Parent-Node\n-${2}\n${3}/g;
But it works! Until Perl runs out of memory that is... :/
So in essence I need a way to do manipulations on a large file (80MB) over several lines. Processing time is not an issue. This is why I thought of File::Map.
Another option could be to process the file in several steps with linked perl-scripts calling each other and then terminating, but I'd like to keep it as much in one place as possible.
UPDATE 2:
I managed to get it working with Schwelm's code below. My script now calls the following subroutine which calls two nested subroutines. Example code is at: http://pastebin.com/SQd2f8ZZ
Still not quite satisfied that I couldn't get File::Map to work. Oh well... I guess that the line-approach is more efficient anyway.
Thanks everyone!
When you set $/ (the input record separator) to undefined, you are "slurping" the file -- reading the entire content of the file at once (this is discussed in perlvar, for example). Hence the out-of-memory problem.
Instead, process it one line at a time, if you can:
while (my $line = <$in>){
    # Do stuff.
}
In situations where the file is small enough and you do slurp the file, there is no need for the while loop. The first read gets everything:
{
    local $/ = undef;
    my $file_content = <>;
    # Do stuff with the complete file.
}
Update
After seeing your massive regex, I would urge you to reconsider your strategy. Tackle this as a parsing problem: process the file one line at a time, storing information about the parser's state as needed. This approach allows you to work with the information using simple, easily understood (even testable) steps.
Your current strategy -- one might call it the slurp and whack with massive regex strategy -- is difficult to understand and maintain (in 3 months, will your regex make immediate sense to you?), difficult to test and debug, and difficult to adjust if you discover unanticipated deviations from your initial understanding of the data. In addition, as you've discovered, the strategy is vulnerable to memory limitations (because of the need to slurp the file).
There are many questions on StackOverflow illustrating how one can parse text when the meaningful units span multiple lines. Also see this question, where I delivered similar advice to another questioner.
Some simple parsing can break the file down into manageable chunks. The algorithm is:
1. Read until you see `+Deskriptoren:`
2. Read everything after that until the next `+Foo:` line
3. Munge that bit.
4. Goto 1.
Here's the sketch of the code:
use strict;
use warnings;
use autodie;

my ($input_file, $output_file) = ('input.txt', 'output.txt');   # adjust to your file names

open my $in,  '<', $input_file;
open my $out, '>', $output_file;

while(my $line = <$in>) {
    # Print out everything you don't modify,
    # including the +Deskriptoren line itself.
    print $out $line;

    # When the start of a description block is seen, slurp in up to
    # the next section.
    if( $line =~ m{^ \+Deskriptoren: }x ) {
        my($section, $next_line) = _read_to_next_section($in);

        # Print the modified description
        print $out _munge_description($section);

        # And the following header line.
        print $out $next_line;
    }
}

sub _read_to_next_section {
    my $in = shift;

    my $section = '';
    my $line;
    while( $line = <$in> ) {
        last if $line =~ /^ \+ /x;
        $section .= $line;
    }

    # When reading the last section, there might not be a next line,
    # resulting in $line being undefined.
    $line = '' if !defined $line;

    return($section, $line);
}

# Note, the +Deskriptoren line is not in $description
sub _munge_description {
    my $description = shift;

    # ...whatever you want to do to the description...

    return $description;
}
I haven't tested it, but something like that should do you. It has the advantage over dealing with the whole file as a string (File::Map or otherwise) that you can deal with each section individually rather than trying to cover every base in one regex. It will also let you develop a more sophisticated parser to deal with things like comments and strings that might mess up the simple parsing above and would be a huge pain to adapt a massive regex to.
You are using mode <, which is read-only. If you want to modify the contents, you need read-write access, so you should be using +<.
If you are on Windows and need binary mode, then you should open the file separately, set binary mode on the file handle, and then map from that handle.
I also noticed that you have an input file and an output file. If you use File::Map, you are changing the file in-place... that is, you can't open the file for reading and change the contents of a different file. You would need to copy the file, then modify the copy. I've done so below.
use strict;
use warnings;
use File::Map qw(map_handle unmap);
use File::Copy;

copy("input.txt", "output.txt") or die "Cannot copy input.txt to output.txt: $!\n";

open my $fh, '+<', "output.txt"
    or die "Cannot open output.txt in r/w mode: $!\n";
binmode($fh);
map_handle my $contents, $fh, '+<';
my $n_changes = ( $contents =~ s/from/to/gm );
unmap($contents);
close($fh);
The documentation for File::Map isn't very good on how errors are signaled, but from the source, it looks as if $contents being undefined would be a good guess.

svn rename problem

Our code is C++ and is managed in svn. The development is with Visual Studio. As you know, Visual Studio C++ is case-insensitive to file names, and our code unfortunately "exploited" this heavily.
Now we are porting our application to Linux + gcc, which is case sensitive. This will involve a lot of file renames and file changes. We planned to do the development in a separate branch.
It appears that svn rename has a well known problem (here and here).
Is there a way to work around it? Can git-svn or svnmerge aid here?
Thanks
Dima
The case sensitivity issue is not about Visual Studio vs. GCC, but rather about the filesystem: the standard filesystems on Windows and Mac OS X (FAT32 and NTFS for Windows, HFS+ for Mac OS X) are case insensitive but case preserving, while Linux filesystems (ext2, ext3, and ext4) are case sensitive. I would suggest that you rename your files, using all lower case for all your source files, and then branch, and -- of course -- for the future, have a strict policy of using lower case and a ".cpp" extension for all C++ source files and ".h" for all header files. Is there any reason you cannot perform this renaming prior to the branch?
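For the renaming step itself, a rough sketch of a bulk lowercase rename with File::Find (in a Subversion working copy you would call `svn rename` instead of Perl's rename so that the moves are recorded):
#!/usr/bin/env perl
use strict;
use warnings;
use File::Find;

# Walk the source tree and lowercase the names of C++ source and header files.
find(sub {
    return unless -f $_ && /\.(?:cpp|h)$/i;
    my $lower = lc $_;
    return if $lower eq $_;    # already all lowercase
    # Inside an svn working copy, prefer: system('svn', 'rename', $_, $lower)
    rename($_, $lower)
        or warn "Could not rename $File::Find::name: $!";
}, '.');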
Git itself deals (very well) with the problem of renamed files in merges (and not only there) by doing heuristic rename detection based on file-content and filename similarity. It doesn't require rename information to be recorded explicitly, as a rename-tracking solution would.
There are two questions here. One is the svn limitation on renames and merges; in my opinion, once one has decided to go with svn for a project, it is not advisable to switch version control software in the middle. I'd talk to the other developers and arrange cycles of locking the whole project and doing the renames.
In my case I solved the case-sensitivity problems of the header files with a simple Perl script: it fixes carriage returns and lowercases the includes.
The commented-out part is what fixes the includes.
#!/usr/bin/perl
use strict;
use warnings;
#
use File::Find;
use File::Copy;

sub wanted
{
    if( m/\.c$/i || m/\.h$/i ) {
        my $orig = $_;
        my $bak = $orig.".bak";
        my $dst = $orig;
        system("fromdos",$orig) == 0 or die "fromdos: $?";
        # open(FH,'<',$orig) or die "open $orig: $!";
        # my @lines;
        # while(my $line = <FH>) {
        #     if( $line =~ m/(^#include\s+")([^"]+)(".*)$/ ) {
        #         print $line;
        #         my $inc = $2;
        #         $inc =~ tr/A-Z/a-z/;
        #         print "change to:\n";
        #         print $1.$inc.$3."\n";
        #         print "\n";
        #         push @lines, $1 . $inc . $3."\n";
        #     } else {
        #         push @lines,$line;
        #     }
        # }
        # close(FH);
        # #move($orig,$bak) or die "move $orig to $bak: $!";
        # unlink($orig);
        # open(FH, '>', $dst) or die "open $dst: $!";
        # print FH @lines;
        # close(FH);
    }
}

find(\&wanted, ".");
As others said, the original problem has nothing to do with SCMs, really. As for using git, you could do the merge in git-svn and push it back to the SVN repo - just be aware ahead of time that this is a one-time option, i.e., don't expect SVN to realize that this commit was a merge or even that files were renamed - you'll lose file history unless you're really careful.
As a side note on the "really careful" option: the only way to make git-svn push correct "file rename" info to svn that seems to work reliably is to rename the files in git-svn without changing any contents, commit, and then modify whatever files you want and do another commit. If you modify the renamed file before committing, git-svn knows that the file has likely been moved, but apparently does not trust its own heuristic enough to push this info back to svn. It's quite possible that I'm missing some magical option that makes this work better :)