Our code is C++ and is managed in SVN. Development is done with Visual Studio. As you know, Visual Studio C++ is case insensitive to file names, and our code unfortunately "exploited" this heavily.
Now we are porting our application to Linux + GCC, which is case sensitive. This will involve a lot of file renames and file changes. We planned to do the development in a separate branch.
It appears that svn rename has a well-known problem (here and here).
Is there a way to work around it? Can git-svn or svnmerge help here?
Thanks
Dima
The case sensitivity issue is not about Visual Studio vs. GCC, but about the filesystem: the standard filesystems on Windows and Mac OS X (FAT32 and NTFS for Windows, HFS+ for Mac OS X) are case insensitive but case preserving, while Linux filesystems (ext2, ext3, and ext4) are case sensitive. I would suggest that you rename your files, using all lower case for all your source files, and then branch; and, of course, for the future, have a strict policy of using lower case and a ".cpp" extension for all C++ source files and ".h" for all header files. Is there any reason you cannot perform this renaming prior to the branch?
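If you do go that route, a rough sketch of a one-off rename pass might look like the following. It is only an illustration, not tested against a real repository: the extension list, the temporary-name scheme (needed because case-only renames can collide on a case-insensitive filesystem), and the lowercase-only policy are assumptions you would adapt.

#!/usr/bin/perl
# Hedged sketch: batch-lowercase tracked source files with "svn rename".
# Assumes a clean working copy; run from the working-copy root.
use strict;
use warnings;
use File::Find;

find(sub {
    # Do not descend into SVN metadata directories.
    if (-d $_ && $_ eq '.svn') { $File::Find::prune = 1; return; }
    return unless -f $_ && /\.(?:c|cpp|h)$/i;   # extension list is an assumption
    my $lower = lc $_;
    return if $lower eq $_;                     # already lower case
    # Case-only renames can collide on NTFS/FAT32, so go through a temp name.
    system('svn', 'rename', $_, "$_.tmp_rename") == 0 or die "svn rename $_: $?";
    system('svn', 'rename', "$_.tmp_rename", $lower) == 0 or die "svn rename $_: $?";
}, '.');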
Git itself deals very well with the problem of renamed files in merges (and not only there) by doing heuristic rename detection based on file-content and filename similarity. It does not require rename information to be recorded explicitly, the way a rename-tracking solution does.
There are two issues here. One is the SVN limitation on renames and merges; in my opinion, once you have decided to use SVN for a project, it is not advisable to switch version control software in the middle. I would talk to the other developers and arrange cycles of locking the whole project and doing the renames.
In my case I solved the case-sensitivity problems of the header files with a simple Perl script: it fixes carriage returns and lowercases the #include lines.
The commented-out part fixes the includes.
#!/usr/bin/perl
use strict;
use warnings;
#
use File::Find;
use File::Copy;

sub wanted
{
    if( m/\.c$/i || m/\.h$/i ) {
        my $orig = $_;
        my $bak  = $orig . ".bak";
        my $dst  = $orig;
        system("fromdos", $orig) == 0 or die "fromdos: $?";
#        open(FH, '<', $orig) or die "open $orig: $!";
#        my @lines;
#        while(my $line = <FH>) {
#            if( $line =~ m/(^#include\s+")([^"]+)(".*)$/ ) {
#                print $line;
#                my $inc = $2;
#                $inc =~ tr/A-Z/a-z/;
#                print "change to:\n";
#                print $1.$inc.$3."\n";
#                print "\n";
#                push @lines, $1 . $inc . $3 . "\n";
#            } else {
#                push @lines, $line;
#            }
#        }
#        close(FH);
#        #move($orig, $bak) or die "move $orig to $bak: $!";
#        unlink($orig);
#        open(FH, '>', $dst) or die "open $dst: $!";
#        print FH @lines;
#        close(FH);
    }
}

find(\&wanted, ".");
As others said, the original problem has nothing to do with SCMs, really. As for using git, you could do the merge in git-svn and push it back to the SVN repo - just be aware ahead of time that this is a one-time option, i.e., don't expect SVN to realize that this commit was a merge or even that files were renamed - you'll lose file history unless you're really careful.
As a side note on the "really careful" option: the only way I have found to make git-svn push correct "file rename" information to SVN reliably is to rename the files in git-svn without changing any contents, commit, and only then modify whatever files you want and do another commit. If you modify a renamed file before committing, git-svn knows that the file has likely been moved but apparently does not trust its own heuristic enough to push this information back to SVN. It's quite possible that I'm missing some magical option that makes this work better :)
I need to wrap all double-quoted strings in all .cpp and .h files in a directory with a macro _T().
All files are Unicode.
Can anyone help me write a Perl or Bash script for this?
I know that Perl should be great at this. I only know a bit of Bash and cannot make it work fully automatically.
Right now I use ^(?!#)(.*)(".*?") and $1_T($2) in Sublime Text 2, but I don't know why only part of the strings get replaced (some strings get _T() added while some do not). Also, files like readme.txt, *.poj, etc. should not be touched.
And to avoid repeated replacement: I know that \b is a word boundary, but ^(?!#).*\b(!_T\(")(".*?")\b does not seem to work.
This is a harder problem than you realize; however, a quick solution could be the following:
use strict;
use warnings;

for my $file (glob("*.cpp"), glob("*.h")) {
    local @ARGV = $file;
    local $^I   = '.bak';

    die "Backup already exists for $file" if -e "$file$^I";

    while (<>) {
        if (! /^#include/) {
            s/("(?:(?>[^"\\]+)|\\.)*")/_T($1)/g;
        }
        print;
    }

    # unlink "$file$^I";  # If you want to delete the backup
}
I would of course change the for loop to a single file during testing:
for my $file ("single_file_to_test.cpp") {
And you can uncomment the unlink command if you'd like to delete the backups that are saved with a .bak extension.
Basically, what I'm trying to do is extract the audio from a set of downloaded YouTube videos, the names of which are (partially) identified in a file (mus.txt) that was opened with the handle TXTFILELIST. TXTFILELIST contains one 11-character identifier for the video on each line (for example, "dQw4w9WgXcQ") and the downloaded file is of the form [title]-[ID].mp4 (in the previous example, "Rick Astley - Never Gonna Give You Up-dQw4w9WgXcQ.mp4").
# snip...
if ($opt_extract_audio) {
    open(TXTFILELIST, "<", "mus.txt") or die $!;
    my @all_dir_files = `dir /b`;
    my $file_to_convert;
    foreach $file_to_convert (<TXTFILELIST>) {
        my @files = grep("/${file_to_convert}\.mp4$/", @all_dir_files); # the problem line!
        print "files: @files\n";
        foreach $file (@files) {
            system("ffmpeg.exe -i ${file} -vn -y -acodec pcm_s16le -ac 2 ${file}.wav");
        }
    }
# snip...
The rest of the snipped code works (I checked it with several videos, replacing vars, commenting, etc.), is legal (I used the strict and warnings pragmas) and, I believe, is irrelevant, because it has nothing to do with defining any vars (besides $opt_extract_audio) used in this snippet. However, this is the one bit of code that's giving me trouble; I can't seem to extract the files that are identified in TXTFILELIST from #all_dir_files. I got the code for 'the problem line' from other Stack Overflow answerers, but it isn't working for some reason.
TL;DR What I want to do is this: list all files in the current dir (say the directory contains mus.txt, "Rick Astley - Never Gonna Give You Up-dQw4w9WgXcQ.mp4", and blah.mp4), choose only the identified file(s) (the Rick Astley video) using the 11-char ID in TXTFILELIST (dQw4w9WgXcQ) and extract the audio from it. And yes, I am running this script on Windows, so I can't use *nix utilities like ack or find.
Remove the line
my #all_dir_files = `dir /b`;
And use this loop instead:
for my $file (<*${file_to_convert}.mp4>) {
    say $file;
    system(...);
}
The <...> above is a glob; it can also be written as glob("*${file_to_convert}.mp4"). I think it is almost always better to use Perl functions rather than rely on system calls.
As has been pointed out, "/${file...$/" is not a regex but a string. And since grep can take an arbitrary expression, and a non-empty string is always true, your grep will essentially do nothing and pass all the values into your array.
Get rid of the double quotes around the regular expression in the grep function.
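Purely as an illustration (not from either answer above), the problem line might then end up looking something like this; note the chomp calls, since both the lines read from TXTFILELIST and the output of dir /b carry trailing newlines:

# Hedged sketch; variable names follow the question's snippet.
chomp(my @all_dir_files = `dir /b`);
while (my $file_to_convert = <TXTFILELIST>) {
    chomp $file_to_convert;
    # A real regex (no surrounding quotes), with the ID escaped via \Q...\E:
    my @files = grep { /\Q$file_to_convert\E\.mp4$/ } @all_dir_files;
    print "files: @files\n";
}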
UPDATE 2: Solved. See below.
I am in the process of converting a big txt file from an old DOS-based library program into a more usable format. I have just started out in Perl and managed to put together a script like this one:
BEGIN { undef $/; }
open $in,  '<', "orig.txt" or die "Can't read old file: $!";
open $out, '>', "mod.txt"  or die "Can't write new file: $!";

while( <$in> )
{
    $C = s/foo/bar/gm;
    print "$C matches replaced.\n";
    # etc...
    print $out $_;
}
close $out;
It is quite fast, but after some time I always get an "Out of Memory" error due to lack of RAM/swap space (I'm on Windows XP with 2 GB of RAM and a 1.5 GB swap file).
After looking around a bit on how to deal with big files, File::Map seemed like a good way to avoid this problem. I'm having trouble implementing it, though.
This is what I have for now:
#!perl -w
use strict;
use warnings;
use File::Map qw(map_file);
my $out = 'output.txt';
map_file my $map, 'input.txt', '<';
$map =~ s/foo/bar/gm;
print $out $map;
However I get the following error: Modification of a read-only value attempted at gott.pl line 8.
Also, I read on the File::Map help page that on non-Unix systems I need to use binmode. How do I do that?
Basically, what I want to do is to "load" the file via File::Map and then run code like the following:
$C = s/foo/bar/gm;
print "$C matches found and replaced.\n";
$C = s/goo/far/gm;
print "$C matches found and replaced.\n";

while(m/complex_condition/gm)
{
    $C = s/complex/regex/gm;
    $run_counter++;
}
print "$C matches replaced. Script looped $run_counter times.\n";
# etc...
I hope that I didn't overlook something too obvious but the example given on the File::Map help page only shows how to read from a mapped file, correct?
EDIT:
In order to better illustrate what I currently can't accomplish due to running out of memory I'll give you an example:
On http://pastebin.com/6Ehnx6xA is a sample of one of our exported library records (txt-format). I'm interested in the +Deskriptoren: part starting on line 46. These are thematic classifiers which are organised in a tree hierarchy.
What I want is to expand each classifier with its complete chain of parent nodes, but only if none of the parent nodes are already present before or after the child node in question. This means turning
+Deskriptoren
-foo
-Cultural Revolution
-bar
into
+Deskriptoren
-foo
-History
-Modern History
-PRC
-Cultural Revolution
-bar
The currently used regex makes use of lookbehind and lookahead in order to avoid duplicates, and is thus slightly more complicated than s/foo/bar/g;:
s/(?<=\+Deskriptoren:\n)((?:-(?!\QParent-Node\E).+\n)*)-(Child-Node_1|Child-Node_2|...|Child-Node_11)\n((?:-(?!Parent-Node).+\n)*)/${1}-Parent-Node\n-${2}\n${3}/g;
But it works! Until Perl runs out of memory that is... :/
So in essence I need a way to do manipulations on a large file (80MB) over several lines. Processing time is not an issue. This is why I thought of File::Map.
Another option could be to process the file in several steps with linked perl-scripts calling each other and then terminating, but I'd like to keep it as much in one place as possible.
UPDATE 2:
I managed to get it working with Schwelm's code below. My script now calls the following subroutine which calls two nested subroutines. Example code is at: http://pastebin.com/SQd2f8ZZ
Still not quite satisfied that I couldn't get File::Map to work. Oh well... I guess that the line-approach is more efficient anyway.
Thanks everyone!
When you set $/ (the input record separator) to undefined, you are "slurping" the file -- reading the entire content of the file at once (this is discussed in perlvar, for example). Hence the out-of-memory problem.
Instead, process it one line at a time, if you can:
while (my $line = <$in>){
    # Do stuff.
}
In situations where the file is small enough and you do slurp the file, there is no need for the while loop. The first read gets everything:
{
    local $/ = undef;
    my $file_content = <>;
    # Do stuff with the complete file.
}
Update
After seeing your massive regex I would urge you reconsider your strategy. Tackle this as a parsing problem: process the file one line at a time, storing information about the parser's state as needed. This approach allows you to work with the information using simple, easily understood (even testable) steps.
Your current strategy -- one might call it the slurp-and-whack-with-massive-regex strategy -- is difficult to understand and maintain (in 3 months, will your regex make immediate sense to you?), difficult to test and debug, and difficult to adjust if you discover unanticipated deviations from your initial understanding of the data. In addition, as you've discovered, the strategy is vulnerable to memory limitations (because of the need to slurp the file).
There are many questions on StackOverflow illustrating how one can parse text when the meaningful units span multiple lines. Also see this question, where I delivered similar advice to another questioner.
Some simple parsing can break the file down into manageable chunks. The algorithm is:
1. Read until you see `+Deskriptoren:`
2. Read everything after that until the next `+Foo:` line
3. Munge that bit.
4. Goto 1.
Here's the sketch of the code:
use strict;
use warnings;
use autodie;

open my $in,  '<', $input_file;
open my $out, '>', $output_file;

while(my $line = <$in>) {
    # Print out everything you don't modify;
    # this includes the +Deskriptoren line.
    print $out $line;

    # When the start of a description block is seen, slurp in up to
    # the next section.
    if( $line =~ m{^ \+Deskriptoren: }x ) {
        my($section, $next_line) = _read_to_next_section($in);

        # Print the modified description
        print $out _munge_description($section);

        # And the following header line.
        print $out $next_line;
    }
}

sub _read_to_next_section {
    my $in = shift;

    my $section = '';
    my $line;
    while( $line = <$in> ) {
        last if $line =~ /^ \+ /x;
        $section .= $line;
    }

    # When reading the last section, there might not be a next line,
    # resulting in $line being undefined.
    $line = '' if !defined $line;

    return($section, $line);
}

# Note: the +Deskriptoren line is not in $description.
sub _munge_description {
    my $description = shift;

    ...whatever you want to do to the description...

    return $description;
}
I haven't tested it, but something like that should do you. It has the advantage over dealing with the whole file as a string (File::Map or otherwise) that you can deal with each section individually rather than trying to cover every base in one regex. It also will let you develop a more sophisticated parser to deal with things like comments and strings that might mess up the simple parsing above and would be a huge pain to adapt a massive regex to.
You are using mode <, which is read-only. If you want to modify the contents, you need read-write access, so you should be using +<.
If you are on windows, and need binary mode, then you should open the file separately, set binary mode on the file handle, then map from that handle.
I also noticed that you have an input file and an output file. If you use File::Map, you are changing the file in-place... that is, you can't open the file for reading and change the contents of a different file. You would need to copy the file, then modify the copy. I've done so below.
use strict;
use warnings;
use File::Map qw(map_handle unmap);
use File::Copy;

copy("input.txt", "output.txt") or die "Cannot copy input.txt to output.txt: $!\n";

open my $fh, '+<', "output.txt"
    or die "Cannot open output.txt in r/w mode: $!\n";
binmode($fh);

map_handle my $contents, $fh, '+<';
my $n_changes = ( $contents =~ s/from/to/gm );
unmap($contents);
close($fh);
The documentation for File::Map isn't very good on how errors are signaled, but from the source, it looks as if $contents being undefined would be a good guess.
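So a defensive check along the following lines might be worthwhile; this is purely a guess based on that reading of the source, not documented behaviour:

map_handle my $contents, $fh, '+<';
die "map_handle appears to have failed: $!"
    unless defined $contents;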
I have downloaded some files from the internet related to a particular topic. Now I wish to check if the files have any duplicates. The issue is that the names of the files would be different, but the content may match.
Is there any way to implement some code, which will iterate through the multiple folders and inform which of the files are duplicates?
If you are working on Linux/*nix systems, you can use SHA tools like sha512sum, now that MD5 can be broken:
find /path -type f -print0 | xargs -0 sha512sum | awk '($1 in seen){print "duplicate: "$2" and "seen[$1] }(!($1 in seen)){seen[$1]=$2}'
If you want to work with Python, here is a simple implementation:
import hashlib, os

def sha(filename):
    ''' function to get sha of file '''
    d = hashlib.sha512()
    try:
        d.update(open(filename).read())
    except Exception, e:
        print e
    else:
        return d.hexdigest()

s = {}
path = os.path.join("/home", "path1")
for r, d, f in os.walk(path):
    for files in f:
        filename = os.path.join(r, files)
        digest = sha(filename)
        if not s.has_key(digest):
            s[digest] = filename
        else:
            print "Duplicates: %s <==> %s " % (filename, s[digest])
If you think that sha512sum is not enough, you can use Unix tools like diff, or filecmp in Python.
You can traverse the folders recursively, compute the MD5 of each file, and then look for duplicate MD5 values; this will find files with duplicate content. Which language do you want to implement this in?
The following is the Perl program to do the above thing:
use strict;
use File::Find;
use Digest::MD5 qw(md5);

my @directories_to_search = ('a', 'e');
my %hash;

find(\&wanted, @directories_to_search);

sub wanted {
    chdir $File::Find::dir;
    if( -f $_ ) {
        my $con = '';
        open F, "<", $_ or die;
        while(my $line = <F>) {
            $con .= $line;
        }
        close F;
        if($hash{md5($con)}) {
            print "Dup found: $File::Find::name and $hash{md5($con)}\n";
        } else {
            $hash{md5($con)} = $File::Find::name;
        }
    }
}
MD5 is a good way to find two identical files, but it is not sufficient to assume that two files are identical! (In practice the risk is small, but it exists.) So you also need to compare the content.
PS: Also, if you just want to check the text content, note that the line-ending character '\n' differs between Windows and Linux.
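For example, a rough sketch (not from the answer above) of confirming an MD5 match with a byte-for-byte comparison, using the core File::Compare module; the file names are placeholders:

use strict;
use warnings;
use File::Compare;

# Hypothetical candidates whose MD5 digests matched:
my ($file_a, $file_b) = ('a/report.txt', 'e/report_copy.txt');

# compare() returns 0 when the contents are byte-for-byte identical.
if (compare($file_a, $file_b) == 0) {
    print "Confirmed duplicates: $file_a <==> $file_b\n";
} else {
    print "MD5 collision or error: contents differ\n";
}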
EDIT:
Reference: two different files can have the same MD5 checksum (MD5 collision vulnerability, Wikipedia):
However, now that it is easy to generate MD5 collisions, it is possible for the person who created the file to create a second file with the same checksum, so this technique cannot protect against some forms of malicious tampering. Also, in some cases the checksum cannot be trusted (for example, if it was obtained over the same channel as the downloaded file), in which case MD5 can only provide error-checking functionality: it will recognize a corrupt or incomplete download, which becomes more likely when downloading larger files.
Do a recursive search through all the files, sorting them by size; for any byte size shared by two or more files, do an MD5 or SHA1 hash computation to see if they are in fact identical.
Regex will not help with this problem.
There are plenty of code examples on the net, I don't have time to knock out this code now. (This will probably elicit some downvotes - shrug!)
Well, I tried and failed, so here I am again.
I need to match my absolute path pattern.
/public_html/mystuff/10000001/001/10/01.cnt
I am in taint mode, etc.
#!/usr/bin/perl -Tw
use CGI::Carp qw(fatalsToBrowser);
use strict;
use warnings;

$ENV{PATH} = "bin:/usr/bin";
delete @ENV{qw(IFS CDPATH BASH_ENV ENV)};
I need to open the same file a couple of times or more, and taint mode forces me to untaint the file name every time. Although I may be doing something else wrong, I still need help constructing this pattern for future reference.
my $file = "$var[5]";

if ($file =~ /(\w{1}[\w-\/]*)/) {
    $under = "/$1\.cnt";
} else {
    ErroR();
}
You can see by my beginner attempt that I am close to clueless.
I had to add the forward slash and extension to $1 due to my poorly constructed, but working, regex.
So, I need help learning how to fix my expression so $1 represents /public_html/mystuff/10000001/001/10/01.cnt
Could someone hold my hand here and show me how to make:
$file =~ /(\w{1}[\w-\/]*)/ match my absolute path /public_html/mystuff/10000001/001/10/01.cnt ?
Thanks for any assistance.
Edit: Using $ in the pattern (as I did before) is not advisable here because it can match \n at the end of the filename. Use \z instead because it unambiguously matches the end of the string.
Be as specific as possible in what you are matching:
my $fn = '/public_html/mystuff/10000001/001/10/01.cnt';

if ( $fn =~ m!
        ^(
            /public_html
            /mystuff
            /[0-9]{8}
            /[0-9]{3}
            /[0-9]{2}
            /[0-9]{2}\.cnt
        )\z!x ) {
    print $1, "\n";
}
Alternatively, you can reduce the vertical space taken up by the code by putting what I assume to be a common prefix, '/public_html/mystuff', in a variable, combining the various components in a qr// construct (see perldoc perlop), and then using the conditional operator ?: :
#!/usr/bin/perl
use strict;
use warnings;
my $fn = '/public_html/mystuff/10000001/001/10/01.cnt';
my $prefix = '/public_html/mystuff';
my $re = qr!^($prefix/[0-9]{8}/[0-9]{3}/[0-9]{2}/[0-9]{2}\.cnt)\z!;
$fn = $fn =~ $re ? $1 : undef;
die "Filename did not match the requirements" unless defined $fn;
print $fn, "\n";
Also, I cannot reconcile using a relative path as you do in
$ENV{PATH} = "bin:/usr/bin";
with using taint mode. Did you mean
$ENV{PATH} = "/bin:/usr/bin";
You talk about untainting the file path every time. That's probably because you aren't compartmentalizing your program steps.
In general, I break up these sort of programs into stages. One of the earlier stages is data validation. Before I let the program continue, I validate all the data that I can. If any of it doesn't fit what I expect, I don't let the program continue. I don't want to get half-way through something important (like inserting stuff into a database) only to discover something is wrong.
So, when you get the data, untaint all of it and store the values in a new data structure. Don't use the original data or the CGI functions after that. The CGI module is just there to hand data to your program. After that, the rest of the program should know as little about CGI as possible.
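A minimal sketch of that idea follows. The parameter name and the allowed-path pattern are made up for illustration; adapt the regex to whatever your real paths look like:

use strict;
use warnings;
use CGI;

my $q = CGI->new;

# Validate and untaint everything up front; die before doing any real work.
my %clean;
my $raw = $q->param('count_file');    # hypothetical CGI parameter
if ( defined $raw
     && $raw =~ m!^(/public_html/mystuff(?:/[0-9]+)+\.cnt)\z! ) {
    $clean{count_file} = $1;          # untainted copy
}
else {
    die "Invalid count_file parameter";
}

# From here on, use only %clean, never the raw CGI values.
open my $fh, '<', $clean{count_file} or die "open $clean{count_file}: $!";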
I don't know what you are doing, but it's almost always a design smell to take actual filenames as input.