Reorganizing a large number of files with regex?

I have a large number of files organized in a hierarchy of folders, with particular file-name notations and extensions. What I need to do is write a program to walk through the tree of files and basically rename and reorganize them. I also need to generate a report of the changes and information about the transformed organization, along with statistics.
The solution that I can see is to walk through the tree of files just like any other tree data structure, and use regular expressions on the path names of the files. This seems very doable and not a huge amount of work. My questions are: are there tools I should be using other than just C# and regex? Perl comes to mind, since I know it was originally designed for report generation, but I have no experience with the language. Also, is using regex viable for this situation? I have only used it on file CONTENTS, not file names and organization.

Yes, Perl can do this. Here's something pretty simple:
#! /usr/bin/env perl
use strict;
use warnings;
use File::Find;
my $directory = "."; #Or whatever directory tree you're looking for...
find (\&wanted, $directory);
sub wanted {
    print "Full File Name = <$File::Find::name>\n";
    print "Directory Name = <$File::Find::dir>\n";
    print "Basename = <$_>\n";
    # Using tests to see various things about the file
    if (-f $File::Find::name) {
        print "File <$File::Find::name> is a file\n";
    }
    if (-d $File::Find::name) {
        print "Directory <$File::Find::name> is a directory\n";
    }
    # Using regular expressions on the file name
    if ($File::Find::name =~ /beans/) {
        print "The file <$File::Find::name> contains the string <beans>\n";
    }
}
The find command takes the directory, and calls the wanted subroutine for each file and directory in the entire directory tree. It is up to that subroutine to figure out what to do with that file.
As you can see, you can do various tests on the file, and use regular expressions to parse the file's name. You can also move, rename, or delete the file to your heart's content.
Perl will do exactly what you want. Now, all you have to do is learn it.
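For instance, a minimal sketch of a regex-driven rename inside the wanted callback might look like this (the lowercase-extension rule is made up purely for illustration; substitute your own pattern):
use strict;
use warnings;
use File::Find;

find(sub {
    return unless -f $_;                         # only plain files
    # Hypothetical rule: normalize an upper-case .TXT extension to .txt
    (my $new = $_) =~ s/\.TXT$/.txt/ or return;
    rename $_, $new or warn "Could not rename $File::Find::name: $!";
}, '.');
Because File::Find chdirs into each directory by default, the rename only needs the basename; to move files between directories you could use move from the core File::Copy module instead.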

If you can live with glob patterns instead of regular expressions, mmv might be an option.
> ls
a1.txt a2.txt b34.txt
> mmv -v "?*.txt" "#2 - #1.txt"
a1.txt -> 1 - a.txt : done
a2.txt -> 2 - a.txt : done
b34.txt -> 34 - b.txt : done
Directories at any depth can be reorganized, too. Check out the manual. If you run Windows, you can find the tool in Cygwin.

Related

Iterating over directory with specified path in Bash

pathToBins=$1
bins="${pathToBins}contigs.fa.metabat-bins-*"
for fileName in $bins
do
echo $fileName
done
My goal is to attach a path to my file names. I can iterate over a folder and get the file names when I don't attach the path. My challenge is that when I add the path, echo $fileName no longer expands the pattern and I get "/home/erikrasmussen/Desktop/Script/realLargeMetaBatBinscontigs.fa.metabat-bins-*", where the wildcard '*' is treated like a literal string. How can I get the path and also the full file name while iterating over a folder of files?
Although I don't really know how your files are arranged on your hard drive, a casual glance at "/home/erikrasmussen/Desktop/Script/realLargeMetaBatBinscontigs.fa.metabat-bins-*" suggests that it is missing a / before contigs. If that is the case, then you should change your definition of bins to:
bins="${pathToBins}/contigs.fa.metabat-bins-*"
However, it is much more robust to use bash arrays instead of relying on filenames to not include whitespace and metacharacters. So I would suggest:
bins=(${pathToBins}/contigs.fa.metabat-bins-*)
for fileName in "${bins[@]}"
do
echo "$fileName"
done
Bash normally does not expand a pattern which doesn't match any file, so in that case you will see the original pattern. If you use the array formulation above, you could set the bash option nullglob, which will cause the unmatched pattern to vanish instead, leaving an empty array.

Read from text file, capture text, write to another text file in specific syntax

I have a text file A with the following syntax:
Attribute_Name, 'Path', 'Tutorial';
Attribute_Name2, 'Path2', 'Tutorial';
....
What I need to do is to read from that file, capture those 3 values: Attribute Name, Path and Project Name (tutorial in that case) and write it to output text file, B with the following syntax:
DELETE ATTRIBUTE "Attribute_Name" IN FOLDER "Path" FROM PROJECT "Tutorial";
and repeat for as many iterations as there are lines in the input file.
What is the best (easiest) language to implement this? Can anyone provide example code?
I'd personally do something like that with Perl, because I'm familiar with Perl and it works great for these kinds of tasks. You can also write a sed one-liner to get that done.
If you're not a fan of Perl, any modern dynamic language should let you get the job done with minimal effort.
EDIT: An example Perl script (full file for readability) would look like this:
use warnings;
use strict;
while (my $line = <>) {
    # Doesn't handle internal escaping; skip lines that don't match
    next unless $line =~ /^\s*(.+?), '(.+?)', '(.+?)';$/;
    print "DELETE ATTRIBUTE \"$1\" IN FOLDER \"$2\" FROM PROJECT \"$3\";\n";
}
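Assuming the script is saved as delete_attrs.pl (a file name chosen here purely for illustration), it could be run as:
perl delete_attrs.pl A.txt > B.txt
which reads the input file A and redirects the generated DELETE statements into B.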

Extract unique lines from files (with a pattern) recursively from directory/subdirectories

I have a huge java codebase (more than 10,000 java classes) that makes extensive use of CORBA (no documentation available on its usage though).
As first step to figure out the CORBA usage, I decided to scan entire codebase and extract/print unique lines which contain the pattern "org.omg.CORBA". These are usually in the import statements (e.g. import org.omg.CORBA.x.y.z).
I am a newbie to Perl and want to know if there is a way I can extract these details on Windows. I need to be able to scan all folders (and sub-folders) that contain Java classes.
You can use File::Find in a one-liner:
perl -MFile::Find -lwe "
find(sub { if (-f && /\.java$/) { push @ARGV, $File::Find::name } }, '.');
while(<>) { /org.omg.CORBA/ && $seen{$_}++; };
print for keys %seen;"
Note that this one-liner is using the double quotes required for Windows.
This will search the current directory recursively for files with extension .java and add them to the @ARGV array. Then we use the diamond operator to open the files and search for the string org.omg.CORBA; if it is found, that line is added as a key to the %seen hash, which effectively removes duplicates. The last statement prints out all the unique keys in the hash.
In script form it looks like this:
use strict;
use warnings;
use File::Find;
find(sub { if (-f && /\.java$/) { push @ARGV, $File::Find::name } }, '.');
my %seen;
while (<>) {
    /org.omg.CORBA/ && $seen{$_}++;
}
print "$_\n" for keys %seen;
Just for fun, a perl one-liner to do this:
perl -lne '/org.omg.CORBA/ and (++$seen{$_}>1 or print)' *
This first checks if a line matches and then if it has not seen it before prints out the line. That is done for all files specified (in this case '*').
I don't mean to be contrarian, but I'm not sure Perl is the best solution here. nhahtdh's suggestion of using Cygwin is a good one: grep or find is really what you want. Using Perl in this instance will involve using File::Find and then opening a filehandle on every file. That's certainly doable, but, if possible, I'd suggest using the right tool for the job.
find . -name "*.java" -type f | xargs grep -h 'org.omg.CORBA' | sort | uniq
If you really must use Perl for this job, we can work up the File::Find code.

Extract audio from certain files in working dir in Perl

Basically, what I'm trying to do is extract the audio from a set of downloaded YouTube videos, the names of which are (partially) identified in a file (mus.txt) that was opened with the handle TXTFILELIST. TXTFILELIST contains one 11-character identifier for the video on each line (for example, "dQw4w9WgXcQ") and the downloaded file is of the form [title]-[ID].mp4 (in the previous example, "Rick Astley - Never Gonna Give You Up-dQw4w9WgXcQ.mp4").
#snip...
if ($opt_extract_audio) {
    open(TXTFILELIST, "<", "mus.txt") or die $!;
    my @all_dir_files = `dir /b`;
    my $file_to_convert;
    foreach $file_to_convert (<TXTFILELIST>) {
        my @files = grep("/${file_to_convert}\.mp4$/", @all_dir_files); #the problem line!
        print "files: @files\n";
        foreach $file (@files) {
            system("ffmpeg.exe -i ${file} -vn -y -acodec pcm_s16le -ac 2 ${file}.wav");
        }
    }
#snip...
The rest of the snipped code works (I checked it with several videos, replacing vars, commenting, etc.), is legal (I used the strict and warnings pragmas) and, I believe, is irrelevant, because it has nothing to do with defining any vars (besides $opt_extract_audio) used in this snippet. However, this is the one bit of code that's giving me trouble; I can't seem to extract the files that are identified in TXTFILELIST from @all_dir_files. I got the code for 'the problem line' from other Stack Overflow answers, but it isn't working for some reason.
TL;DR What I want to do is this: list all files in the current dir (say the directory contains mus.txt, "Rick Astley - Never Gonna Give You Up-dQw4w9WgXcQ.mp4", and blah.mp4), choose only the identified file(s) (the Rick Astley video) using the 11-char ID in TXTFILELIST (dQw4w9WgXcQ) and extract the audio from it. And yes, I am running this script on Windows, so I can't use *nix utilities like ack or find.
Remove the line
my @all_dir_files = `dir /b`;
And use this loop instead:
for my $file (<*${file_to_convert}.mp4>) {
    say $file;
    system(...);
}
The <...> above is a glob; it can also be written glob "*${file_to_convert}.mp4". I think it is almost always better to use Perl functions rather than to shell out to external commands.
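One detail the fragment above glosses over: each line read from TXTFILELIST still ends with a newline, so it needs a chomp before it can be used in the glob. A minimal sketch putting the pieces together (assuming mus.txt holds one 11-character ID per line and ffmpeg.exe is on the PATH) might look like this:
use strict;
use warnings;
use feature 'say';

open my $ids, '<', 'mus.txt' or die "Cannot open mus.txt: $!";
while (my $id = <$ids>) {
    chomp $id;                    # strip the trailing newline
    next unless length $id;       # skip blank lines
    for my $file (glob "*${id}.mp4") {
        say "Extracting audio from $file";
        # Passing a list to system() avoids shell-quoting problems
        # with spaces in the video titles.
        system('ffmpeg.exe', '-i', $file, '-vn', '-y',
               '-acodec', 'pcm_s16le', '-ac', '2', "$file.wav");
    }
}
close $ids;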
As has been pointed out, "/${file...$/" is not a regex, but a string. And since you can use expressions with grep, and a non-empty string is always true, your grep will essentially do nothing, and pass all the values into your array.
Get rid of the double quotes around the regular expression in the grep function.

Finding duplicate files by content across multiple directories

I have downloaded some files from the internet related to a particular topic. Now I wish to check if the files have any duplicates. The issue is that the names of the files would be different, but the content may match.
Is there any way to implement some code, which will iterate through the multiple folders and inform which of the files are duplicates?
If you are working on Linux/*nix systems, you can use SHA tools such as sha512sum, since MD5 can be broken.
find /path -type f -print0 | xargs -0 sha512sum | awk '($1 in seen){print "duplicate: "$2" and "seen[$1] }(!($1 in seen)){seen[$1]=$2}'
If you want to work with Python, here is a simple implementation:
import hashlib
import os

def sha(filename):
    '''Return the SHA-512 hex digest of a file's contents.'''
    d = hashlib.sha512()
    try:
        with open(filename, 'rb') as f:   # read as bytes so binary files hash correctly
            d.update(f.read())
    except Exception as e:
        print(e)
    else:
        return d.hexdigest()

s = {}
path = os.path.join("/home", "path1")
for r, d, f in os.walk(path):
    for name in f:
        filename = os.path.join(r, name)
        digest = sha(filename)
        if digest not in s:
            s[digest] = filename
        else:
            print("Duplicates: %s <==> %s" % (filename, s[digest]))
If you think that sha512sum is not enough, you can use Unix tools like diff, or filecmp in Python.
You can traverse the folders recursively, find the MD5 of each file, and then look for duplicate MD5 values; this will give you duplicates by content. Which language do you want to implement this in?
The following is the Perl program to do the above thing:
use strict;
use warnings;
use File::Find;
use Digest::MD5 qw(md5);

my @directories_to_search = ('a', 'e');
my %hash;

find(\&wanted, @directories_to_search);

sub wanted {
    chdir $File::Find::dir;
    if (-f $_) {
        my $con = '';
        open F, "<", $_ or die;
        while (my $line = <F>) {
            $con .= $line;
        }
        close F;
        if ($hash{md5($con)}) {
            print "Dup found: $File::Find::name and $hash{md5($con)}\n";
        } else {
            $hash{md5($con)} = $File::Find::name;
        }
    }
}
MD5 is a good way to find two identical files, but it is not sufficient to assume that two files are identical! (In practice the risk is small, but it exists.) So you also need to compare the content.
PS: Also, if you only care about text content, note that the line-ending character ('\n') differs between Windows and Linux.
EDIT:
Reference: two different files can have the same MD5 checksum (MD5 collision vulnerability, Wikipedia):
However, now that it is easy to generate MD5 collisions, it is possible for the person who created the file to create a second file with the same checksum, so this technique cannot protect against some forms of malicious tampering. Also, in some cases the checksum cannot be trusted (for example, if it was obtained over the same channel as the downloaded file), in which case MD5 can only provide error-checking functionality: it will recognize a corrupt or incomplete download, which becomes more likely when downloading larger files.
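If you want that extra safety, one option (not from the original answer, just a sketch) is to confirm any hash match with a byte-by-byte comparison; Perl's core File::Compare module does exactly that:
use strict;
use warnings;
use File::Compare;

# Hypothetical candidate pair that shared an MD5 digest
my ($file_a, $file_b) = ('dir1/report.pdf', 'dir2/copy.pdf');

# compare() returns 0 when the two files have identical content
if (compare($file_a, $file_b) == 0) {
    print "$file_a and $file_b are byte-for-byte identical\n";
} else {
    print "$file_a and $file_b differ (hash collision or read error)\n";
}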
Do a recursive search through all the files, sorting them by size; for any size shared by two or more files, do an MD5 hash or a SHA1 hash computation to see whether they are in fact identical.
Regex will not help with this problem.
There are plenty of code examples on the net, I don't have time to knock out this code now. (This will probably elicit some downvotes - shrug!)
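For completeness, here is a rough sketch of the size-then-hash approach just described, using the core File::Find and Digest::SHA modules; treat it as illustrative rather than a polished tool:
use strict;
use warnings;
use File::Find;
use Digest::SHA;

# Group files by size first; only files of equal size can be duplicates.
my %by_size;
find({ no_chdir => 1, wanted => sub {
    return unless -f $File::Find::name;
    my $size = -s _;
    push @{ $by_size{$size} }, $File::Find::name;
} }, '.');

for my $size (keys %by_size) {
    my @files = @{ $by_size{$size} };
    next if @files < 2;                      # a unique size cannot have a duplicate

    # Hash only the candidates that share a size.
    my %by_digest;
    for my $file (@files) {
        my $digest = Digest::SHA->new(256)->addfile($file, 'b')->hexdigest;
        push @{ $by_digest{$digest} }, $file;
    }
    for my $group (values %by_digest) {
        print "Possible duplicates: @$group\n" if @$group > 1;
    }
}
Grouping by size first keeps the expensive hashing limited to files that could possibly match.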