extract audio from certain files in working dir in perl - regex

Basically, what I'm trying to do is extract the audio from a set of downloaded YouTube videos, the names of which are (partially) identified in a file (mus.txt) that was opened with the handle TXTFILELIST. TXTFILELIST contains one 11-character identifier for the video on each line (for example, "dQw4w9WgXcQ") and the downloaded file is of the form [title]-[ID].mp4 (in the previous example, "Rick Astley - Never Gonna Give You Up-dQw4w9WgXcQ.mp4").
if ($opt_extract_audio) {
    open(TXTFILELIST, "<", "mus.txt") or die $!;
    my @all_dir_files = `dir /b`;
    my $file_to_convert;
    foreach $file_to_convert (<TXTFILELIST>) {
        my @files = grep("/${file_to_convert}\.mp4$/", @all_dir_files); #the problem line!
        print "files: @files\n";
        foreach $file (@files) {
            system("ffmpeg.exe -i ${file} -vn -y -acodec pcm_s16le -ac 2 ${file}.wav");
        }
    }
}
The rest of the snipped code works (I checked it with several videos, replacing vars, commenting, etc.), is legal (I used the strict and warnings pragmas) and, I believe, is irrelevant, because it has nothing to do with defining any vars (besides $opt_extract_audio) used in this snippet. However, this is the one bit of code that's giving me trouble; I can't seem to extract the files that are identified in TXTFILELIST from @all_dir_files. I got the code for 'the problem line' from other Stack Overflow answerers, but it isn't working for some reason.
TL;DR What I want to do is this: list all files in the current dir (say the directory contains mus.txt, "Rick Astley - Never Gonna Give You Up-dQw4w9WgXcQ.mp4", and blah.mp4), choose only the identified file(s) (the Rick Astley video) using the 11-char ID in TXTFILELIST (dQw4w9WgXcQ) and extract the audio from it. And yes, I am running this script on Windows, so I can't use *nix utilities like ack or find.

Remove the line
my @all_dir_files = `dir /b`;
And use this loop instead:
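chomp $file_to_convert; # lines read from TXTFILELIST still end in a newline
for my $file (<*${file_to_convert}.mp4>) {
    say $file; # say needs "use feature 'say';" (or "use v5.10;")
}
The <...> above is a glob, which can also be written glob "*${file_to_convert}.mp4". I think it is almost always better to use Perl functions rather than rely on system calls.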
As has been pointed out, "/${file...$/" is not a regex, but a string. And since you can use expressions with grep, and a non-empty string is always true, your grep will essentially do nothing, and pass all the values into your array.

Get rid of the double quotes around the regular expression in the grep function.


How to run list of perl regex from file in terminal

I am working on a directory with many .txt files in them and have a file with looong list of regex like "perl -p -i -e 's/\n\n/\n/g' *.xml" they all work if I copy them to terminal. But is there a possibility to run them straight from the file?
I tried ./unicode.sh but that resulted in:
No such file or directory.
Here's a (mostly) equivalent Perl script to the oneliner perl -p -i -e 's/\n\n/\n/g' *.xml (one main difference being that this has strict and warnings enabled, which is strongly recommended), which you could expand upon by putting more code to modify the current line in the body of the while loop.
#!/usr/bin/env perl
use warnings;
use strict;

if (!@ARGV) {                # if no files on command line
    @ARGV = glob('*.xml');   # get a default list of files
}
local $^I = '';              # enable inplace editing (like perl -i)
while (<>) {                 # read each line of each file into $_
    s/\n\n/\n/g;             # modify $_ with a regex
    # more regexes here...
    print;                   # write the line $_ back out
}
You can save this script in a file such as process.pl, and then run it with perl process.pl, or do chmod u+x process.pl and then run it via ./process.pl.
On the other hand, you really shouldn't modify XML files with regular expressions, there are lots of Perl modules to do XML processing - I wrote about that some more here. Also, in the example you showed, s/\n\n/\n/g actually won't have any effect, since when reading files line-by-line, no string will contain two \n's (you can change how Perl reads files, but I don't see any mention of that in the question).
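To make the point about line-by-line reading concrete, here is a small Python sketch (Python is used only for illustration, and both helper names are made up for this example). It applies the same substitution once per line, which is what perl -p does, and once to the whole text, which is what you would get in Perl's slurp mode (perl -0777 -p).

```python
import re

def squeeze_per_line(text):
    """Apply the substitution to each line separately, as perl -p does."""
    # Each piece holds a single line, so it never contains two newlines in a row.
    lines = text.splitlines(keepends=True)
    return "".join(re.sub(r"\n\n", "\n", line) for line in lines)

def squeeze_whole(text):
    """Apply the same substitution to the whole text at once (slurp mode)."""
    return re.sub(r"\n\n", "\n", text)

sample = "a\n\nb\n"
```

With this sample, squeeze_per_line(sample) returns the text unchanged, while squeeze_whole(sample) removes the empty line.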
Edit: You've named the script in your example unicode.sh - if you're processing Unicode files, then Perl has very powerful features to help with that, although the code won't necessarily end up as nice and short as what I've shown above. You'll have to tell us some more about what you're doing, and show some example input and output, to get suggestions about that. See also e.g. perlunitut.
It's likely if you got no such file or directory, your problem was you forgot to make unicode.sh executable, as in chmod +x unicode.sh, assuming that's a script that you wrote.
Of course the normal way to run multiple perl commands is this thing that looks like runme.pl which you write, i.e., a perl script.
That said, yes, everything will work from the terminal, you just need to be careful about escaping that bash performs.

bsd_glob behaving differently on different machines

I am using bsd_glob to get a list of files matching a regular expression for file path. My perl utility is working on RHEL, but not on Suse 11/AIX/Solaris, for the exact same set of files and the same regular expression. I googled for any limitations of bsd_glob, but couldn't find much information. Can someone point out what's wrong?
Below is the regular expression for the file path I am searching for:
I need all files beginning with DATA, in any directory present under 'level_one'.
This works perfectly on my RHEL box, but not on any other Unix and Suse Linux.
Below is the code snippet where I am using bsd_glob:
foreach my $file (bsd_glob ( "$fileName", GLOB_ERR )) {
    if ($fileName =~ /[[:alnum:]]\*\/\*$/) {
        next if -d $file;
        $fileList{$file} = $permissions;
    }
    elsif ($fileName =~ /[[:alnum:]]\*$/) {
        $fileList{$file} = $permissions;
    }
    else {
        $fileList{$file} = $permissions;
    }
}
In this case where I am facing the issue, /datafiles/data_one/level_one/*/DATA* is being passed to bsd_glob. I am creating a map ($fileList) of files that are returned by bsd_glob based on the regular expression I am passing to it. $permissions is a predefined value.
Any help is appreciated.
The problem here looks to be that you're confusing glob patterns and regular expressions.
You're looking for a file called * with that, under a directory containing a literal *.
Whilst that is technically possible it's really very strange. And simply cannot ever match the patterns your glob should find.
Do you perhaps mean:
(different delimiter for clarity)
Also - why are you using bsd_glob specifically? From File::Glob:
Since v5.6.0, Perl's CORE::glob() is implemented in terms of bsd_glob(). Note that they don't share the same prototype--CORE::glob() only accepts a single argument. Due to historical reasons, CORE::glob() will also split its argument on whitespace, treating it as multiple patterns, whereas bsd_glob() considers them as one pattern. But see :bsd_glob under EXPORTS, below.
I used bsd_glob instead of glob as there was slight difference in the way it works on different UNIX platforms. Specifically, for the above mentioned pattern, on some UNIX platforms, it didn't return a file having exact name 'DATA', and only returned files with something appended to DATA.
I'm a little surprised at that, as they should be implementing the same mechanisms and the same POSIX standard on globbing. Is there any chance there's a permissions related problem instead?
But otherwise you could perhaps try not using glob to do the heavy lifting, and instead just compare the file name to a bunch of regular expressions. (Although note - REs have very different syntax)
foreach my $file ( glob('/datafiles/data_one/level_one/*/*') ) {
    # keep only paths whose last component starts with DATA (a file named exactly DATA also matches)
    next unless $file =~ m,/DATA[^/]*$,;
}
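The difference in syntax between glob patterns and regular expressions is easy to see on a few made-up file names. The following is only an illustrative sketch in Python (the names list and both helper functions are invented for this example); it uses the standard fnmatch module for glob-style matching and re for regular expressions.

```python
import fnmatch
import re

# Hypothetical file names, chosen only to show the difference.
names = ["DATA", "DATA_01.csv", "DATAAA", "notes.txt"]

def glob_matches(pattern, candidates):
    """Names matching a shell-style glob pattern (case-sensitive)."""
    return [n for n in candidates if fnmatch.fnmatchcase(n, pattern)]

def regex_matches(pattern, candidates):
    """Names that the regular expression matches in full."""
    return [n for n in candidates if re.fullmatch(pattern, n)]

as_glob = glob_matches("DATA*", names)      # '*' means "any run of characters"
as_regex = regex_matches("DATA*", names)    # 'A*' means "zero or more A"
as_regex_fixed = regex_matches(r"DATA.*", names)
```

Read as a glob, DATA* selects DATA, DATA_01.csv and DATAAA. Read as a regular expression, the same text only accepts DAT followed by any number of A characters, so DATA_01.csv is rejected; the regex equivalent of the glob is DATA.* instead.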

Powershell: Read a section of a file into a variable

I'm trying to create a kind of a polyglot script. It's not a true polyglot because it actually requires multiple languages to perform, although it can be "bootstrapped" by either Shell or Batch. I've got this part down no problem.
The part I'm having trouble with is a bit of embedded Powershell code, which needs to be able to load the current file into memory and extract a certain section that is written in yet another language, store it in a variable, and finally pass it into an interpreter. I have an XML-like tagging system that I'm using to mark sections of the file in a way that will hopefully not conflict with any of the other languages. The markers look like this:
# <{LANGB}>
... code in language B ...
... code in language B ...
... code in language B ...
# <{/LANGB}>
The #'s are comment markers, but the comment markers can be different things depending on the language of the section.
The problem I have is that I can't seem to find a way to isolate just that section of the file. I can load the entire file into memory, but I can't get the stuff between the tags out. Here is my current code:
SETLOCAL EnableDelayedExpansion
powershell -ExecutionPolicy unrestricted -Command ^
$re = '(?m)^<{LANGB}^>(.*)^<{/LANGB}^>';^
$lang_b_code = ([IO.File]::ReadAllText(^'%0^') -replace $re,'$1');^
echo "${re}";^
echo "Contents: ${lang_b_code}";
Everything I've tried so far results in the entire file being output in the Contents rather than just the code between the markers. I've tried different methods of escaping the symbols used in the markers, but it always results in the same thing.
NOTE: The use of the ^ is required because the top-level interpreter is Batch, which hangs up on the angle brackets and other random things.
Since there is just one block, you can use the regex
$re = '(?s)^<{LANGB}^>(.*)^^.*^<{/LANGB}^>';^
but with -match operator, and then access the text using $matches[1] variable that is set as a result of -match.
So, after the regex declaration, use
[IO.File]::ReadAllText(^'%0^') -match $re;^
echo $matches[1];
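The idea of matching across lines and then reading the captured group, rather than replacing, can be sketched outside of Batch and PowerShell as well. The Python version below is only an illustration under the assumption that each marker sits on its own line as in the question; extract_section is a made-up helper name. It avoids the escaping layers entirely, which makes the regex itself easier to check.

```python
import re

def extract_section(text, tag):
    """Return the text between the '<{tag}>' and '<{/tag}>' marker lines, or None."""
    pattern = re.compile(
        r"^[^\n]*<\{" + re.escape(tag) + r"\}>[^\n]*\n"  # opening marker line
        r"(.*?)"                                           # section body, non-greedy
        r"^[^\n]*<\{/" + re.escape(tag) + r"\}>",          # closing marker line
        re.DOTALL | re.MULTILINE,
    )
    match = pattern.search(text)
    return match.group(1) if match else None

example = (
    "echo bootstrap\n"
    "# <{LANGB}>\n"
    "code line 1\n"
    "code line 2\n"
    "# <{/LANGB}>\n"
    "echo done\n"
)
```

Here extract_section(example, "LANGB") returns only the two code lines. The DOTALL flag plays the role of (?s) in the answer above, letting the dot cross line breaks, and the comment prefix in front of each marker is skipped by the [^\n]* parts.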

Reorganizing large amount of files with regex?

I have a large amount of files organized in a hierarchy of folders and particular file name notations and extensions. What I need to do, is write a program to walk through the tree of files and basically rename and reorganize them. I also need to generate a report of the changes and information about the transformed organization along with statistics.
The solution that I can see is to walk through the tree of files just like any other tree data structure, and use regular expressions on the path names of the files. This seems very doable and not a huge amount of work. My questions are: are there tools I should be using other than just C# and regex? Perl comes to mind since I know it was originally designed for report generation, but I have no experience with the language. And also, is using regex for this situation viable, given that I have only used it for file CONTENTS, not file names and organization?
Yes, Perl can do this. Here's something pretty simple:
#! /usr/bin/env perl
use strict;
use warnings;
use File::Find;

my $directory = "."; #Or whatever directory tree you're looking for...
find (\&wanted, $directory);

sub wanted {
    print "Full File Name = <$File::Find::name>\n";
    print "Directory Name = <$File::Find::dir>\n";
    print "Basename = <$_>\n";
    # Using tests to see various things about the file.
    # find() has already changed into $File::Find::dir, so test the basename $_.
    if (-f $_) {
        print "File <$File::Find::name> is a file\n";
    }
    if (-d $_) {
        print "Directory <$File::Find::name> is a directory\n";
    }
    # Using regular expressions on the file name
    if ($File::Find::name =~ /beans/) {
        print "The file <$File::Find::name> contains the string <beans>\n";
    }
}
The find command takes the directory, and calls the wanted subroutine for each file and directory in the entire directory tree. It is up to that subroutine to figure out what to do with that file.
As you can see, you can do various tests on the file, and use regular expressions to parse the file's name. You can also move, rename, or delete the file to your heart's content.
Perl will do exactly what you want. Now, all you have to do is learn it.
If you can live with glob patterns instead of regular expressions, mmv might be an option.
> ls
a1.txt a2.txt b34.txt
> mmv -v "?*.txt" "#2 - #1.txt"
a1.txt -> 1 - a.txt : done
a2.txt -> 2 - a.txt : done
b34.txt -> 34 - b.txt : done
Directories at any depth can be reorganized, too. Check out the manual. If you run Windows, you can find the tool in Cygwin.
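For comparison, the walk-the-tree-and-match-names approach described in the question can be sketched in a few lines of Python. This is only a rough illustration, not a finished tool: plan_renames and apply_renames are hypothetical helper names, and the pattern in the usage note simply mirrors the mmv example above. Building the list of planned renames first also gives you the raw material for the report.

```python
import os
import re

def plan_renames(root, pattern, replacement):
    """Return (old_path, new_path) pairs for files whose name the regex changes."""
    regex = re.compile(pattern)
    plan = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            new_name = regex.sub(replacement, name)
            if new_name != name:
                plan.append((os.path.join(dirpath, name),
                             os.path.join(dirpath, new_name)))
    return plan

def apply_renames(plan):
    """Perform the planned renames, refusing to overwrite an existing file."""
    for old, new in plan:
        if os.path.exists(new):
            raise FileExistsError(new)
        os.rename(old, new)
```

Calling plan_renames(".", r"^([a-z])(\d+)\.txt$", r"\2 - \1.txt") would plan a1.txt -> 1 - a.txt and b34.txt -> 34 - b.txt without touching anything; apply_renames then carries the plan out once you have reviewed it.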

Finding duplicate files by content across multiple directories

I have downloaded some files from the internet related to a particular topic. Now I wish to check if the files have any duplicates. The issue is that the names of the files would be different, but the content may match.
Is there any way to implement some code, which will iterate through the multiple folders and inform which of the files are duplicates?
if you are working on linux/*nix systems, you can use sha tools like sha512sum, now that md5 can be broken.
find /path -type f -print0 | xargs -0 sha512sum | awk '($1 in seen){print "duplicate: "$2" and "seen[$1] }(!($1 in seen)){seen[$1]=$2}'
if you want to work with Python, a simple implementation
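import hashlib, os
def sha(filename):
    ''' function to get sha of file '''
    d = hashlib.sha512()
    try:
        with open(filename, "rb") as fh:
            d.update(fh.read())
    except OSError as e:
        print(e)
        return None
    return d.hexdigest()

path = "."  # placeholder: the directory tree you want to scan
s = {}      # digest -> first file seen with that digest
for r, d, f in os.walk(path):
    for files in f:
        filename = os.path.join(r, files)
        digest = sha(filename)
        if digest is None:  # unreadable file, already reported above
            continue
        first = s.setdefault(digest, filename)
        if first != filename:
            print("Duplicates: %s <==> %s " % (filename, first))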
if you think that sha512sum is not enough, you can use unix tools like diff, or filecmp (Python)
You can traverse the folders recursively and find the MD5 of each file and then look for duplicate MD5 values, this will give duplicate files content wise. Which language do you want to implement this in?
The following is the Perl program to do the above thing:
use strict;
use warnings;
use File::Find;
use Digest::MD5 qw(md5);

my @directories_to_search = ('a','e');
my %hash;
find(\&wanted, @directories_to_search);

sub wanted {
    # find() has already changed into $File::Find::dir, so no chdir is needed here
    if( -f $_) {
        my $con = '';
        open my $fh, "<", $_ or die "Cannot open $File::Find::name: $!";
        binmode $fh; # compare raw bytes
        while(my $line = <$fh>) {
            $con .= $line;
        }
        close $fh;
        my $digest = md5($con);
        if($hash{$digest}) {
            print "Dup found: $File::Find::name and $hash{$digest}\n";
        } else {
            $hash{$digest} = $File::Find::name;
        }
    }
}
MD5 is a good way to find candidate identical files, but a matching checksum alone is not sufficient to conclude that two files are identical (in practice the risk is small, but it exists), so you also need to compare the contents.
PS: If you only want to compare text content, keep in mind that the line ending differs between Windows ('\r\n') and Linux ('\n').
Reference: two different file can have the same md5 checksum: (MD5 collision vulnerability (wikipedia))
However, now that it is easy to generate MD5 collisions, it is possible for the person who created the file to create a second file with the same checksum, so this technique cannot protect against some forms of malicious tampering. Also, in some cases the checksum cannot be trusted (for example, if it was obtained over the same channel as the downloaded file), in which case MD5 can only provide error-checking functionality: it will recognize a corrupt or incomplete download, which becomes more likely when downloading larger files.
Do a recursive search through all the files, sorting them by size, any byte sizes with two or more files, do an MD5 hash or a SHA1 hash computation to see if they are in fact identical.
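A minimal Python sketch of that size-first approach could look like the following. It is an illustration only: file_digest and find_duplicates are made-up names, and SHA-256 is swapped in for the MD5 or SHA1 mentioned above. Files are grouped by size first, so only files that share a size with at least one other file are ever read and hashed.

```python
import hashlib
import os
from collections import defaultdict

def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(*roots):
    """Return lists of paths (each list sorted) whose contents hash identically."""
    by_size = defaultdict(list)
    for root in roots:
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                full = os.path.join(dirpath, name)
                if os.path.isfile(full) and not os.path.islink(full):
                    by_size[os.path.getsize(full)].append(full)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_digest = defaultdict(list)
        for p in paths:
            by_digest[file_digest(p)].append(p)
        groups.extend(sorted(g) for g in by_digest.values() if len(g) > 1)
    return groups
```

find_duplicates("dir1", "dir2") returns one list per set of identical files. If a hash match is not enough for your purposes, a final byte-by-byte comparison (for example with filecmp.cmp(a, b, shallow=False)) can confirm each group.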
Regex will not help with this problem.
There are plenty of code examples on the net, I don't have time to knock out this code now. (This will probably elicit some downvotes - shrug!)