bsd_glob behaving differently on different machines - regex

I am using bsd_glob to get a list of files matching a regular expression for the file path. My Perl utility works on RHEL, but not on Suse 11/AIX/Solaris, for the exact same set of files and the same regular expression. I googled for limitations of bsd_glob but couldn't find much information. Can someone point out what's wrong?
Below is the regular expression for the file path I am searching for:
/datafiles/data_one/level_one/*/DATA*
I need all files beginning with DATA, in any directory present under 'level_one'.
This works perfectly on my RHEL box, but not on Suse Linux or any of the other Unix platforms.
Below is the code snippet where I am using bsd_glob:
foreach my $file (bsd_glob("$fileName", GLOB_ERR)) {
    if ($fileName =~ /[[:alnum:]]\*\/\*$/) {
        next if -d $file;
        $fileList{$file} = $permissions;
        $total++;
    }
    elsif ($fileName =~ /[[:alnum:]]\*$/) {
        $fileList{$file} = $permissions;
        $total++;
    }
    else {
        $fileList{$file} = $permissions;
        $total++;
    }
}
In the case where I am facing the issue, /datafiles/data_one/level_one/*/DATA* is being passed to bsd_glob. I am creating a map (%fileList) of the files that bsd_glob returns for the pattern I pass to it. $permissions is a predefined value.
Any help is appreciated.

The problem here looks to be that you're confusing glob patterns and regular expressions.
/[[:alnum:]]\*\/\*$/
/[[:alnum:]]\*$/
With that you're looking for a file literally called *, under a directory whose name ends in a literal *. While that is technically possible, it's very strange, and it can never match the paths your glob should actually find.
Do you perhaps mean:
m,\w+.*/.*$,
(different delimiter for clarity)
Also - why are you using bsd_glob specifically? From File::Glob:
Since v5.6.0, Perl's CORE::glob() is implemented in terms of bsd_glob(). Note that they don't share the same prototype--CORE::glob() only accepts a single argument. Due to historical reasons, CORE::glob() will also split its argument on whitespace, treating it as multiple patterns, whereas bsd_glob() considers them as one pattern. But see :bsd_glob under EXPORTS, below.
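That whitespace difference alone can bite; here's a minimal sketch (hypothetical patterns):
use File::Glob qw(bsd_glob);

# CORE::glob splits its argument on whitespace, so this is two patterns:
my @both = glob('*.txt *.log');      # all .txt files plus all .log files

# bsd_glob treats the string as a single pattern, so it only matches a
# name that literally contains a space:
my @one = bsd_glob('*.txt *.log');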
Comment:
I used bsd_glob instead of glob as there was a slight difference in the way it works on different UNIX platforms. Specifically, for the above-mentioned pattern, on some UNIX platforms glob didn't return a file named exactly 'DATA', and only returned files with something appended to DATA.
I'm a little surprised at that, as they should be implementing the same mechanisms and the same POSIX standard on globbing. Is there any chance there's a permissions related problem instead?
But otherwise you could perhaps try not using glob to do the heavy lifting, and instead just compare the file name to a bunch of regular expressions. (Although note - REs have very different syntax)
foreach my $file ( glob('/datafiles/data_one/level_one/*/*') ) {
    # basename must start with DATA; \w* so a file named exactly DATA matches too
    next unless $file =~ m,/DATA\w*$,;
}

Related

Finding and modifying function definitions (C++) via bash-script

Currently I am working on a fairly large project. In order to increase the quality of our code, we decided to enforce the treatment of return values (error codes) for every function. GCC supports a warning concerning the return value of a function; however, the function definition has to be preceded by the following attribute.
static __attribute__((warn_unused_result)) ErrorCode test() { /* code goes here */ }
I want to implement a bash script that parses the entire source code and issues a warning in case the
__attribute__((warn_unused_result))
is missing.
Note that all functions that require this kind of modification return a type called ErrorCode.
Do you think this is possible via a bash script?
Maybe you can use sed with regular expressions. The following worked for me on a couple of test files I tried:
sed -r "s/ErrorCode\s+\w+\s*\(.*\)\s*\{/__attribute__((warn_unused_result)) &/g" test.cpp
If you're not familiar with regex, the pattern basically translates into:
ErrorCode, some whitespace, some alphanumerics (function name), maybe some whitespace, open parenthesis, anything (arguments), close parenthesis, maybe some whitespace, open curly brace.
If this pattern is found, it is prefixed by __attribute__((warn_unused_result)). Note that this only works if you always put the opening curly brace on the same line as the arguments and don't have line breaks in your function declarations.
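If sed's regex flavor varies across your platforms, a roughly equivalent Perl one-liner might work (same caveats about braces and line breaks; -i.bak keeps a backup):
perl -i.bak -pe 's/(ErrorCode\s+\w+\s*\(.*\)\s*\{)/__attribute__((warn_unused_result)) $1/g' test.cpp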
An easy way I could imagine is via ctags. You create a tag file over all your source code, and then parse the tags file. However, I'm not quite sure about the format of the tags file. The variant I'm using here (Exuberant Ctags 5.8) seems to put an "f" in the fourth column, if the tag represents a function. So in this case I would use awk to filter all tags that represent functions, and then grep to throw away all lines without __attribute__((warn_unused_result)).
So, in a nutshell, first you do
$ ctags **/*.c
This creates a file called "tags" in the current directory. The command might also be ctags-exuberant, depending on your variant. The **/*.c is a glob pattern that might work in your shell - if it doesn't, you have to supply your source files in another way (look at the ctags options).
Then you filter the functions:
$ cat tags | awk -F '\t' '$4 == "f" {print $0}' | grep -v "__attribute__((warn_unused_result))"
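The same filter could also be written as a Perl one-liner, assuming as above that the fourth tab-separated column is the kind letter:
perl -F'\t' -lane 'print if @F > 3 && $F[3] eq "f" && !/__attribute__\(\(warn_unused_result\)\)/' tags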
No, it is not possible in the general case. The C++ grammar is the most complex of all the languages I know of, and C++ is not parsable via regular expressions in the general case. You might succeed if you limit yourself to a very narrow set of uses, but I am not sure how feasible it is in your case.
I also do not think the exercise is worth the effort, since sometimes ignoring the result of a function is an OK thing.

Powershell: Read a section of a file into a variable

I'm trying to create a kind of a polyglot script. It's not a true polyglot because it actually requires multiple languages to perform, although it can be "bootstrapped" by either Shell or Batch. I've got this part down no problem.
The part I'm having trouble with is a bit of embedded Powershell code, which needs to be able to load the current file into memory and extract a certain section that is written in yet another language, store it in a variable, and finally pass it into an interpreter. I have an XML-like tagging system that I'm using to mark sections of the file in a way that will hopefully not conflict with any of the other languages. The markers look like this:
lang_a_code
# <{LANGB}>
... code in language B ...
... code in language B ...
... code in language B ...
# <{/LANGB}>
lang_c_code
The #'s are comment markers, but the comment markers can be different things depending on the language of the section.
The problem I have is that I can't seem to find a way to isolate just that section of the file. I can load the entire file into memory, but I can't get the stuff between the tags out. Here is my current code:
@ECHO OFF
SETLOCAL EnableDelayedExpansion
powershell -ExecutionPolicy unrestricted -Command ^
$re = '(?m)^<{LANGB}^>(.*)^<{/LANGB}^>';^
$lang_b_code = ([IO.File]::ReadAllText(^'%0^') -replace $re,'$1');^
echo "${re}";^
echo "Contents: ${lang_b_code}";
Everything I've tried so far results in the entire file being output in the Contents rather than just the code between the markers. I've tried different methods of escaping the symbols used in the markers, but it always results in the same thing.
NOTE: The use of the ^ is required because the top-level interpreter is Batch, which hangs up on the angle brackets and other random things.
Since there is just one block, you can use the regex
$re = '(?s)^<{LANGB}^>(.*)^<{/LANGB}^>';^
but with -match operator, and then access the text using $matches[1] variable that is set as a result of -match.
So, after the regex declaration, use
[IO.File]::ReadAllText(^'%0^') -match $re;^
echo $matches[1];
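Incidentally, .NET regexes use the same (?s) and (?m) modifiers as Perl, so the difference is easy to test outside Batch; a standalone Perl sketch:
my $text = "lang_a_code\n# <{LANGB}>\ncode B\n# <{/LANGB}>\nlang_c_code\n";

# Without /s, '.' will not cross newlines and the match fails;
# with /s it spans the whole block, like (?s) in the pattern above.
if ($text =~ /<\{LANGB\}>(.*)<\{\/LANGB\}>/s) {
    print $1;
}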

extract audio from certain files in working dir in perl

Basically, what I'm trying to do is extract the audio from a set of downloaded YouTube videos, the names of which are (partially) identified in a file (mus.txt) that was opened with the handle TXTFILELIST. TXTFILELIST contains one 11-character identifier for the video on each line (for example, "dQw4w9WgXcQ") and the downloaded file is of the form [title]-[ID].mp4 (in the previous example, "Rick Astley - Never Gonna Give You Up-dQw4w9WgXcQ.mp4").
#snip...
if ($opt_extract_audio) {
    open(TXTFILELIST, "<", "mus.txt") or die $!;
    my @all_dir_files = `dir /b`;
    my $file_to_convert;
    foreach $file_to_convert (<TXTFILELIST>) {
        my @files = grep("/${file_to_convert}\.mp4$/", @all_dir_files); # the problem line!
        print "files: @files\n";
        foreach $file (@files) {
            system("ffmpeg.exe -i ${file} -vn -y -acodec pcm_s16le -ac 2 ${file}.wav");
        }
    }
#snip...
The rest of the snipped code works (I checked it with several videos, replacing vars, commenting, etc.), is legal (I used the strict and warnings pragmas) and, I believe, is irrelevant, because it has nothing to do with defining any vars (besides $opt_extract_audio) used in this snippet. However, this is the one bit of code that's giving me trouble; I can't seem to extract the files that are identified in TXTFILELIST from @all_dir_files. I got the code for 'the problem line' from other Stack Overflow answers, but it isn't working for some reason.
TL;DR What I want to do is this: list all files in the current dir (say the directory contains mus.txt, "Rick Astley - Never Gonna Give You Up-dQw4w9WgXcQ.mp4", and blah.mp4), choose only the identified file(s) (the Rick Astley video) using the 11-char ID in TXTFILELIST (dQw4w9WgXcQ) and extract the audio from it. And yes, I am running this script on Windows, so I can't use *nix utilities like ack or find.
Remove the line
my @all_dir_files = `dir /b`;
And use this loop instead:
for my $file (<*${file_to_convert}.mp4>) {
    say $file;
    system(...);
}
The <...> above is a glob; it can also be written glob "*${file_to_convert}.mp4". I think it is almost always better to use Perl functions rather than rely on system calls.
As has been pointed out, "/${file...$/" is not a regex, but a string. And since you can use expressions with grep, and a non-empty string is always true, your grep will essentially do nothing, and pass all the values into your array.
Get rid of the double quotes around the regular expression in the grep function.
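Putting the pieces together, a minimal sketch of the corrected block (mus.txt and the ffmpeg flags come from the question; the chomp matters because lines read from the file keep their trailing newline, which would break the glob):
use strict;
use warnings;

open my $list, '<', 'mus.txt' or die $!;
while (my $id = <$list>) {
    chomp $id;                          # strip the trailing newline
    for my $file (glob "*${id}.mp4") {
        # quote the name in the command: the titles contain spaces
        system(qq{ffmpeg.exe -i "$file" -vn -y -acodec pcm_s16le -ac 2 "$file.wav"});
    }
}
close $list;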

grep replacement with extensive regular expression implementation

I have been using grepWin for general searching of files, and wingrep when I want to do replacements or what-have-you.
GrepWin has an extensive implementation of regular expressions, but it doesn't do replacements (as mentioned above).
Wingrep does replacements, but has a severely limited regular-expression implementation.
Does anyone know of any (preferably free) grep tools for Windows that do replacement AND have a reasonable implementation of regular expressions?
Thanks in advance.
I think Perl at the command line is the answer you are looking for: widely portable, with powerful regex support.
Let's say that you have the following file:
foo
bar
baz
quux
you can use
perl -pne 's/quux/splat!/' -i /tmp/foo
to produce
foo
bar
baz
splat!
The magic is in Perl's command-line switches:
-e: execute the next argument as a Perl command.
-n: execute the command on every line of input.
-p: print the result of each pass, without issuing an explicit 'print' statement.
-i: make substitutions in place, overwriting the document with the output of your command... use with caution.
I use Cygwin quite a lot for this sort of task.
Unfortunately it has the world's most unintuitive installer, but once it's installed correctly it's very usable... well, apart from a few minor issues with copy and paste and the odd issue with line endings.
The good thing is that all the tools work like on a real GNU system, so if you're already familiar with Linux or similar, you don't have to learn anything new (apart from how to use that crazy installer).
Overall I think the advantages make up for the few usability issues.
If you are on Windows, you can use VBScript (requires no downloads). It comes with regex support, e.g. to change "one" to "ONE":
Set objFS=CreateObject("Scripting.FileSystemObject")
Set WshShell = WScript.CreateObject("WScript.Shell")
Set objArgs = WScript.Arguments
strFile = objArgs(0)
Set objFile = objFS.OpenTextFile(strFile)
strFileContents = objFile.ReadAll
Set objRE = New RegExp
objRE.Global = True
objRE.IgnoreCase = False
objRE.Pattern = "one"
strFileContents = objRE.Replace(strFileContents,"ONE") 'simple replacement
WScript.Echo strFileContents
output
C:\test>type file
one
two one two
three
C:\test>cscript //nologo test.vbs file
ONE
two ONE two
three
You can read the VBScript documentation to learn more about using regex.

Regular Expression to find files with various extensions like ASPX, ASCX, .js, .rpt, .xml

Is there any way to write a regex which can be used to find files with different extensions?
This works in Bash:
find . -regex '.*\.\(pdf\|chm\|doc\)'
Assuming you have a list of files and you are looking for .pdf, .chm and .doc, you can check it with:
\.pdf$|\.chm$|\.doc$
The regex above should work if you check it against individual filenames.
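For example, filtering a list of names with it in Perl (hypothetical @filenames):
my @wanted = grep { /\.pdf$|\.chm$|\.doc$/ } @filenames;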
I'm sure there is, but the question you should be asking is "What's the best way to find files which have specific extensions?".
Regular expressions are not the best answer to every question.
I would suggest just getting a list of all files and passing them into a function like IsThisFileOneIWant(fileName,extensionList). That's far easier than trying to shoehorn the use of regular expressions into your problem.
Something like this should do it:
function IsThisFileOneIWant(fileName, extensionList):
    for each extension in extensionList:
        if fileName.endsWith(extension):
            return true
    return false
Done in pseudo-code since it should be simple enough to turn into any other language.
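For instance, a direct Perl rendering might be:
sub is_this_file_one_i_want {
    my ($file_name, @extension_list) = @_;
    for my $ext (@extension_list) {
        # \Q...\E quotes the dot so '.js' is matched literally
        return 1 if $file_name =~ /\Q$ext\E$/;
    }
    return 0;
}

print is_this_file_one_i_want('default.aspx', qw(.aspx .ascx .js .rpt .xml)) ? "yes\n" : "no\n";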
If you must have a regex, it's going to look something like (based on the values in your question):
"ASPX$|ASCX$|\.js$|\.rpt$|\.xml$"
but it depends entirely on the RE engine that you want to use. For example, here's the output from an egrep command in my work directory:
pax@paxbox1:~/work$ ls -1 | egrep '\.sh$|\.c$'
backup0.sh
backup1.sh
eclipse.sh
monbt.sh
qq.c
qq.sh
xx yy.sh