Powershell: Read a section of a file into a variable - regex

I'm trying to create a kind of a polyglot script. It's not a true polyglot because it actually requires multiple languages to perform, although it can be "bootstrapped" by either Shell or Batch. I've got this part down no problem.
The part I'm having trouble with is a bit of embedded Powershell code, which needs to be able to load the current file into memory and extract a certain section that is written in yet another language, store it in a variable, and finally pass it into an interpreter. I have an XML-like tagging system that I'm using to mark sections of the file in a way that will hopefully not conflict with any of the other languages. The markers look like this:
lang_a_code
# <{LANGB}>
... code in language B ...
... code in language B ...
... code in language B ...
# <{/LANGB}>
lang_c_code
The #'s are comment markers, but the comment markers can be different things depending on the language of the section.
The problem I have is that I can't seem to find a way to isolate just that section of the file. I can load the entire file into memory, but I can't get the stuff between the tags out. Here is my current code:
#ECHO OFF
SETLOCAL EnableDelayedExpansion
powershell -ExecutionPolicy unrestricted -Command ^
$re = '(?m)^<{LANGB}^>(.*)^<{/LANGB}^>';^
$lang_b_code = ([IO.File]::ReadAllText(^'%0^') -replace $re,'$1');^
echo "${re}";^
echo "Contents: ${lang_b_code}";
Everything I've tried so far results in the entire file being output in the Contents rather than just the code between the markers. I've tried different methods of escaping the symbols used in the markers, but it always results in the same thing.
NOTE: The use of the ^ is required because the top-level interpreter is Batch, which hangs up on the angle brackets and other random things.

Since there is just one block, you can use the regex
$re = '(?s)^<{LANGB}^>(.*)^^.*^<{/LANGB}^>';^
but with -match operator, and then access the text using $matches[1] variable that is set as a result of -match.
So, after the regex declaration, use
[IO.File]::ReadAllText(^'%0^') -match $re;^
echo $matches[1];

Related

What is the best way to specify a wildcard or regex in a "test" statement in configure.ac?

I am writing a configure.ac script for gnu autotools. In my code I have some if test statements where I want to set flags based on the compiler name. My original test looked like this:
if test "x$NETCDF_FC" = xifort; then
but sometimes the compiler name is more complicated (e.g., mpifort, mpiifort, path prepended, etc...), and so I want to check if the string ifort is contained anywhere within the variable $NETCDF_FC.
As far as I can understand, to set up a comparison using a wildcard or regex, I cannot use test but instead need to use the double brackets [[ ]]. But when configure.ac is parsed by autoconf to create configure, square brackets are treated like quotes and so one level of them is stripped from the output. The only solution I could get to work is to use triple brackets in my configure.ac, like this:
if [[[ $NETCDF_FC =~ ifort ]]]; then
Am I doing this correctly? Would this be considered best practices for configure.ac or is there another way?
Use a case statement. Either directly as shell code:
case "$NETCDF_FC" in
*ifort*)
do_whatever
;;
*)
do_something_else
;;
esac
or as m4sh code:
AS_CASE([$NETCDF_FC],
[*ifort*], [do_whatever],
[do_something_else])
I would not want to rely on a shell capable of interpreting [[ ]] or [[[ ]]] being present at configure runtime (you need to escape those a bit with [] to have the double or triple brackets make it into configure).
If you need a character class within a case pattern (such as e.g. *[a-z]ifort*), I would advise you to check the generated configure file for the case statement patterns which actually end up being used until you have enough [] quotes added around the pattern in the source configure.ac file.
Note that the explicit case statements often contain # ( shell comments at the end of the lines directly before the ) patterns to avoid editors becoming confused about non-matching opening/closing parentheses.

How to update path in Perl script in multiple files

I am working on creating some training material where I am using perl. One of the things I want to do is have the scripts be set up for the student correctly, regardless of where they extra the compressed files. I am working on a Windows batch file that will copy the perl templates to the working location and then update path in the copy of the perl template files to the correct location. The perl template have this as the first line:
#!_BASE_software/perl/bin/perl.exe
The batch file looks like this:
SET TRAINING=%~dp0
copy %TRAINING%\template\*.pl %TRAINING%work
%TRAINING%software\perl\bin\perl -pi.bak -e 's/_BASE_/%TRAINING%/g' %TRAINING%work\*.pl
I have a few problems with this:
Perl doesn't seem to like the wildcard in the filename
It turns out that %TRAINING% is going to expand into a string with backslashes which need to be converted into forwardslashes and needs to be escaped within the regex.
How do I fix this?
First of all, Windows doesn't use the shebang line, so I'm not sure why you're doing any of this work in the first place.
Perl will read the shebang line and look for options if perl is found in the path, even on Windows, but that means that #!perl is sufficient if you want to pass options via the shebang line (e.g. #!perl -n).
Now, it's possible that you use Cygwin, MSYS or some other unix emulation instead of Windows to run the program, but you are placing a Windows path in the shebang line (C:...) rather than a unix path, so that doesn't make sense either.
There are three additional problems with the attempt:
cmd uses double-quotes for quoting.
cmd doesn't perform wildcard expansion like sh, so it's up to your program do it.
You are trying to generate Perl code from cmd. ouch.
If we go ahead, we get:
"%TRAINING%software\perl\bin\perl" -MFile::DosGlob=glob -pe"BEGIN { #ARGV = map glob, #ARGV; $base = $ENV{TRAINING} =~ s{\\}{/}rg } s/_BASE_/$base/g" -i.bak -- %TRAINING%work\*.pl
If we add line breaks for readability, we get the following (that cmd won't accept):
"%TRAINING%software\perl\bin\perl"
-MFile::DosGlob=glob
-pe"
BEGIN {
#ARGV = map glob, #ARGV;
$base = $ENV{TRAINING} =~ s{\\}{/}rg
}
s/_BASE_/$base/g
"
-i.bak -- %TRAINING%work\*.pl

How to run list of perl regex from file in terminal

I'm fairly new to the whole coding game, and am very grateful for every answer!
I am working on a directory with many .txt files in them and have a file with looong list of regex like "perl -p -i -e 's/\n\n/\n/g' *.xml" they all work if I copy them to terminal. But is there a possibility to run them straight from the file?
I tried ./unicode.sh but that resulted in:
No such file or directory.
Any ideas?
Thank you so much!
Here's a (mostly) equivalent Perl script to the oneliner perl -p -i -e 's/\n\n/\n/g' *.xml (one main difference being that this has strict and warnings enabled, which is strongly recommended), which you could expand upon by putting more code to modify the current line in the body of the while loop.
#!/usr/bin/env perl
use warnings;
use strict;
if (!#ARGV) { # if no files on command line
#ARGV = glob('*.xml'); # get a default list of files
}
local $^I = ''; # enable inplace editing (like perl -i)
while (<>) { # read each line of each file into $_
s/\n\n/\n/g; # modify $_ with a regex
# more regexes here...
print; # write the line $_ back out
}
You can save this script in a file such as process.pl, and then run it with perl process.pl, or do chmod u+x process.pl and then run it via ./process.pl.
On the other hand, you really shouldn't modify XML files with regular expressions, there are lots of Perl modules to do XML processing - I wrote about that some more here. Also, in the example you showed, s/\n\n/\n/g actually won't have any effect, since when reading files line-by-line, no string will contain two \n's (you can change how Perl reads files, but I don't see any mention of that in the question).
Edit: You've named the script in your example unicode.sh - if you're processing Unicode files, then Perl has very powerful features to help with that, although the code won't necessarily end up as nice and short as I've showed above. You'll have to tell us some more about what you're doing, and show some example input and output, to get suggestions about that. See also e.g. perlunitut.
It's likely if you got no such file or directory, your problem was you forgot to make unicode.sh executable, as in chmod +x unicode.sh, assuming that's a script that you wrote.
Of course the normal way to run multiple perl commands is this thing that looks like runme.pl which you write, i.e., a perl script.
That said, yes, everything will work from the terminal, you just need to be careful about escaping that bash performs.

Finding and modifying function definitions (C++) via bash-script

Currently I am working on a fairly large project. In order to increase the quality of our code, we decided to enforce the treatement of return values (Error Codes) for every function. GCC supports a warning concerning the return value of a function, however the function definition has to be preceeded by the following flag.
static __attribute__((warn_unused_result)) ErrorCode test() { /* code goes here */ }
I want to implement a bashscript that parses the entire source code and issues a warning in case the
__attribute__((warn_unused_result))
is missing.
Note that all functions that require this kind of modification return a type called ErrorCode.
Do you think this is possible via a bash script ?
Maybe you can use sed with regular expressions. The following worked for me on a couple of test files I tried:
sed -r "s/ErrorCode\s+\w+\s*(.*)\s*\{/__attribute__((warn_unused_result)) \0/g" test.cpp
If you're not familiar with regex, the pattern basically translates into:
ErrorCode, some whitespace, some alphanumerics (function name), maybe some whitespace, open parenthesis, anything (arguments), close parenthesis, maybe some whitespace, open curly brace.
If this pattern is found, it is prefixed by __attribute__((warn_unused_result)). Note that this only works if you are putting the open curly brace always in the same line as the arguments and you don't have line breaks in your function declarations.
An easy way I could imagine is via ctags. You create a tag file over all your source code, and then parse the tags file. However, I'm not quite sure about the format of the tags file. The variant I'm using here (Exuberant Ctags 5.8) seems to put an "f" in the fourth column, if the tag represents a function. So in this case I would use awk to filter all tags that represent functions, and then grep to throw away all lines without __attribute__((warn_unused_result)).
So, in a nutshell, first you do
$ ctags **/*.c
This creates a file called "tags" in the current directory. The command might also be ctags-exuberant, depending on your variant. The **/*.c is a glob pattern that might work in your shell - if it doesn't, you have to supply your source files in another way (look at the ctagsoptions).
Then you filter the funktions:
$ cat tags | awk -F '\t' '$4 == "f" {print $0}' | grep -v "__attribute__((warn_unused_result))"
No, it is not possible in the general case. The C++ grammar is the most complex of all the languages I know of, and C++ is not parsable via regular expressions in the general case. You might succeed if you limit yourself to a very narrow set of uses, but I am not sure how feasible it is in your case.
I also do not think the excersise is worth the effort, since sometimes ignoring the result of the function is an OK thing.

extract audio from certain files in working dir in perl

Basically, what I'm trying to do is extract the audio from a set of downloaded YouTube videos, the names of which are (partially) identified in a file (mus.txt) that was opened with the handle TXTFILELIST. TXTFILELIST contains one 11-character identifier for the video on each line (for example, "dQw4w9WgXcQ") and the downloaded file is of the form [title]-[ID].mp4 (in the previous example, "Rick Astley - Never Gonna Give You Up-dQw4w9WgXcQ.mp4").
#snip...
if ($opt_extract_audio) {
open(TXTFILELIST, "<", "mus.txt") or die $!;
my #all_dir_files = `dir /b`;
my $file_to_convert;
foreach $file_to_convert (<TXTFILELIST>) {
my #files = grep("/${file_to_convert}\.mp4$/", #all_dir_files); #the problem line!
print "files: #files\n";
foreach $file (#files) {
system("ffmpeg.exe -i ${file} -vn -y -acodec pcm_s16le -ac 2 ${file}.wav");
}
}
#snip...
The rest of the snipped code works (I checked it with several videos, replacing vars, commenting, etc.), is legal (I used the strict and warnings pragmas) and, I believe, is irrelevant, because it has nothing to do with defining any vars (besides $opt_extract_audio) used in this snippet. However, this is the one bit of code that's giving me trouble; I can't seem to extract the files that are identified in TXTFILELIST from #all_dir_files. I got the code for 'the problem line' from other Stack Overflow answerers, but it isn't working for some reason.
TL;DR What I want to do is this: list all files in the current dir (say the directory contains mus.txt, "Rick Astley - Never Gonna Give You Up-dQw4w9WgXcQ.mp4", and blah.mp4), choose only the identified file(s) (the Rick Astley video) using the 11-char ID in TXTFILELIST (dQw4w9WgXcQ) and extract the audio from it. And yes, I am running this script on Windows, so I can't use *nix utilities like ack or find.
Remove the line
my #all_dir_files = `dir /b`;
And use this loop instead:
for my $file (<*${file_to_convert}.mp4>) {
say $file;
system(...);
}
The <...> above is a glob, can also be written glob "${file_to_convert}.mp4". I think it is almost always better to use perl functions rather than rely on system calls.
As has been pointed out, "/${file...$/" is not a regex, but a string. And since you can use expressions with grep, and a non-empty string is always true, your grep will essentially do nothing, and pass all the values into your array.
Get rid of the double quotes around the regular expression in the grep function.