We know that eval can do evil things, like execute arbitrary code.
I need a way in Bash to identify where an environment variable is referenced within a string of characters, and ultimately replace it with its actual value.
e.g. This is a very simple example of something much more complex.
File "x.dat" contains:
$MYDIR/file.txt
Environment
export MYDIR=/tmp/somefolder
Script "x.sh"
...
fileToProcess=$(cat x.dat)
realFileToProcess=$(eval echo $fileToProcess)
echo $realFileToProcess
...
Keep in mind that referenced environment variables in a string can also be:
${MYDIR}_txt
$MYDIR-txt
${MYDIR:0:3}:txt
${MYDIR:5}.txt
Not an answer yet, but some remarks about the question.
It seems what you need is not variable expansion but token replacement in a template; depending on the use case, printf may be sufficient.
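For illustration, a minimal sketch of that printf approach, assuming the template file can be rewritten to use a %s placeholder instead of a literal $MYDIR:
# x.dat would contain "%s/file.txt" instead of "$MYDIR/file.txt"
template=$(cat x.dat)
printf -v realFileToProcess "$template" "$MYDIR"
echo "$realFileToProcess"    # -> /tmp/somefolder/file.txt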
Variable expansion also depends on context; for example, the following are not expanded:
# single quotes
echo '${MYDIR}'
# ANSI-C quotes
echo $'${MYDIR}'
# heredoc with end marker enclosed between single quotes
cat << 'END'
${MYDIR}
END
It should also be noted that a variable expansion may execute arbitrary code:
echo ${X[`echo hi >&2`]}
Related
I have some scripts that use various variables, and I want to grep (from within bash on FreeBSD) each of them for the list of variables that are used by the script. These are not shell scripts, but the syntax of referencing a variable is similar to that used in bash. Specifically, a reference to a variable can be like:
$X, or
${X}
and the name of the variable ("X" in this case) can include alphanumerics and underscores. At this point I want to explicitly note that I imagine that bash itself probably has a more complicated set of possible ways to reference variables, but if so, I do not care about that for the purposes of this question.
I would like to find all variable names that are so referenced in a given file - just the name, not the entire line. So something of the form awesomegrepcmd filename | sort | uniq, or something like that.
Note that there may be multiple different variables referenced on a single line.
I have been fighting regex and escaping rules via largely ignorant stab-in-the-dark attempts (e.g. "OK, maybe I need to put TWO backslashes here, but only ONE there") for a while now, and googling for others who have done this sort of thing, but I have thus far been unable to accomplish what I want. How can this be done? Thanks.
EDIT: Example: Let's say the content of a file is this (just pseudocode, not meant as bash or anything):
SPECIFICS={ "be born", "die", "plant", "reap", "kill", "heal", "laugh", "weep" }
CHORUS_PREFIX="To every thing"
CHORUS_INFIX="there is a season"
CHORUS_SUFFIX="and a time to every purpose under heaven."
CHORUS_SEPARATOR=", turn, turn, turn, "
CHORUS = $CHORUS_PREFIX + ${CHORUS_SEPARATOR} + $CHORUS_INFIX + $CHORUS_SEPARATOR + ${CHORUS_SUFFIX}
SPECIFIC_PREFIX="A time to "
UNUSED=${SOMETHING}
echo $CHORUS
foreach SPECIFIC in $SPECIFICS {
echo $SPECIFIC_PREFIX + " " + ${SPECIFIC}
}
echo ${CHORUS}
Then the output I would want would be:
CHORUS
CHORUS_INFIX
CHORUS_PREFIX
CHORUS_SEPARATOR
CHORUS_SUFFIX
SOMETHING
SPECIFIC
SPECIFIC_PREFIX
SPECIFICS
A couple of things to note:
${CHORUS} and $CHORUS (for example) both refer to the same variable, named CHORUS
The UNUSED variable is not in the output, because it is not actually used in the script (despite having been defined in it)
The SOMETHING variable, used only to define the unused UNUSED variable, is in the output, since it was used
The various CHORUS_xxx variables are only used in the one single line defining the CHORUS variable, but they are all present in the output.
Using an Extended regular expression:
me@lappy386:/tmp$ grep -Eoi '\$(\{[a-z0-9_]+\}|[a-z0-9_]+)' \
/tmp/example |tr -d '{}$'|sort|uniq
CHORUS
CHORUS_INFIX
CHORUS_PREFIX
CHORUS_SEPARATOR
CHORUS_SUFFIX
SOMETHING
SPECIFIC
SPECIFIC_PREFIX
SPECIFICS
Using grep and sed
grep -oE '\$[A-Za-z_]+|\${[A-Za-z_]+}' inputFile| sed -r 's/[${}]//g' | sort | uniq
Example :
$ grep -oE '\$[A-Za-z_]+|\${[A-Za-z_]+}' inputFile | sed -r 's/[${}]//g' | sort | uniq
CHORUS
CHORUS_INFIX
CHORUS_PREFIX
CHORUS_SEPARATOR
CHORUS_SUFFIX
SOMETHING
SPECIFIC
SPECIFICS
SPECIFIC_PREFIX
I am writing a configure.ac script for gnu autotools. In my code I have some if test statements where I want to set flags based on the compiler name. My original test looked like this:
if test "x$NETCDF_FC" = xifort; then
but sometimes the compiler name is more complicated (e.g., mpifort, mpiifort, path prepended, etc...), and so I want to check if the string ifort is contained anywhere within the variable $NETCDF_FC.
As far as I can understand, to set up a comparison using a wildcard or regex, I cannot use test but instead need to use the double brackets [[ ]]. But when configure.ac is parsed by autoconf to create configure, square brackets are treated like quotes and so one level of them is stripped from the output. The only solution I could get to work is to use triple brackets in my configure.ac, like this:
if [[[ $NETCDF_FC =~ ifort ]]]; then
Am I doing this correctly? Would this be considered best practices for configure.ac or is there another way?
Use a case statement. Either directly as shell code:
case "$NETCDF_FC" in
*ifort*)
do_whatever
;;
*)
do_something_else
;;
esac
or as m4sh code:
AS_CASE([$NETCDF_FC],
[*ifort*], [do_whatever],
[do_something_else])
I would not want to rely on a shell capable of interpreting [[ ]] or [[[ ]]] being present at configure run time (and you need to escape those a bit with [] to get the double or triple brackets into configure at all).
If you need a character class within a case pattern (e.g. *[a-z]ifort*), I would advise checking the generated configure file to see which case patterns actually end up being used, and adding [] quotes around the pattern in the source configure.ac until the output is what you intend.
Note that explicit case statements often contain # ( shell comments at the end of the lines directly before the ) patterns, to keep editors from getting confused by seemingly unmatched opening/closing parentheses.
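For illustration, this is roughly what that convention looks like as plain shell code (the pattern and commands are the example ones from above):
case "$NETCDF_FC" in #(
  *[a-z]ifort*)
    do_whatever
    ;; #(
  *)
    do_something_else
    ;;
esac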
Currently I am working on a fairly large project. In order to increase the quality of our code, we decided to enforce the treatment of return values (error codes) for every function. GCC supports a warning concerning the return value of a function; however, the function definition has to be preceded by the following attribute.
static __attribute__((warn_unused_result)) ErrorCode test() { /* code goes here */ }
I want to implement a bash script that parses the entire source code and issues a warning in case the
__attribute__((warn_unused_result))
is missing.
Note that all functions that require this kind of modification return a type called ErrorCode.
Do you think this is possible via a bash script ?
Maybe you can use sed with regular expressions. The following worked for me on a couple of test files I tried:
sed -r "s/ErrorCode\s+\w+\s*(.*)\s*\{/__attribute__((warn_unused_result)) \0/g" test.cpp
If you're not familiar with regex, the pattern basically translates into:
ErrorCode, some whitespace, some alphanumerics (function name), maybe some whitespace, open parenthesis, anything (arguments), close parenthesis, maybe some whitespace, open curly brace.
If this pattern is found, it is prefixed by __attribute__((warn_unused_result)). Note that this only works if you always put the opening curly brace on the same line as the arguments and you don't have line breaks in your function declarations.
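If the goal is only to warn rather than rewrite in place, a similar pattern can drive grep instead. A rough sketch, assuming the sources live under src/ (and accepting that an attribute placed on the preceding line will still be reported):
grep -rnE 'ErrorCode[[:space:]]+[[:alnum:]_]+[[:space:]]*\(' src/ |
  grep -v 'warn_unused_result' |
  while IFS= read -r line; do
    echo "warning: possibly missing warn_unused_result: $line" >&2
  done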
An easy way I could imagine is via ctags. You create a tag file over all your source code, and then parse the tags file. However, I'm not quite sure about the format of the tags file. The variant I'm using here (Exuberant Ctags 5.8) seems to put an "f" in the fourth column, if the tag represents a function. So in this case I would use awk to filter all tags that represent functions, and then grep to throw away all lines without __attribute__((warn_unused_result)).
So, in a nutshell, first you do
$ ctags **/*.c
This creates a file called "tags" in the current directory. The command might also be ctags-exuberant, depending on your variant. The **/*.c is a glob pattern that might work in your shell - if it doesn't, you have to supply your source files in another way (look at the ctags options).
Then you filter the functions:
$ cat tags | awk -F '\t' '$4 == "f" {print $0}' | grep -v "__attribute__((warn_unused_result))"
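A hedged variant of the same idea that prints only the names of functions whose tag line lacks the attribute (assuming the tab-separated Exuberant Ctags layout described above):
awk -F '\t' '$4 == "f" && $0 !~ /warn_unused_result/ {print $1}' tags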
No, it is not possible in the general case. The C++ grammar is the most complex of all the languages I know of, and C++ is not parsable via regular expressions in the general case. You might succeed if you limit yourself to a very narrow set of uses, but I am not sure how feasible it is in your case.
I also do not think the exercise is worth the effort, since sometimes ignoring the result of a function is perfectly fine.
I'm trying to create a kind of a polyglot script. It's not a true polyglot because it actually requires multiple languages to perform, although it can be "bootstrapped" by either Shell or Batch. I've got this part down no problem.
The part I'm having trouble with is a bit of embedded Powershell code, which needs to be able to load the current file into memory and extract a certain section that is written in yet another language, store it in a variable, and finally pass it into an interpreter. I have an XML-like tagging system that I'm using to mark sections of the file in a way that will hopefully not conflict with any of the other languages. The markers look like this:
lang_a_code
# <{LANGB}>
... code in language B ...
... code in language B ...
... code in language B ...
# <{/LANGB}>
lang_c_code
The #'s are comment markers, but the comment markers can be different things depending on the language of the section.
The problem I have is that I can't seem to find a way to isolate just that section of the file. I can load the entire file into memory, but I can't get the stuff between the tags out. Here is my current code:
@ECHO OFF
SETLOCAL EnableDelayedExpansion
powershell -ExecutionPolicy unrestricted -Command ^
$re = '(?m)^<{LANGB}^>(.*)^<{/LANGB}^>';^
$lang_b_code = ([IO.File]::ReadAllText(^'%0^') -replace $re,'$1');^
echo "${re}";^
echo "Contents: ${lang_b_code}";
Everything I've tried so far results in the entire file being output in the Contents rather than just the code between the markers. I've tried different methods of escaping the symbols used in the markers, but it always results in the same thing.
NOTE: The use of the ^ is required because the top-level interpreter is Batch, which hangs up on the angle brackets and other random things.
Since there is just one block, you can use the regex
$re = '(?s)^<{LANGB}^>(.*)^^.*^<{/LANGB}^>';^
but with the -match operator, and then access the text using the $matches[1] variable that is set as a result of -match.
So, after the regex declaration, use
[IO.File]::ReadAllText(^'%0^') -match $re;^
echo $matches[1];
I wrote a short bash script that is supposed to strip the leading tabs/spaces from a string:
#!/bin/bash
RGX='s/^[ \t]*//'
SED="sed '$RGX'"
echo " string" | $SED
It works from the command line, but the script gets this error:
sed: -e expression #1, char 1: unknown command: `''
My guess is that something is wrong with the quotes, but I'm not sure what.
Putting commands into variables and getting them back out intact is hard, because quoting doesn't work the way you expect (see BashFAQ #050, "I'm trying to put a command in a variable, but the complex cases always fail!"). There are several ways to deal with this:
1) Don't do it unless you really need to. Seriously, unless you have a good reason to put your command in a variable first, just execute it and don't deal with this messiness.
2) Don't use eval unless you really really really need to. eval has a well-deserved reputation as a source of nasty and obscure bugs. They can be avoided if you understand them well enough and take the necessary precautions to avert them, but this should really be a last resort.
3) If you really must define a command at one point and use it later, either define it as a function or an array. Here's how to do it with a function:
RGX='s/^[ \t]*//'
SEDCMD() { sed "$RGX"; }
echo " string" | SEDCMD
Here's the array version:
RGX='s/^[ \t]*//'
SEDCMD=(sed "$RGX")
echo " string" | "${SEDCMD[#]}"
The idiom "${SEDCMD[#]}" lets you expand an array, keeping each element a separate word, without any of the problems you're having.
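A quick illustration of that behavior with a throwaway array (unrelated to sed; it just shows where the word boundaries end up):
args=("hello world" "foo")
printf '<%s>\n' "${args[@]}"   # quoted [@]: two words -> <hello world> <foo>
printf '<%s>\n' ${args[@]}     # unquoted: splits into <hello> <world> <foo>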
It does. Try:
#!/bin/bash
RGX='s/^[ \t]*//'
#SED='$RGX'
echo " string" | sed "$RGX"
This works.
The issue you have is with quotes and spaces. Double quoted strings are passed as single arguments.
Add set -x to your script; you'll see that variables within single quotes are not expanded.
To expand on my comment above:
#!/bin/bash
RGX='s/^[[:space:]]+//'
SED="sed -r '$RGX'"
eval "printf \" \tstring\n\" | $SED"
Note that this also makes your regex an extended one, for no particular reason. :-)