Bash: Pass all arguments exactly as they are to a function and prepend a flag on each of them - regex

This seems like a relatively basic question, but I can't find it anywhere after an hour of searching. Many (there are a lot!) of the similar questions do not seem to hit the point.
I am writing a script ("vims") to use vim in a sed-like mode (so I can call normal vim commands on a stream input without actually opening vim), so I need to pass each argument to vim with a "-c" flag prepended to it. There are also many characters that need to be escaped (I need to pass regex expressions), so some of the usual methods on SO do not work.
Basically, when I write:
cat myfile.txt | vims ':%g/foo/exe "norm yyPImyfile: \<esc>\$dF,"' ':3p'
which are two command-line vim arguments to run on stdout,
I need these two single-quoted arguments to be passed exactly the way they are to my function vims(), which then tags each of them with a -c flag, so they are interpreted as commands in vim.
Here's what I've tried so far:
vims() {
    vim - -nes -u NONE -c "$1" -c ':q!' | tail -n +2
}
This seems to work perfectly for a single command. No characters get escaped, and the "-c" flag is there.
Then, using the "$@" trick from the oft-duplicated question and answer, I tried:
vims() {
    vim - -nes -u NONE $(for arg in "$@"; do echo -n " -c $arg "; done) -c ':q!' | tail -n +2
}
This seems to break the spaces within each string I pass it, so does not work. I also tried a few variations of the printf command, as suggested in other questions, but this has weird interactions with the vim command sequences. I've tried many other different backslash-quote-combinations in a perpetual edit-test loop, but have always found a quirk in my method.
What is the command sequence I am missing?

Add all the arguments to an array one at a time, then pass the entire array to vim with proper quoting to ensure whitespace is correctly preserved.
vims() {
    local args=()
    while (($# > 0)); do
        args+=(-c "$1")
        shift
    done
    vim - -nes -u NONE "${args[@]}" -c ':q!' | tail -n +2
}
As a rule of thumb, if you find yourself trying to escape things, add backslashes, use printf, etc., you are likely going down the wrong path. Careful use of quoting and arrays will cover most scenarios.
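A minimal sketch of the difference, assuming bash. The helper names show_args, with_array and with_substitution are hypothetical, and printf stands in for vim so the resulting argument list can be inspected:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each received argument on its own line, in brackets.
show_args() {
    printf '[%s]\n' "$@"
}

# Array approach: every original argument stays one word, preceded by -c.
with_array() {
    local args=()
    local arg
    for arg in "$@"; do
        args+=(-c "$arg")
    done
    show_args "${args[@]}"
}

# Command-substitution approach from the question: the unquoted result
# is split again on whitespace, so "a b" falls apart into two words.
with_substitution() {
    local arg
    show_args $(for arg in "$@"; do printf ' -c %s ' "$arg"; done)
}

with_array ':%s/a b/c/' ':3p'
with_substitution ':%s/a b/c/' ':3p'
```

The first call prints four bracketed words, with `:%s/a b/c/` intact; the second prints five, because the space inside the first command has become a word boundary.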

Related

Sed replacing only part of a longer match with a shorter replacement:

So I'm measuring the total elapsed time of a C program. To do so, I have been running a shell script that uses sed to replace the value of a constant (below: N) defined somewhere in the middle of a line in my C program.
#define N 10 // This constant will be incremented by shell program
Before you tell me that I should be using a variable and timing the function that uses it, I have to time the whole execution of the program externally on a single run (meaning no reassignment of N).
I've been using the following in a shell script to help out:
tmp=$(sed "11s/[0-9][0-9]*/$INCREMENTINGVAR/" myprogram.c); printf "%s" "$tmp" > myprogram.c
That replaces a 3-digit number with whatever my INCREMENTINGVAR (the replacement) is. However, this doesn't seem to work properly for me when the replacement is 2 digits long. Sed replaces only the first two characters and leaves the 3rd digit from the previous run in place.
TESTS=0
while [ $TESTS -lt 3 ]
do
echo "This is test: $TESTS"
INCREMENTINGVAR=10
while [ "$INCREMENTINGVAR" -lt 10 ]
do
tmp=$(sed "11s/[0-9][0-9]*/$INCREMENTINGVAR/" myprogram.c); printf "%s" "$tmp" > myprogram.c
rm -f myprog.c.bak
echo "$INCREMENTINGVAR"
gcc myprogram.c -o myprogram.out; ./myprogram.out
INCREMENTINGVAR=$((INCREMENTINGVAR+5))
done
TESTS=$((TESTS+1))
done
Is there something I should do instead?
edit: Added whole shell script; Changed pattern for sed.
Do you simply want to replace whatever digit string is on line 11 with the new value? If so, you'd write:
sed -e "11s/[0-9][0-9]*/$INCREMENTINGVAR/"
That looks for a sequence of one or more digits, and replaces it with the current value in $INCREMENTINGVAR. This will roll over from 9 to 10, from 99 to 100, from 999 to 1000, etc. Indeed, there's nothing to stop you jumping from 1 to 987,654 if that's what you want to do.
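As a quick check that the width of the old and new values does not matter, here is a small sketch. The helper set_n is hypothetical; it reads from standard input and edits line 1 rather than line 11 so it can be tried without a source file:

```shell
# Hypothetical helper: replace the first run of digits on line 1 of stdin with $1.
set_n() {
    sed -e "1s/[0-9][0-9]*/$1/"
}

# Three digits replaced by two, then two digits replaced by four.
printf '#define N 100 // incremented by the script\n' | set_n 15
printf '#define N 15 // incremented by the script\n' | set_n 1000
```

Both lines come out with the whole old number replaced, since `[0-9][0-9]*` consumes every adjacent digit.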
With the GNU and BSD (Mac OS X) versions of sed, you could overwrite the file automatically. The portable way (meaning, works the same with both GNU and BSD variants of sed), is:
sed -i.bak -e "11s/[0-9][0-9]*/$INCREMENTINGVAR/" myprog.c
rm -f myprog.c.bak
This creates a backup file (and removes it). The problem is that GNU sed requires just -i and BSD sed requires -i '' (two arguments) to do an in situ change without a backup. You can decide that portability is not relevant.
Note that using line number to identify what must be changed is delicate; trivial changes (a new header, more commentary) could change the line number. It would probably be better to use a context search:
sed -i.bak -e "/^#define N [0-9]/ s/[0-9][0-9]*/$INCREMENTINGVAR/" myprog.c
rm -f myprog.c.bak
This assumes spaces between define and N and the number. If you might have blanks or tabs in it, then you might write:
sed -i.bak -e "/^[[:space:]]*#[[:space:]]*define[[:space:]]\{1,\}N[[:space:]]\{1,\}[0-9]/ s/[0-9][0-9]*/$INCREMENTINGVAR/" myprog.c
rm -f myprog.c.bak
That looks for optional leading white space before the #, optional white space between the # and the define, mandatory white space (at least one, possibly many) between define and N, and mandatory white space again between N and the first digit of the number. But probably your input isn't that sloppy and a simpler search pattern (like the first option) is sufficient to meet your needs. You could also write code to normalize eccentrically formatted #define lines into a canonical representation — but again, you most probably don't need to.
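A sketch of the whitespace-tolerant search, run against standard input instead of editing a file in place. The helper name set_define_n and the sample lines are made up for illustration:

```shell
# Hypothetical helper: rewrite the value on any "#define N <digits>" line of stdin.
# Whitespace is optional before '#' and after it, and mandatory (one or more)
# both between 'define' and 'N' and between 'N' and the number.
set_define_n() {
    sed -e "/^[[:space:]]*#[[:space:]]*define[[:space:]]\{1,\}N[[:space:]]\{1,\}[0-9]/ s/[0-9][0-9]*/$1/"
}

# Only the first line defines N; NMAX and M must be left alone.
printf '  #\tdefine   N\t10 // comment\n#define NMAX 10\n#define M 7\n' | set_define_n 25
```

Requiring whitespace after N is what keeps names such as NMAX from matching.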
If you have somewhere else in the same file that contains something like this:
#undef N
#define N 100000
you would have to worry about the pattern matching this line too. However, few files do that; it isn't likely to be a problem in practice (and if it is, the code in general probably has more problems than can be dealt with here). One possibility would be to limit the range to the first 30 lines, assuming the first #define N 123 is somewhere in that range and the second is not.
sed -i.bak -e "1,30 { /^[[:space:]]*#[[:space:]]*define[[:space:]]\{1,\}N[[:space:]]\{1,\}[0-9]/ s/[0-9][0-9]*/$INCREMENTINGVAR/; }" myprog.c
rm -f myprog.c.bak
There are multiple other tricks that could be pulled to limit the damage, with varying degrees of verbosity. For example:
sed -i.bak -e "1,/^[[:space:]]*#[[:space:]]*define[[:space:]]\{1,\}N[[:space:]]\{1,\}[0-9]\{1,\}/ \
s/^[[:space:]]*#[[:space:]]*define[[:space:]]\{1,\}N[[:space:]]\{1,\}[0-9]\{1,\}/#define N $INCREMENTINGVAR/" myprog.c
rm -f myprog.c.bak
Working with regexes is generally a judgement call between specificity and verbosity — you can make things incredibly safe but incredibly difficult to read, or you can run a small risk that your more readable code will match something unintended.

grep not matching strings when they come from a variable

I'm writing a script that is helping me process log files. In it, I have my grep flags stored in a variable. The flags and strings themselves work just fine, but when I pass them to grep using a variable, the parts of the string that use escaped characters don't produce any matches. See below:
grepvars="-B4 -Psihe 'caused\sby|unable|fault|error|deadlock|checkpoint|corrupt|fail|exception|fatal|severe|\tat\s'"
grep -B4 -Psihe 'caused\sby|unable|fault|error|deadlock|checkpoint|corrupt|fail|exception|fatal|severe|\tat\s' adapter_15.log > adapter_15-error1.log
grep $grepvars adapter_15.log > adapter_15-error2.log
wc -l *-error?.log
51398 adapter_15-error1.log
25032 adapter_15-error2.log
As you can see, the \tat\s part does not produce matches when passed through a variable to grep. What that is supposed to match is a (literal tab)at(literal space). Although this works correctly without using a variable, I'd rather use one since it makes my multiple grep calls easier to manage. What do I have to do to ensure that grep will perform this match correctly when passed through a variable?
After not having any sort of luck with this, I found a workaround: create a function and call it when needed. Here's what I came up with:
grep4j () {
unset IFS
nice -n 15 grep -B3 -Psihe '\tat\s|caused\sby|unable|fault|error|deadlock|checkpoint|corrupt|fail|exception|fatal|severe' $1
IFS=$'\n'
}
Yes, I did try unsetting IFS before and after the grep calls that were using the variable. It didn't work (and I need it to be set for other things to work). Doing the function like this met my needs, and maybe it will help someone else as well. Cheers!
In case you're curious, this is designed to get relevant messages out of log4j-formatted logs. It saves me a lot of time.
If you're storing all the options of grep in a string then I guess you need to use evil eval:
str="grep $grepvars adapter_15.log > adapter_15-error2.log"
eval "$str"
It may be easier to stuff options into the environment variable GREP_OPTIONS (note that GNU grep has since deprecated this variable, and recent releases ignore it), and patterns into a file, like so:
grep -f <file-with-patterns> ...
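Another option that avoids eval is to keep the options in a bash array. The sketch below is an illustration only: the sample log lines and variable names are made up, and -E with [[:space:]] is used in place of -P to keep it portable. It also shows what goes wrong with the string: the single quotes stored inside it are literal characters, so they become part of the first and last alternatives of the pattern.

```shell
#!/usr/bin/env bash
# Options and pattern as separate array elements; the quotes are shell syntax
# here and never reach grep.
grep_opts=(-i -E -e 'caused[[:space:]]by|error|fatal')

# The same thing as a string: the quotes are now part of the pattern text.
grep_string="-i -E -e 'caused[[:space:]]by|error|fatal'"

sample=$'Caused by: timeout\nan error occurred\nFATAL: disk full\nall good'

# Array: all three alternatives work.
printf '%s\n' "$sample" | grep -c "${grep_opts[@]}"

# String (deliberately unquoted): only the middle alternative still matches,
# because the outer ones now require a literal quote character.
printf '%s\n' "$sample" | grep -c $grep_string
```

With this sample the first count is 3 and the second is 1, the same kind of shortfall as in the question.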

Unpredictable behavior in sed interpreters output from multiple expressions

Why does GNU sed sometimes handle substitution with piped output into another sed instance differently than when multiple expressions are used with the same one?
Specifically, for msys/mingw sessions, in the /etc/profile script I have a series of manipulations that "rearrange" the order of the environment variable PATH and removes duplicate entries.
Take note that while normally sed treats each line of input separately (and therefore can't easily substitute '\n' in the input stream), this sed statement does a substitution of ':' with '\n', so it still handles the entire input stream as one line (with '\n' characters in it). This behavior stays true for all sed expressions in the same instance of sed (basically until you redirect or pipe the output into another program).
Here's the obligatory specs:
Windows 7 Professional Service Pack 1
HP Pavilion dv7-6b78us
16 GB DDR3 RAM
MinGW-w64 (x86_64-w64-mingw32-gcc-4.7.1.2-release-win64-rubenvb) mounted on /mingw/
MSYS (20111123) mounted on / and on /usr/
$ uname -a="MINGW32_NT-6.1 CHRIV-L09 1.0.17(0.48/3/2) 2011-04-24 23:39 i686 Msys"
$ which sed="/bin/sed.exe" (it's part of MSYS)
$ sed --version="GNU sed version 4.2.1"
This is the contents of PATH before manipulation:
PATH='.:/usr/local/bin:/mingw/bin:/bin:/c/PHP:/c/Program Files (x86)/HP SimplePass 2011/x64:/c/Program Files (x86)/HP SimplePass 2011:/c/Windows/system32:/c/Windows:/c/Windows/System32/Wbem:/c/Windows/System32/WindowsPowerShell/v1.0:/c/si:/c/android-sdk:/c/android-sdk/tools:/c/android-sdk/platform-tools:/c/Program Files (x86)/WinMerge:/c/ntp/bin:/c/GnuWin32/bin:/c/Program Files/MySQL/MySQL Server5.5/bin:/c/Program Files (x86)/WinSCP:/c/Program Files (x86)/Overlook Fing 2.1/bin:/c/Program Files/7-zip:.:/c/Program Files/TortoiseGit/bin:/c/Program Files (x86)/Git/bin:/c/VS10/VC/bin/x86_amd64:/c/VS10/VC/bin/amd64:/c/VS10/VC/bin'
This is an excerpt of /etc/profile (where I have begun the PATH manipulation):
set | grep --color=never ^PATH= | sed -e "s#^PATH=##" -e "s#'##g" \
-e "s/:/\n/g" -e "s#\n\(/[^\n]*tortoisegit[^\n]*\)#\nZ95-\1#ig" \
-e "s#\n\(/[a-z]/win\)#\nZ90-\1#ig" -e "s#\n\(/[a-z]/p\)#\nZ70-\1#ig" \
-e "s#\.\n#A10-.\n#g" -e "s#\n\(/usr/local/bin\)#\nA15-\1#ig" \
-e "s#\n\(/bin\)#\nA20-\1#ig" -e "s#\n\(/mingw/bin\)#\nA25-\1#ig" \
-e "s#\n\(/[a-z]/vs10/vc/bin\)#\nA40-\1#ig"
The last sed expression in that line basically looks for lines that begin with "/c/VS10/VC/bin" and prefixes them with 'A40-' like this:
...
/c/si
A40-/c/VS10/VC/bin
A40-/c/VS10/VC/bin/amd64
A40-/c/VS10/VC/bin/x86_amd64
/c/GnuWin32/bin
...
I like my sed expressions to be flexible (path structures change), but I don't want it to match the lines that end with amd64 or x86_amd64 (those are going to have a different string prepended). So I change the last expression to:
-e "s#\n\(/[a-z]/vs10/vc/bin\)\n#\nA40-\1\n#ig"
This works:
...
/c/si
A40-/c/VS10/VC/bin
/c/VS10/VC/bin/amd64
/c/VS10/VC/bin/x86_amd64
/c/GnuWin32/bin
...
Then, (to match any "line" matching the pseudocode "/x/.../bin") I change the last expression to:
-e "s#\n\(/[a-z]/.*/bin\)\n#\nA40-\1\n#ig"
Which produces:
...
/c/si
/c/VS10/VC/bin
/c/VS10/VC/bin/amd64
/c/VS10/VC/bin/x86_amd64
/c/GnuWin32/bin
...
??? - sed didn't match any character ('.') any number of times ('*') in the middle of the line ???
But, if I pipe the output into a different instance of sed (and compensate for sed handling each "line" separately) like this:
| sed -e "s#^\(/[a-z]/.*/bin\)$#A40-\1#ig"
I get:
sed: -e expression #1, char 30: unterminated `s' command
??? How is that unterminated? It's got all three '#' characters after the s, has the modifiers 'i' and 'g' after the third '#', and the entire expression is in double quotes ('"'). Also, there are no escapes ('\') immediately preceding the delimiters, and the delimiter is not a part of either the search or the replacement. Let's try a different delimiter than '#', like '~':
I use:
| sed -e "s~^\(/[a-z]/.*/bin\)$~A40-\1~ig"
and, I get:
...
/c/si
A40-/c/VS10/VC/bin
/c/VS10/VC/bin/amd64
/c/VS10/VC/bin/x86_amd64
A40-/c/GnuWin32/bin
...
And, that is correct! The only thing I changed was the delimiter from '#' to '~' and it worked ???
This is not (even close to) the first time that sed has produced unexplainable results for me.
Why, oh, why, is sed NOT matching syntax in an expression in the same instance, but IS matching when piped into another instance of sed?
And, why, oh, why, do I have to use a different delimiter when I do this (in order not to get an "unterminated 's' command")?
And the real reason I'm asking: Is this a bug in sed, OR, is it correct behavior that I don't understand (and if so, can someone explain why this behavior is correct)? I want to know if I'm doing it wrong, or if I need a different/better tool (or both, they don't have to be mutually exclusive).
I'll mark a response as the answer if someone can either prove why this behavior is correct or prove why it is a bug. I'll gladly accept any advice about other tools or different methods of using sed, but those won't answer the question.
I'm going to have to get better at other text processors (like awk, tr, etc.) because sed is costing me too much time with its unexplainable results.
P.S. This is not the complete logic of my PATH manipulation. The complete logic also finishes prepending all the lines with values from 'A00-' to 'Z99-', then pipes that output into 'sort -u -f' and back into sed to remove those same prefixes on each line and to convert the lines ('\n') back into colons (':'). Then "export PATH='" is prepended to the single line and "'" is appended to it. Then that output is redirected into a temporary file. Next, that temporary file is sourced. And, finally, that temporary file is removed.
The /etc/profile script also displays the contents of PATH before and after sorting (in case it screwed up the path).
P.P.S. I'm sure there is a much better way to do this. It started as some very simple sed manipulations, and grew into the monster you see here. Even if there is a better way, I still need to know why sed is giving me these results.
sed -e "s#^\(/[a-z]/.*/bin\)$#A40-\1#ig"
is unterminated because the shell is trying to expand "$#A". Put your expressions in single quotes to avoid this.
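A small sketch makes the expansion visible. The function name demo is hypothetical; the expression is printed instead of being handed to sed, and the function is called without arguments so that $# is 0 inside it:

```shell
demo() {
    # Double quotes: the shell replaces $# with the argument count (0 here),
    # which swallows the second '#' delimiter.
    printf '%s\n' "s#^\(/[a-z]/.*/bin\)$#A40-\1#ig"
    # Single quotes: the expression reaches sed unchanged.
    printf '%s\n' 's#^\(/[a-z]/.*/bin\)$#A40-\1#ig'
}

demo
```

The first line printed is `s#^\(/[a-z]/.*/bin\)0A40-\1#ig`, which has only two delimiters and is 30 characters long, consistent with the "char 30" in the error message.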
The expression
-e "s#\n\(/[a-z]/.*/bin\)\n#\nA40-\1\n#ig"
fails, or doesn't do what you expect, because . matches the newline in a multi-line expression. Check your whole output, the A40- is at the very beginning. Change it to
-e "s#\n\(/[a-z]/[^\n]*/bin\)\n#\nA40-\1\n#ig"
and it might be more what you expect. This may very well be the case with most of your issues with multi-line modifications.
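To see the difference on a small input, here is a sketch that assumes GNU sed (it relies on \n being understood in the replacement and inside a bracket expression). The sample path is made up:

```shell
# Made-up sample: one entry that does not end in /bin, followed by one that does.
path='a:/c/x/lib:/c/y/bin:b'

# '.*' is greedy and can run across the embedded newlines, so the match starts
# at /c/x/lib and the prefix lands on the wrong entry.
printf '%s\n' "$path" | sed -e 's/:/\n/g' -e 's#\n\(/[a-z]/.*/bin\)\n#\nA40-\1\n#'

# '[^\n]*' cannot cross a newline, so only the entry ending in /bin is tagged.
printf '%s\n' "$path" | sed -e 's/:/\n/g' -e 's#\n\(/[a-z]/[^\n]*/bin\)\n#\nA40-\1\n#'
```

In the first run the prefix ends up on /c/x/lib; in the second it ends up on /c/y/bin.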
You can also put the statements, one per line, into a standalone file and invoke sed with sed -f editscript. It might make maintenance of this a bit easier.

Bash quote behavior and sed

I wrote a short bash script that is supposed to strip the leading tabs/spaces from a string:
#!/bin/bash
RGX='s/^[ \t]*//'
SED="sed '$RGX'"
echo " string" | $SED
It works from the command line, but the script gets this error:
sed: -e expression #1, char 1: unknown command: `''
My guess is that something is wrong with the quotes, but I'm not sure what.
Putting commands into variables and getting them back out intact is hard, because quoting doesn't work the way you expect (see BashFAQ #050, "I'm trying to put a command in a variable, but the complex cases always fail!"). There are several ways to deal with this:
1) Don't do it unless you really need to. Seriously, unless you have a good reason to put your command in a variable first, just execute it and don't deal with this messiness.
2) Don't use eval unless you really really really need to. eval has a well-deserved reputation as a source of nasty and obscure bugs. They can be avoided if you understand them well enough and take the necessary precautions to avert them, but this should really be a last resort.
3) If you really must define a command at one point and use it later, either define it as a function or an array. Here's how to do it with a function:
RGX='s/^[ \t]*//'
SEDCMD() { sed "$RGX"; }
echo " string" | SEDCMD
Here's the array version:
RGX='s/^[ \t]*//'
SEDCMD=(sed "$RGX")
echo " string" | "${SEDCMD[@]}"
The idiom "${SEDCMD[@]}" lets you expand an array, keeping each element a separate word, without any of the problems you're having.
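To see what the shell actually hands to sed in each case, the words can be printed instead of executed. This is a sketch assuming bash; globbing is switched off for the unquoted expansion so that only word splitting is shown:

```shell
#!/usr/bin/env bash
RGX='s/^[ \t]*//'
SED="sed '$RGX'"

set -f                    # disable globbing; only word splitting is of interest
printf '[%s]\n' $SED      # unquoted on purpose: split on whitespace, quotes kept literally
set +f

SEDCMD=(sed "$RGX")
printf '[%s]\n' "${SEDCMD[@]}"   # two words: the command and the intact expression
```

The string version yields three words, and the second one starts with a literal single quote, which is the character sed complains about in "unknown command". The array version yields exactly two words.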
It does. Try:
#!/bin/bash
RGX='s/^[ \t]*//'
#SED='$RGX'
echo " string" | sed "$RGX"
This works.
The issue you have is with quotes and spaces. Double quoted strings are passed as single arguments.
Add set -x to your script. You'll see that variables within a single-quote mark are not expanded.
To expand on my comment above:
#!/bin/bash
RGX='s/^[[:space:]]+//'
SED="sed -r '$RGX'"
eval "printf \" \tstring\n\" | $SED"
Note that this also makes your regex an extended one, for no particular reason. :-)

Controlling shell command line wildcard expansion in C or C++

I'm writing a program, foo, in C++. It's typically invoked on the command line like this:
foo *.txt
My main() receives the arguments in the normal way. On many systems, argv[1] is literally *.txt, and I have to call system routines to do the wildcard expansion. On Unix systems, however, the shell expands the wildcard before invoking my program, and all of the matching filenames will be in argv.
Suppose I wanted to add a switch to foo that causes it to recurse into subdirectories.
foo -a *.txt
would process all text files in the current directory and all of its subdirectories.
I don't see how this is done, since, by the time my program gets a chance to see the -a, the shell has already done the expansion and the user's *.txt input is lost. Yet there are common Unix programs that work this way. How do they do it?
In Unix land, how can I control the wildcard expansion?
(Recursing through subdirectories is just one example. Ideally, I'm trying to understand the general solution to controlling the wildcard expansion.)
Your program has no influence over the shell's command line expansion. Which program will be called is determined after all the expansion is done, so it's already too late to change anything about the expansion programmatically.
The user calling your program, on the other hand, has the possibility to create whatever command line he likes. Shells allow you to easily prevent wildcard expansion, usually by putting the argument in single quotes:
program -a '*.txt'
If your program is called like that, it will receive two parameters: -a and *.txt.
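A short sketch of the effect, using a hypothetical count_args function as a stand-in for the program and a scratch directory containing three text files:

```shell
# Hypothetical stand-in for the program: report how many arguments arrived.
count_args() { echo "$#"; }

# Runs in a subshell so the caller's working directory is untouched.
demo() (
    dir=$(mktemp -d) || exit 1
    cd "$dir" || exit 1
    touch a.txt b.txt c.txt
    count_args -a *.txt      # the shell expands the pattern before the call
    count_args -a '*.txt'    # quoted: the pattern arrives as one literal argument
    cd / && rm -rf "$dir"
)

demo
```

The unquoted call reports 4 arguments (-a plus three file names); the quoted call reports 2.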
On Unix, you should just leave it to the user to manually prevent wildcard expansion if it is not desired.
As the other answers said, the shell does the wildcard expansion - and you stop it from doing so by enclosing arguments in quotes.
Note that options -R and -r are usually used to indicate recursive - see cp, ls, etc for examples.
Assuming you organize things appropriately so that wildcards are passed to your program as wildcards and you want to do recursion, then POSIX provides routines to help:
nftw - file tree walk (recursive access).
fnmatch, glob, wordexp - to do filename matching and expansion
There is also ftw, which is very similar to nftw but it is marked 'obsolescent' so new code should not use it.
Adrian asked:
But I can say ls -R *.txt without single quotes and get a recursive listing. How does that work?
To adapt the question to a convenient location on my computer, let's review:
$ ls -F | grep '^m'
makefile
mapmain.pl
minimac.group
minimac.passwd
minimac_13.terminal
mkmax.sql.bz2
mte/
$ ls -R1 m*
makefile
mapmain.pl
minimac.group
minimac.passwd
minimac_13.terminal
mkmax.sql.bz2
mte:
multithread.ec
multithread.ec.original
multithread2.ec
$
So, I have a sub-directory 'mte' that contains three files. And I have six files with names that start 'm'.
When I type 'ls -R1 m*', the shell notes the metacharacter '*' and uses its equivalent of glob() or wordexp() to expand that into the list of names:
makefile
mapmain.pl
minimac.group
minimac.passwd
minimac_13.terminal
mkmax.sql.bz2
mte
Then the shell arranges to run '/bin/ls' with 9 arguments (program name, option -R1, plus 7 file names and terminating null pointer).
The ls command notes the options (recursive and single-column output), and gets to work.
The first 6 names (as it happens) are simple files, so there is nothing recursive to do.
The last name is a directory, so ls prints its name and its contents, invoking its equivalent of nftw() to do the job.
At this point, it is done.
This uncontrived example doesn't show what happens when there are multiple directories, and so the description above over-simplifies the processing.
Specifically, ls processes the non-directory names first, and then processes the directory names in alphabetic order (by default), and does a depth-first scan of each directory.
foo -a '*.txt'
Part of the shell's job (on Unix) is to expand command line wildcard arguments. You prevent this with quotes.
Also, on Unix systems, the "find" command does what you want:
find . -name '*.txt'
will list all .txt files recursively from the current directory down.
Thus, you could do
foo `find . -name '*.txt'`
I wanted to point out another way to turn off wildcard expansion. You can tell your shell to stop expanding wildcards with the noglob option.
With bash use set -o noglob:
> touch a b c
> echo *
a b c
> set -o noglob
> echo *
*
And with csh, use set noglob:
> echo *
a b c
> set noglob
> echo *
*