Error in grep using a regex expression

I think I have uncovered an error in grep. If I run this grep statement against a db log on the command line it runs fine.
grep "Query Executed in [[:digit:]]\{5\}.\?" db.log
I get this result:
Query Executed in 19699.188 ms;"select distinct * from /xyztable.....
But when I run it in a script and capture the output in a variable:
LONG_QUERY=`grep "Query Executed in [[:digit:]]\{5\}.\?" db.log`
the asterisk in the result is replaced with a list of all files in the current directory.
echo $LONG_QUERY
Result:
Query Executed in 19699.188 ms; "select distinct <list of files in
current directory> from /xyztable.....
Has anyone seen this behavior?

This is not an error in grep. This is an error in your understanding of how scripts are interpreted.
If I write in a script:
echo *
I will get a list of filenames, because an unquoted, unescaped, asterisk is interpreted by the shell (not grep, but /bin/bash or /bin/sh or whatever shell you use) as a request to substitute filenames matching the pattern '*', which is to say all of them.
If I write in a script:
echo "*"
I will get a single '*', because it was in a quoted string.
If I write:
STAR="*"
echo $STAR
I will get filenames again, because I quoted the star while assigning it to a variable, but then when I substituted the variable into the command it became unquoted.
If I write:
STAR="*"
echo "$STAR"
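I will get a single star, because double quotes still allow variable interpolation but suppress filename expansion of the result.
You are using backquotes - that is, ` characters - around a command. That captures the output of the command into a variable, and the unquoted $LONG_QUERY in your echo is then subject to the same filename expansion as $STAR above.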
I would suggest that if you are going to be echoing the results of the command, and little else, you should just redirect the results into a file. (After all, what are you going to do when your LONG_QUERY contains 10,000 lines of output because your log file got really full?)
Barring that, at the very least do echo "$LONG_QUERY" (in double quotes).
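A minimal sketch of the difference, assuming a POSIX shell with mktemp available. The file names a.txt and b.txt and the shortened query string are made up for illustration; they stand in for the real directory contents and the real grep output.

```shell
#!/bin/sh
# Work in a scratch directory so the glob has known files to match.
start_dir=$PWD
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch a.txt b.txt

# Stand-in for the line that grep captured in the question.
LONG_QUERY='select distinct * from /xyztable'

unquoted=$(echo $LONG_QUERY)    # the shell splits the value into words, then expands * to file names
quoted=$(echo "$LONG_QUERY")    # the value reaches echo as one untouched argument

printf '%s\n' "unquoted: $unquoted"
printf '%s\n' "quoted:   $quoted"

# Clean up the scratch directory.
cd "$start_dir" || exit 1
rm -rf "$workdir"
```

The first line printed contains a.txt b.txt where the asterisk was; the second keeps the asterisk.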

What regex can I use to match and replace full stops in multiple filenames?

I am looking to replace full stops in a filename however I need to remove some and replace others.
The file names are structured like so:
A.A M12345678 SOMEWORD 20.08.2019.pdf
A.A M12345678 SOMEWORD1 SOMEWORD2 20.08.2019.pdf
I want the format to be the following:
AA M12345678 SOMEWORD 20-08-2019.pdf
AA M12345678 SOMEWORD1 SOMEWORD2 20-08-2019.pdf
So the first full stop should be removed but the full stop encountered in the date should be a hyphen (-).
I have been using command prompt but I am running into some issues as I am fairly new to regular expressions.
I have tried approaching the problem one step at a time namely by just focusing on replacing the date format.
I've tested my regex using https://regexr.com/ and it matches correctly.
[0-9]\K[.]
My understanding is that the expression above should match the full stops in the date.
However when I run the following command:
ren *[0-9]\K[.].pdf -
It fails to find the file.
Expected Result
AA M12345678 SOMEWORD 20-08-2019.pdf
Actual Result
The expression I use just returns this error when I use the REN command.
"The filename, directory name, or volume label syntax is incorrect."
You could do this on the command line with sed. Note that sed edits text rather than file names, so run the following against a file that lists the names (or pipe each name through it):
sed -E -i 's/([0-9]{2})\.([0-9]{2})\.([0-9]{4})/\1-\2-\3/; s/^(.)\.(.*)/\1\2/' <YOUR_FILE>
The -i flag modifies <YOUR_FILE> in place, and -E enables extended regular expressions so the capture groups need no backslashes (sed does not understand \d, hence [0-9]). Two substitutions run on each line: the first matches the date and rewrites its full stops as hyphens, and the second removes the full stop from the leading A.A part.
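To apply substitutions like these to the file names themselves, one option is to pipe each name through sed and hand the result to mv. This is only a sketch under assumptions: it needs a POSIX shell with sed -E (for example Git Bash or WSL on Windows, not cmd.exe), and fix_name and rename_pdfs_here are hypothetical helper names, not existing commands.

```shell
#!/bin/sh
# Hypothetical helper: print the corrected form of one file name.
fix_name() {
  printf '%s\n' "$1" |
    sed -E 's/([0-9]{2})\.([0-9]{2})\.([0-9]{4})/\1-\2-\3/; s/^(.)\.(.*)/\1\2/'
}

# Hypothetical helper: rename every PDF in the current directory whose name would change.
rename_pdfs_here() {
  for f in *.pdf; do
    [ -e "$f" ] || continue              # skip the literal pattern when nothing matches
    new=$(fix_name "$f")
    [ "$f" = "$new" ] || mv -- "$f" "$new"
  done
}

# Show what a single name would become without touching any file.
fix_name 'A.A M12345678 SOMEWORD 20.08.2019.pdf'
```

Calling rename_pdfs_here inside the target directory performs the actual renames; running fix_name on a few names first is a cheap way to check the pattern.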

Mass rename in shell script

I have a bunch of files which are of this format:
blabla.log.YYYY.MM.DD
Where YYYY.MM.DD is something like (2016.01.18)
I have quite a few folders with about 1000 files in each, so I wanted to have a simple script to rename them. I want to rename them to
blabla.log
So basically, I'm just stripping the date at the end. Here is what I have:
for f in [a-zA-Z]*.log.[0-9][0-9][0-9][0-9].[0-9][0-9].[0-9][0-9]; do
mv -v $f ${f#[0-9][0-9][0-9][0-9].[0-9][0-9].[0-9][0-9]};
done
This script outputs this:
mv: `blabla.log.2016.01.18' and `blabla.log.2016.01.18' are the same file
For more information:
I'm on windows, but I run this script in gitbash
For some reason, my gitbash doesn't recognize the "rename" command
Some regex patterns (like [0-9]{4}) don't seem to work
I'm really at a loss. Thanks.
EDIT: I need to rename every single file that has a date at the end and that is of the form *.log.2016.01.18. They all need to keep their original names. All that should change is the removal of the date.
You have to use % instead of #: you want to remove from the end, not the start of your string.
Also, you're missing a . in what has to be removed, you don't want to end up with blabla.log..
Quoting the variable names prevents surprises when file names contain special characters.
Together:
mv -v "$f" "${f%.[0-9][0-9][0-9][0-9].[0-9][0-9].[0-9][0-9]}"
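Put back into the loop from the question, the corrected command could look like the sketch below. It runs in a scratch directory made with mktemp, and the two sample file names are invented for the demonstration; the -e guard and the -- argument are additions that are not in the original script.

```shell
#!/bin/sh
# Scratch directory with two sample files (names made up for the demo).
start_dir=$PWD
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch blabla.log.2016.01.18 other.log.2015.12.31

for f in [a-zA-Z]*.log.[0-9][0-9][0-9][0-9].[0-9][0-9].[0-9][0-9]; do
  [ -e "$f" ] || continue   # nothing matched: the pattern is left as literal text
  # % strips the shortest match of the pattern from the END of the value.
  mv -v -- "$f" "${f%.[0-9][0-9][0-9][0-9].[0-9][0-9].[0-9][0-9]}"
done

renamed=$(ls)               # names left in the directory after the loop
printf '%s\n' "$renamed"

cd "$start_dir" || exit 1
rm -rf "$workdir"
```

Note that the pattern after % is a shell glob, not a regular expression, which is why [0-9]{4} has no effect there.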

Extracting group from regex in shell script using grep

I want to extract the output of a command run through shell script in a variable but I am not able to do it. I am using grep command for the same. Please help me in getting the desired output in a variable.
x=$(pwd)
pw=$(grep '\(.*\)/bin' $x)
echo "extracted is:"
echo $pw
The output of the pwd command is /opt/abc/bin/ and I want only /root/abc part of it. Thanks in advance.
Use dirname to strip the last segment from a path.
You can use:
x=$(pwd)
pw=$(dirname "$x")
echo "$pw"
Or simply:
pw=$(dirname "$(pwd)")
echo "$pw"
All of what you're doing can be done in a single echo:
echo "${PWD%/*}"
The $PWD variable holds the current directory, and %/* removes the last / and everything after it.
For your case it will output: /root/abc
The second (and any subsequent) argument to grep is the name of a file to search, not a string to perform matching against.
Furthermore, grep prints the matching line or (with -o) the matching string, not whatever the parentheses captured. For that, you want a different tool.
Minimally fixing your code would be
x=$(pwd)
pw=$(printf '%s\n' "$x" | sed 's%\(.*\)/bin.*%\1%')
(If you only care about Bash, not other shells, you could do sed ... <<<"$x" without the explicit pipe; the syntax is also somewhat more satisfying.)
But of course, the shell has basic string manipulation functions built in.
pw=${x%/bin*}
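The two suffix-removal patterns mentioned in these answers behave differently once the path goes deeper than bin. A small sketch, using made-up example paths:

```shell
#!/bin/sh
# Example paths, invented for illustration.
x=/opt/abc/bin
deeper=/opt/abc/bin/tools

parent=${x%/*}                 # drop the last path segment
before_bin=${x%/bin*}          # drop /bin and anything after it

deeper_parent=${deeper%/*}         # only the last segment (tools) goes
deeper_before_bin=${deeper%/bin*}  # everything from /bin onward goes

printf '%s\n' "$parent" "$before_bin" "$deeper_parent" "$deeper_before_bin"
```

For /opt/abc/bin both forms give /opt/abc; for /opt/abc/bin/tools only the /bin* form does.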

Why does FINDSTR behave differently in powershell and cmd?

The following command pipes the output of echo to findstr and tries to match a regular expression with it. I use it to check if the echoed line only consists of (one or more) digits:
echo 123 | findstr /r /c:"^[0-9][0-9]*$"
The expected output of findstr is 123, which means that the expression could be matched with this string. The output is correct when I execute the command with powershell.exe.
Executing the command in cmd.exe however does not give a match. It only outputs an empty line and sets %ERRORLEVEL% to 1, which means that no match was found.
What causes the different behavior? Is there a way to make this command run correctly on cmd as well?
My OS is Windows 7 Professional, 64 Bit.
In Powershell the command echoes the string 123 to the pipeline and that matches your regular expression.
In cmd, your command echoes 123<space> to the pipeline. The trailing space isn't allowed for in your regular expression so you don't get a match.
Try:
echo 123| findstr /r /c:"^[0-9][0-9]*$"
and it will work just fine. Or just switch entirely to Powershell and stop having to worry about the vagaries of cmd.exe.
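The effect of the trailing space on an anchored pattern can be reproduced outside cmd. The sketch below is only an analogy in a POSIX shell, with grep standing in for findstr; the matches function is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the whole line consists of one or more digits.
matches() {
  printf '%s\n' "$1" | grep -q '^[0-9][0-9]*$'
}

# Same digits, with and without a trailing space before the end of the line.
if matches '123 '; then with_space='match'; else with_space='no match'; fi
if matches '123'; then without_space='match'; else without_space='no match'; fi

printf 'with trailing space:    %s\n' "$with_space"
printf 'without trailing space: %s\n' "$without_space"
```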
Edit:
Yes, cmd and powershell handle parameters very differently.
With cmd all programs are passed a simple text command line. The processing that cmd performs is pretty minimal: it terminates the command at | or &, removes I/O redirection, and substitutes in any variables. Also of course it identifies the command and executes it. Any argument processing is done by the command itself, so a command can choose whether spaces separate arguments or what " characters mean. Mostly commands have a fairly common interpretation of these things but they can just do their own thing with the string they were given. echo does its own thing: it prints everything after the command name, including the space before the |.
Powershell on the other hand has a complex syntax for arguments. All of the argument parsing is done by Powershell. The parsed arguments are then passed to Powershell functions or cmdlets as a sequence of .Net objects: that means you aren't limited to just passing simple strings around. If the command turns out not to be a Powershell command and runs externally, Powershell attempts to convert the objects into strings and puts quotes round any arguments that contain a space. Sometimes the conversion can be a bit confusing, but it does mean that something like this:
echo (1+1)
will echo 2 in Powershell where cmd would just echo the input string.
It is worth always remembering that with Powershell you are working with objects, so for example:
PS C:\> echo Today is (get-date)
Today
is
17 April 2014 20:03:15
PS C:\> echo "Today is $(get-date)"
Today is 04/17/2014 20:03:20
In the first case echo gets 3 objects, two strings and a date. It outputs each object on a separate line (and a blank line when the type changes). In the second case it gets a single object which is a string (and unlike the cmd echo it never sees the quote marks).

Unpredictable behavior in sed interpreters output from multiple expressions

Why does GNU sed sometimes handle substitution with piped output into another sed instance differently than when multiple expressions are used with the same one?
Specifically, for msys/mingw sessions, in the /etc/profile script I have a series of manipulations that "rearrange" the order of the environment variable PATH and remove duplicate entries.
Take note that while sed normally treats each line of input separately (and therefore can't easily substitute '\n' in the input stream), this sed statement substitutes '\n' for each ':', so it still handles the entire input stream like one line (with '\n' characters in it). This behavior stays true for all sed expressions in the same instance of sed (basically until you redirect or pipe the output into another program).
Here are the obligatory specs:
Windows 7 Professional Service Pack 1
HP Pavilion dv7-6b78us
16 GB DDR3 RAM
MinGW-w64 (x86_64-w64-mingw32-gcc-4.7.1.2-release-win64-rubenvb) mounted on /mingw/
MSYS (20111123) mounted on / and on /usr/
$ uname -a="MINGW32_NT-6.1 CHRIV-L09 1.0.17(0.48/3/2) 2011-04-24 23:39 i686 Msys"
$ which sed="/bin/sed.exe" (it's part of MSYS)
$ sed --version="GNU sed version 4.2.1"
This is the contents of PATH before manipulation:
PATH='.:/usr/local/bin:/mingw/bin:/bin:/c/PHP:/c/Program Files (x86)/HP SimplePass 2011/x64:/c/Program Files (x86)/HP SimplePass 2011:/c/Windows/system32:/c/Windows:/c/Windows/System32/Wbem:/c/Windows/System32/WindowsPowerShell/v1.0:/c/si:/c/android-sdk:/c/android-sdk/tools:/c/android-sdk/platform-tools:/c/Program Files (x86)/WinMerge:/c/ntp/bin:/c/GnuWin32/bin:/c/Program Files/MySQL/MySQL Server5.5/bin:/c/Program Files (x86)/WinSCP:/c/Program Files (x86)/Overlook Fing 2.1/bin:/c/Program Files/7-zip:.:/c/Program Files/TortoiseGit/bin:/c/Program Files (x86)/Git/bin:/c/VS10/VC/bin/x86_amd64:/c/VS10/VC/bin/amd64:/c/VS10/VC/bin'
This is an excerpt of /etc/profile (where I have begun the PATH manipulation):
set | grep --color=never ^PATH= | sed -e "s#^PATH=##" -e "s#'##g" \
-e "s/:/\n/g" -e "s#\n\(/[^\n]*tortoisegit[^\n]*\)#\nZ95-\1#ig" \
-e "s#\n\(/[a-z]/win\)#\nZ90-\1#ig" -e "s#\n\(/[a-z]/p\)#\nZ70-\1#ig" \
-e "s#\.\n#A10-.\n#g" -e "s#\n\(/usr/local/bin\)#\nA15-\1#ig" \
-e "s#\n\(/bin\)#\nA20-\1#ig" -e "s#\n\(/mingw/bin\)#\nA25-\1#ig" \
-e "s#\n\(/[a-z]/vs10/vc/bin\)#\nA40-\1#ig"
The last sed expression in that line basically looks for lines that begin with "/c/VS10/VC/bin" and prefixes them with 'A40-' like this:
...
/c/si
A40-/c/VS10/VC/bin
A40-/c/VS10/VC/bin/amd64
A40-/c/VS10/VC/bin/x86_amd64
/c/GnuWin32/bin
...
I like my sed expressions to be flexible (path structures change), but I don't want it to match the lines that end with amd64 or x86_amd64 (those are going to have a different string prepended). So I change the last expression to:
-e "s#\n\(/[a-z]/vs10/vc/bin\)\n#\nA40-\1\n#ig"
This works:
...
/c/si
A40-/c/VS10/VC/bin
/c/VS10/VC/bin/amd64
/c/VS10/VC/bin/x86_amd64
/c/GnuWin32/bin
...
Then, (to match any "line" matching the pseudocode "/x/.../bin") I change the last expression to:
-e "s#\n\(/[a-z]/.*/bin\)\n#\nA40-\1\n#ig"
Which produces:
...
/c/si
/c/VS10/VC/bin
/c/VS10/VC/bin/amd64
/c/VS10/VC/bin/x86_amd64
/c/GnuWin32/bin
...
??? - sed didn't match any character ('.') any number of times ('*') in the middle of the line ???
But, if I pipe the output into a different instance of sed (and compensate for sed handling each "line" separately) like this:
| sed -e "s#^\(/[a-z]/.*/bin\)$#A40-\1#ig"
I get:
sed: -e expression #1, char 30: unterminated `s' command
??? How is that unterminated? It's got all three '#' characters after the s, has the modifiers 'i' and 'g' after the third '#', and the entire expression is in double quotes ('"'). Also, there are no escapes ('\') immediately preceding the delimiters, and the delimiter is not a part of either the search or the replacement. Let's try a different delimiter than '#', like '~':
I use:
| sed -e "s~^\(/[a-z]/.*/bin\)$~A40-\1~ig"
and, I get:
...
/c/si
A40-/c/VS10/VC/bin
/c/VS10/VC/bin/amd64
/c/VS10/VC/bin/x86_amd64
A40-/c/GnuWin32/bin
...
And, that is correct! The only thing I changed was the delimiter from '#' to '~' and it worked ???
This is not (even close to) the first time that sed has produced unexplainable results for me.
Why, oh, why, is sed NOT matching syntax in an expression in the same instance, but IS matching when piped into another instance of sed?
And, why, oh, why, do I have to use a different delimiter when I do this (in order not to get an "unterminated 's' command")?
And the real reason I'm asking: Is this a bug in sed, OR, is it correct behavior that I don't understand (and if so, can someone explain why this behavior is correct)? I want to know if I'm doing it wrong, or if I need a different/better tool (or both, they don't have to be mutually exclusive).
I'll mark a response as the answer if someone can either prove why this behavior is correct or prove why it is a bug. I'll gladly accept any advice about other tools or different methods of using sed, but those won't answer the question.
I'm going to have to get better at other text processors (like awk, tr, etc.) because sed is costing me too much time with its unexplainable results.
P.S. This is not the complete logic of my PATH manipulation. The complete logic also finishes prepending all the lines with values from 'A00-' to 'Z99-', then pipes that output into 'sort -u -f' and back into sed to remove those same prefixes on each line and to convert the lines ('\n') back into colons (':'). Then "export PATH='" is prepended to the single line and "'" is appended to it. Then that output is redirected into a temporary file. Next, that temporary file is sourced. And, finally, that temporary file is removed.
The /etc/profile script also displays the contents of PATH before and after sorting (in case it screwed up the path).
P.P.S. I'm sure there is a much better way to do this. It started as some very simple sed manipulations, and grew into the monster you see here. Even if there is a better way, I still need to know why sed is giving me these results.
sed -e "s#^\(/[a-z]/.*/bin\)$#A40-\1#ig"
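is unterminated because the shell expands $# inside double quotes. $# is the number of positional parameters (0 when there are none), so sed actually receives s#^\(/[a-z]/.*/bin\)0A40-\1#ig, which is 30 characters long and contains only two # delimiters; that is why the error points at char 30. With ~ as the delimiter the sequence $~ is not a shell parameter, so it is passed through unchanged. Put your expressions in single quotes to avoid this.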
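What the shell hands to sed in each case can be inspected by assigning the expression to a variable and printing it. A small sketch, assuming sh or bash; the variable names are made up, and set -- is used only to guarantee that there are no positional parameters.

```shell
#!/bin/sh
set --   # clear the positional parameters so that $# is 0

double="s#^\(/[a-z]/.*/bin\)$#A40-\1#ig"   # double quotes: the shell replaces $# with 0
single='s#^\(/[a-z]/.*/bin\)$#A40-\1#ig'   # single quotes: passed on exactly as typed
tilde="s~^\(/[a-z]/.*/bin\)$~A40-\1~ig"    # $~ is not a parameter in sh or bash, so it stays literal

printf '%s\n' "$double" "$single" "$tilde"
printf 'length of the double-quoted form: %s\n' "${#double}"
```

The double-quoted form comes out with a 0 where $# stood and is 30 characters long, which lines up with the position reported in the sed error message.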
The expression
-e "s#\n\(/[a-z]/.*/bin\)\n#\nA40-\1\n#ig"
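fails, or doesn't do what you expect, because . also matches the newlines embedded in the pattern space, and .* is greedy: a single match can run from the first line that starts with /[a-z]/ all the way to the last /bin that is followed by a newline. Check your whole output; the A40- is at the very beginning. Change it to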
-e "s#\n\(/[a-z]/[^\n]*/bin\)\n#\nA40-\1\n#ig"
and it might be more what you expect. This may very well be the case with most of your issues with multi-line modifications.
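The difference between the two expressions shows up on a short list. The sketch below assumes GNU sed (it relies on \n in the pattern and replacement, \n inside a bracket expression, and the i flag, as the commands in the question do); the five-entry path list is a shortened, made-up version of the real PATH.

```shell
#!/bin/sh
# Shortened example list, invented for the demonstration.
path_list='/c/si:/c/VS10/VC/bin:/c/VS10/VC/bin/amd64:/c/GnuWin32/bin:/c/tail'

# .* crosses the embedded newlines, so one greedy match swallows several entries.
greedy=$(printf '%s\n' "$path_list" |
  sed -e 's/:/\n/g' -e 's#\n\(/[a-z]/.*/bin\)\n#\nA40-\1\n#ig')

# [^\n]* stops at each newline, so every entry is tested on its own.
per_line=$(printf '%s\n' "$path_list" |
  sed -e 's/:/\n/g' -e 's#\n\(/[a-z]/[^\n]*/bin\)\n#\nA40-\1\n#ig')

printf 'greedy:\n%s\n\nper line:\n%s\n' "$greedy" "$per_line"
```

With .* only the first candidate gets the A40- prefix, because the single match extends through /c/GnuWin32/bin; with [^\n]* both entries that end in /bin are prefixed. One caveat of this form: each match consumes its trailing newline, so of two adjacent matching entries the second would be skipped.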
You can also put the statements, one per line, into a standalone file and invoke sed with sed -f editscript. It might make maintenance of this a bit easier.