Why does grep matches all the lines no matter what the pattern - regex

I'm having a problem using grep.
I have a file http://pastebin.com/HxAcciCa that I want to check for certain patterns. And when I"m trying to search for it grep returns all the lines provided that the pattern already exists in the given file.
To explain more this is the code that I'm running
grep -F "ENVIRO" "$file_pos" >> blah
No matter what else I try even if I provide a whole line as a pattern bash always returns all the lines.
These are variations of what I'm trying:
grep -F "E20" "$file_pos" >> blah
grep E20 "$file_pos" >> blah
grep C:\E20-II\ENVIRO\SSNHapACS480.dll "$file_pos" >> blah
grep -F C:\E20-II\ENVIRO\SSNHapACS480.dll "$file_pos" >> blah
Also for some strange reasons when adding the -x option to grep, it doesn't return any line despite the fact that the exact pattern exists.
I've searched the web and the bash documentation for the cause but couldn't find anything.
My final test was the following
grep -F -C 1 "E20" "$store_pos" >> blah #store_pos has the same value as $file_pos
I thought maybe it was printing the lines after the result but that was not the case.
I was using the blah file to see the output.
Also I'm using Linux mint rebecca.
Finally although the naming is quite familiar this question is not similiar to Why does grep match all lines for the pattern "\'"
And finally I would like to say that I am new to bash.
I suspect The error might be due to the main file http://pastebin.com/HxAcciCa rather than the code?

From the comments, it appears that the file has carriage returns delimiting the lines, rather than the linefeeds that grep expects; as a result, grep sees the file as one huge line, that either matches or fails to match as a whole.
(Note: there are at least three different conventions about how to delimit the lines in a "plain text" file -- unix uses newline (\n), DOS/Windows uses carriage return followed by newline (\r\n), and pre-OSX versions of MacOS used just carriage return (\r).)
I'm not clear on how your file wound up in this format, but you can fix it easily with:
tr '\r' '\n' <badfile >goodfile
or on the fly with:
tr '\r' '\n' <badfile | grep ...

Check the line endings in your input file: file, wc -l.
Check you are indeed using the correct grep: which grep.
Use > to redirect the output, or | more or | less to not be confused by earlier attempts you are appending to.
Edit: Looks like your file has the wrong line endings (old Mac OS (CR) perhaps). If you have dos2unix you can try to convert them to Unix style line endings (LF).

I don't have access to a PC at the moment, but what could possibly help you troubleshoot:
1. Use grep --color -F to see if it matches correctly.
2. After your statement, use | cat -A to see if there's any surprising control characters, lines should end in $, any other characters like \I or \M can sometimes be a headache.
I suspect number 2 as it seems to be Windows output. In which case you can cat filename | dos2unix | grep stmt should solve it
Did you save the dos2unix output as another file?
Just double check the file, it should be similar to this:
[root#pro-mon9001 ~]# cat -A Test.txt
Windows^M$
Style^M$
Files^M$
Are^M$
Hard ^M$
To ^M$
Parse^M$
[root#pro-mon9001 ~]# dos2unix Test.txt
dos2unix: converting file Test.txt to Unix format ...
[root#pro-mon9001 ~]# cat -A Test.txt
Windows$
Style$
Files$
Are$
Hard$
To$
Parse$
Now it should parse properly - so just verify that it did convert the file properly
Good luck!

Related

How do I grep a string using the previous output for my next argument?

There is a string located within a file that starts with 4bceb and is 32 characters long.
To find it I tried the following
Input:
find / -type f 2>/dev/null | xargs grep "4bceb\w{27}" 2>/dev/null
after entering the command it seems like the script is awaiting some additional command.
Your command seems alright in principle, i.e. it should correctly execute the grep command for each file find returns. However, I don't believe your regular expression (respectively the way you call grep) is correct for what you want to achieve.
First, in order to get your expression to work, you need to tell grep that you are using Perl syntax by specifying the -P flag.
Second, your regexp will return the full lines that contain sequences starting with "4bceb" that are at least 32 characters long, but may be longer as well. If, for example your ./test.txt file contents were
4bcebUUUUUUUUUUUUUUUUUUUUUUUU31
4bcebVVVVVVVVVVVVVVVVVVVVVVVVV32
4bcebWWWWWWWWWWWWWWWWWWWWWWWWWW33
sometext4bcebYYYYYYYYYYYYYYYYYYYYYYYYY32somemoretext
othertext 4bcebZZZZZZZZZZZZZZZZZZZZZZZZZ32 evenmoretext
your output would include all lines except the first one (in which the sequence is shorter than 32 characters). If you actually want to limit your results to lines that just contain sequences that are exactly 32 characters long, you can use the -w flag (for word-regexp) with grep, which would only return lines 2 and 5 in the above example.
Third, if you only want the match but not the surrounding text, the grep flag -o will do exactly this.
And finally, you don't need to pipe the find output into xargs, as grep can directly do what you want:
grep -rnPow / -e "4bceb\w{27}"
will recursively (-r) scan all files starting from / and return just the ones that contain matching words, along with the matches (as well as the line numbers they were found in, as result of the flag -n):
./test.txt:2:4bcebVVVVVVVVVVVVVVVVVVVVVVVVV32
./test.txt:5:4bcebZZZZZZZZZZZZZZZZZZZZZZZZZ32

What is the difference b/w two sed commands below?

Information about the environment I am working in:
$ uname -a
AIX prd231 1 6 00C6B1F74C00
$ oslevel -s
6100-03-10-1119
Code Block A
( grep schdCycCleanup $DCCS_LOG_FILE | sed 's/[~]/ \
/g' | grep 'Move(s) Exist for cycle' | sed 's/[^0-9]*//g' ) > cycleA.txt
Code Block B
( grep schdCycCleanup $DCCS_LOG_FILE | sed 's/[~]/ \n/g' | grep 'Move(s) Exist for cycle' | sed 's/[^0-9]*//g' ) > cycleB.txt
I have two code blocks(shown above) that make use of sed to trim the input down to 6 digits but one command is behaving differently than I expected.
Sample of input for the two code blocks
Mar 25 14:06:16 prd231 ajbtux[33423660]: 20160325140616:~schd_cem_svr:1:0:SCHD-MSG-MOVEEXISTCYCLE:200705008:AUDIT:~schdCycCleanup - /apps/dccs/ajbtux/source/SCHD/schd_cycle_cleanup.c - line 341~ SCHD_CYCLE_CLEANUP - Move(s) Exist for cycle 389210~
I get the following output when the sample input above goes through the two code blocks.
cycleA.txt content
389210
cycleB.txt content
25140616231334236602016032514061610200705008341389210
I understand that my last piped sed command (sed 's/[^0-9]*//g') is deleting all characters other than numbers so I omitted it from the block codes and placed the output in two additional files. I get the following output.
cycleA1.txt content
SCHD_CYCLE_CLEANUP - Move(s) Exist for cycle 389210
cycleB1.txt content
Mar 25 15:27:58 prd231 ajbtux[33423660]: 20160325152758: nschd_cem_svr:1:0:SCHD-MSG-MOVEEXISTCYCLE:200705008:AUDIT: nschdCycCleanup - /apps/dccs/ajbtux/source/SCHD/schd_cycle_cleanup.c - line 341 n SCHD_CYCLE_CLEANUP - Move(s) Exist for cycle 389210 n
I can see that the first code block is removing every thing other that (SCHD_CYCLE_CLEANUP - Move(s) Exist for cycle 389210) and is using the tilde but the second code block is just replacing the tildes with the character n. I can also see that it is necessary in the first code block for a line break after this(sed 's/[~]/ ) and that is why I though having \n would simulate a line break but that is not the case. I think my different output results are because of the way regular expressions are being used. I have tried to look into regular expressions and searched about them on stackoverflow but did not obtain what I was looking for. Could someone explain how I can achieve the same result from code block B as code block A without having part of my code be on a second line?
Thank you in advance
This is an example of the XY problem (http://xyproblem.info/). You're asking for help to implement something that is the wrong solution to your problem. Why are you changing ~s to newlines, etc when all you need given your posted sample input and expected output is:
$ sed -n 's/.*schdCycCleanup.* \([0-9]*\).*/\1/p' file
389210
or:
$ awk -F'[ ~]' '/schdCycCleanup/{print $(NF-1)}' file
389210
If that's not all you need then please edit your question to clarify your requirements for WHAT you are trying to do (as opposed to HOW you are trying to do it) as your current approach is just wrong.
Etan Reisner's helpful answer explains the problem and offers a single-line solution based on an ANSI C-quoted string ($'...'), which is appropriate, given that you originally tagged your question bash.
(Ed Morton's helpful answer shows you how to bypass your problem altogether with a different approach that is both simpler and more efficient.)
However, it sounds like your shell is actually something different - presumably ksh88, an older version of the Korn shell that is the default sh on AIX 6.1 - in which such strings are not supported[1]
(ANSI C-quoted strings were introduced in ksh93, and are also supported not only in bash, but in zsh as well).
Thus, you have the following options:
With your current shell, you must stick with a two-line solution that contains an (\-escaped) actual newline, as in your code block A.
Note that $(printf '\n') to create a newline does not work, because command substitutions invariably trim all trailing newlines, resulting in the empty string in this case.
Use a more modern shell that supports ANSI C-quoted strings, and use Etan's answer. http://www.ibm.com/support/knowledgecenter/ssw_aix_61/com.ibm.aix.cmds3/ksh.htm tells me that ksh93 is available as an alternative shell on AIX 6.1, as /usr/bin/ksh93.
If feasible: install GNU sed, which natively understands escape sequences such as \n in replacement strings.
[1] As for what actually happens when you try echo 'foo~bar~baz' | sed $'s/[~]/\\\n/g' in a POSIX-like shell that does not support $'...': the $ is left as-is, because what follow is not a valid variable name, and sed ends up seeing literal $s/[~]/\\\n/g, where the $ is interpreted as a context address applying to the last input line - which doesn't make a difference here, because there is only 1 line. \\ is interpreted as plain \, and \n as plain n, effectively replacing ~ instances with literal \n sequences.
GNU sed handles \n in the replacement the way you expect.
OS X (and presumably BSD) sed does not. It treats it as a normal escaped character and just unescapes it to n. (Though I don't see this in the manual anywhere at the moment.)
You can use $'' quoting to use \n as a literal newline if you want though.
echo 'foo~bar~baz' | sed $'s/[~]/\\\n/g'

Find non-ASCII codepoints in a file

I am currently using this regex to find the non-ASCII code points in a file, no matter what encoding:
$ cat test.txt | hd | grep -P " [8-9a-f][\da-f]"
Is there a better, more concise, or less hacky method? I usually use grep -P "[^\x00-\x7f]" to find the offensive characters but here I am looking for the offensive code points.
Note that the current hacky method does have the nice side effect of showing the surrounding ASCII characters, which is very nice for context.
Using hd, this should be faster:
hd test.txt |grep -w '[89a-f][0-9a-f]'
(grep -P invokes libpcre and is slower. grep -w searches just "words" and will default to standard posix regex, which is nearly as fast as a -F plain text query. Removing the cat from the pipe also saves (trivial) effort.)
If you didn't want the context, you could give grep the -o flag. If you want the context called out more clearly, consider --color (or even --color=always if you're piping the output somewhere and don't mind the coloring control characters). You may also find grep's -n flag useful, which will give you line numbers.
I think you can use grep's -a flag to achieve what you're looking for in a single command (this forces everything to be read as text rather than the useless "Binary file test.txt matches" output), though you may not like what the output does to your terminal. Maybe pipe it into a file and then view that file with vim (which, unlike less, won't render control characters):
grep -aP '[^\x00-\x7f]' test.txt > found-highchars
view found-highchars
This may or may not be faster than piping through hd and grep.

Unpredictable behavior in sed interpreters output from multiple expressions

Why does GNU sed sometimes handle substitution with piped output into another sed instance differently than when multiple expressions are used with the same one?
Specifically, for msys/mingw sessions, in the /etc/profile script I have a series of manipulations that "rearrange" the order of the environment variable PATH and removes duplicate entries.
Take note that while normally sed treats each line of input seperately (and therfore can't easily substitute '\n' in the input stream, this sed statement does a substitution of ':' with '\n', so it still handles the entire input stream like one line (with '\n' characters in it). This behavior stays true for all sed expressions in the same instance of sed (basically until you redirect or pipe the output into another program).
Here's the obligatory specs:
Windows 7 Professional Service Pack 1
HP Pavilion dv7-6b78us
16 GB DDR3 RAM
MinGW-w64 (x86_64-w64-mingw32-gcc-4.7.1.2-release-win64-rubenvb) mounted on /mingw/
MSYS (20111123) mounted on / and on /usr/
$ uname -a="MINGW32_NT-6.1 CHRIV-L09 1.0.17(0.48/3/2) 2011-04-24 23:39 i686 Msys"
$ which sed="/bin/sed.exe" (it's part of MSYS)
$ sed --version="GNU sed version 4.2.1"
This is the contents of PATH before manipulation:
PATH='.:/usr/local/bin:/mingw/bin:/bin:/c/PHP:/c/Program Files (x86)/HP SimplePass 2011/x64:/c/Program Files (x86)/HP SimplePass 2011:/c/Windows/system32:/c/Windows:/c/Windows/System32/Wbem:/c/Windows/System32/WindowsPowerShell/v1.0:/c/si:/c/android-sdk:/c/android-sdk/tools:/c/android-sdk/platform-tools:/c/Program Files (x86)/WinMerge:/c/ntp/bin:/c/GnuWin32/bin:/c/Program Files/MySQL/MySQL Server5.5/bin:/c/Program Files (x86)/WinSCP:/c/Program Files (x86)/Overlook Fing 2.1/bin:/c/Program Files/7-zip:.:/c/Program Files/TortoiseGit/bin:/c/Program Files (x86)/Git/bin:/c/VS10/VC/bin/x86_amd64:/c/VS10/VC/bin/amd64:/c/VS10/VC/bin'
This is an excerpt of /etc/profile (where I have begun the PATH manipulation):
set | grep --color=never ^PATH= | sed -e "s#^PATH=##" -e "s#'##g" \
-e "s/:/\n/g" -e "s#\n\(/[^\n]*tortoisegit[^\n]*\)#\nZ95-\1#ig" \
-e "s#\n\(/[a-z]/win\)#\nZ90-\1#ig" -e "s#\n\(/[a-z]/p\)#\nZ70-\1#ig" \
-e "s#\.\n#A10-.\n#g" -e "s#\n\(/usr/local/bin\)#\nA15-\1#ig" \
-e "s#\n\(/bin\)#\nA20-\1#ig" -e "s#\n\(/mingw/bin\)#\nA25-\1#ig" \
-e "s#\n\(/[a-z]/vs10/vc/bin\)#\nA40-\1#ig"
The last sed expression in that line basically looks for lines that begins with "/c/VS10/VC/bin" and prepends them with 'A40-' like this:
...
/c/si
A40-/c/VS10/VC/bin
A40-/c/VS10/VC/bin/amd64
A40-/c/VS10/VC/bin/x86_amd64
/c/GnuWin32/bin
...
I like my sed expressions to be flexible (path structures change), but I don't want it to match the lines that end with amd64 or x86_amd64 (those are going to have a different string prepended). So I change the last expression to:
-e "s#\n\(/[a-z]/vs10/vc/bin\)\n#\nA40-\1\n#ig"
This works:
...
/c/si
A40-/c/VS10/VC/bin
/c/VS10/VC/bin/amd64
/c/VS10/VC/bin/x86_amd64
/c/GnuWin32/bin
...
Then, (to match any "line" matching the pseudocode "/x/.../bin") I change the last expression to:
-e "s#\n\(/[a-z]/.*/bin\)\n#\nA40-\1\n#ig"
Which produces:
...
/c/si
/c/VS10/VC/bin
/c/VS10/VC/bin/amd64
/c/VS10/VC/bin/x86_amd64
/c/GnuWin32/bin
...
??? - sed didn't match any character ('.') any number of times ('*') in the middle of the line ???
But, if I pipe the output into a different instance of sed (and compensate for sed handling each "line" seperately) like this:
| sed -e "s#^\(/[a-z]/.*/bin\)$#A40-\1#ig"
I get:
sed: -e expression #1, char 30: unterminated `s' command
??? How is that unterminated? It's got all three '#' characters after the s, has the modifiers 'i' and 'g' after the third '#', and the entire expression is in double quotes ('"'). Also, there are no escapes ('\') immediately preceding the delimiters, and the delimiter is not a part of either the search or the replacement. Let's try a different delimiter than '#', like '~':
I use:
| sed -e "s~^(/[a-z]/.*/bin)$~A40-\1~ig"
and, I get:
...
/c/si
A40-/c/VS10/VC/bin
/c/VS10/VC/bin/amd64
/c/VS10/VC/bin/x86_amd64
A40-/c/GnuWin32/bin
...
And, that is correct! The only thing I changed was the delimeter from '#' to '~' and it worked ???
This is not (even close to) the first time that sed has produced unexplainable results for me.
Why, oh, why, is sed NOT matching syntax in an expression in the same instance, but IS matching when piped into another instance of sed?
And, why, oh, why, do I have to use a different delimeter when I do this (in order not to get an "unterminated 's' command"?
And the real reason I'm asking: Is this a bug in sed, OR, is it correct behavior that I don't understand (and if so, can someone explain why this behavior is correct)? I want to know if I'm doing it wrong, or if I need a different/better tool (or both, they don't have to be mutually exclusive).
I'll mark a response it as the answer if someone can either prove why this behavior is correct or if they can prove why it is a bug. I'll gladly accept any advice about other tools or different methods of using sed, but those won't answer the question.
I'm going to have to get better at other text processors (like awk, tr, etc.) because sed is costing me too much time with it's unexplainable results.
P.S. This is not the complete logic of my PATH manipulation. The complete logic also finishes prepending all the lines with values from 'A00-' to 'Z99-', then pipes that output into 'sort -u -f' and back into sed to remove those same prefixes on each line and to convert the lines ('\n') back into colons (':'). Then "export PATH='" is prepended to the single line and "'" is appended to it. Then that output is redirected into a temporary file. Next, that temporary file is sourced. And, finally, that temporary file is removed.
The /etc/profile script also displays the contents of PATH before and after sorting (in case it screwed up the path).
P.P.S. I'm sure there is a much better way to do this. It started as some very simple sed manipulations, and grew into the monster you see here. Even if there is a better way, I still need to know why sed is giving me these results.
sed -e "s#^\(/[a-z]/.*/bin\)$#A40-\1#ig"
is unterminated because the shell is trying to expand "$#A". Put your expressions in single quotes to avoid this.
The expression
-e "s#\n\(/[a-z]/.*/bin\)\n#\nA40-\1\n#ig"
fails, or doesn't do what you expect, because . matches the newline in a multi-line expression. Check your whole output, the A40- is at the very beginning. Change it to
-e "s#\n\(/[a-z]/[^\n]*/bin\)\n#\nA40-\1\n#ig"
and it might be more what you expect. This may very well be the case with most of your issues with multi-line modifications.
You can also put the statements, one per line, into a standalone file and invoke sed with sed -f editscript. It might make maintenance of this a bit easier.

awk script to remove ASCII from file type

Here is a simple command
file * | awk '/ASCII text/ {gsub(/:/,"",$1); print $1}' | xargs chmod -x
I am not able to understand the use of awk in the above as showed.
How is it working?
There was a deleted answer which came pretty close to avoiding the problems with whitespace or colons in filenames and the output of file. I've voted to undelete the answer, but I'm going to go ahead and post some improvements to it and add some explanation.
file -0 * | awk -F '\0' '$2 ~ /ASCII text/ {print $1 "\0"}' | xargs -0 chmod -x
Since nulls aren't allowed in filenames, it's safe to use them as delimiters. Each step in this pipeline uses nulls. file outputs them, awk accepts them in input and outputs them and xargs accepts them in input. I've also made the match specific to the description field so it won't trigger a false positive in the perhaps unusual case of a file which is named something like "ASCII text" but in fact its contents are not.
As others have said, the AWK command you posted matches lines of output from the file command that include "ASCII text" somewhere in the line. Then every colon is deleted (since gsub() is a global substitution) from field one which is the colon-space-delimited filename. A potential problem occurs if the filename contains either a colon or a space (or both or multiples). The filename will get truncated and the chmod will fail or might even be falsely triggered on a file with a similar name (e.g. "foo bar" and "foo" both exist, "foo" is not an ASCII text file so you don't want it to be touched, but "foo bar" gets truncated to "foo" and oops!). The reason spaces are potential problems is that AWK, by default, does field splitting on spaces and tabs.
Breakdown of the AWK portion of the pipeline you posted:
/ASCII text/ { - for each line that matches the regular expression
gsub(/:/,"",$1); - for each colon (as a regular expression) in the first field, substitute an empty string
print $1} - print the thus modified first field
I'm guessing but it looks like it's extracting the part before the : in the output of the file command (i.e. the filename). The gsub part will remove the : in the filename and so something like foo.txt: ASCII text will become foo.txt ASCII text. Then, the print will print the first item in the space separated list (in this case, the filename foo.txt). All these files will be made unexecutable by the chmod.
This looks quite tedious. It's probably easier to just say awk -F: '{print $1}' after grepping instead of the whole substitution trick. Also, this will break if the filename has spaces in it.
It's using file to determine the type (contents) of each file, then selecting the ones that are ASCII text and removing everything from the first colon (which is assumed to be the separator between the filename and file type; this is fragile when file names have colons in them; as Noufel noted, it's also doing it the hard way), then using xargs to batch then up and clear the execute bits. (The usual reason for doing this is files transferred from Windows, which doesn't have execute bits so often all files end up with execute bits set as seen by Unixes.)
The breakage on spaces is fixable; xargs understands quoting. I would break on the last colon instead of the first, though, since file doesn't usually include colons in its ASCII text type strings.