Grep returning regex results in recursive search - regex

I've constructed a grep command that I use to search recursively through a directory of files for a pattern within them. The problem is that grep only returns the names of the files the pattern occurs in, not the matched text itself. How do I return the actual match?
Example:
File somefile.bin contains somestring0987654321\x00123\x0045\x00 (where \x00 is a NUL byte) in a directory with one million other files
Command:
$ grep -EsniR -A 1 -B 1 '([a-zA-Z0-9]+)\x00([0-9]+)\x00([0-9]+)\x00' *
Current result:
Binary file somefile.bin matches
The desired result (or close to it):
Binary file somefile.bin matches
<line above match>
somestring0987654321\x00123\x0045\x00
<line below match>

You can try the -a option:
File and Directory Selection
-a, --text
Process a binary file as if it were text; this is equivalent to
the --binary-files=text option.
--binary-files=TYPE
If the first few bytes of a file indicate that the file contains
binary data, assume that the file is of type TYPE. By default,
TYPE is binary, and grep normally outputs either a one-line
message saying that a binary file matches, or no message if
there is no match. If TYPE is without-match, grep assumes that
a binary file does not match; this is equivalent to the -I
option. If TYPE is text, grep processes a binary file as if it
were text; this is equivalent to the -a option. Warning: grep
--binary-files=text might output binary garbage, which can have
nasty side effects if the output is a terminal and if the
terminal driver interprets some of it as commands.
But the problem is that binary files have no real lines, so I'm not sure what you'd want the output to look like. You may see random garbage or even the whole file, and special characters that mess with your terminal may be printed.
If you want to restrict the output to the match itself, consider the -o option:
-o, --only-matching
Print only the matched (non-empty) parts of a matching line,
with each such part on a separate output line.
The context control is limited to adding a certain number of lines before or after the match, which will probably not work well here. So if you want a context of certain number of bytes, you'll have to change the pattern itself.
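A minimal sketch of combining -a with -o for the example from the question. It assumes GNU grep built with PCRE support, so that -P is available and \x00 in the pattern denotes a NUL byte; the temporary directory and the file contents are made up for illustration. The output is piped through cat -v so that NUL bytes reach the terminal as ^@ rather than as raw bytes.

```shell
#!/bin/sh
# Hypothetical test data: a "binary" file with NUL-terminated fields.
tmp=$(mktemp -d)
printf 'header\nsomestring0987654321\000123\00045\000\nfooter\n' > "$tmp/somefile.bin"

# -r recurse, -a treat binary files as text, -o print only the matched part,
# -P Perl-style regex (assumed to be available), -h omit the file name prefix.
# cat -v renders each NUL byte as ^@ so the terminal is not disturbed.
matches=$(LC_ALL=C grep -rhaoP '[a-zA-Z0-9]+\x00[0-9]+\x00[0-9]+\x00' "$tmp" | cat -v)
printf '%s\n' "$matches"

rm -rf "$tmp"
```

Dropping -h brings the file name back in front of each match, which is probably what you want when searching a large directory tree.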

Try...
grep -rnw "<regex>" <folder>
Much easier. More examples here --> https://computingbro.com/2020/05/10/word-search-in-linux-unix-filesystem/

Related

How do I grep a string using the previous output for my next argument?

There is a string located within a file that starts with 4bceb and is 32 characters long.
To find it I tried the following
Input:
find / -type f 2>/dev/null | xargs grep "4bceb\w{27}" 2>/dev/null
After entering the command, it seems like it is waiting for some additional input.
Your command seems alright in principle, i.e. it should correctly execute the grep command for the files find returns. However, I don't believe your regular expression (or rather the way you call grep) is correct for what you want to achieve.
First, in order to get your expression to work, grep has to interpret {27} as a repetition count, which its default (basic) syntax does not. Specifying the -P flag (Perl syntax) does this; with GNU grep, -E (extended syntax) should work as well.
Second, your regexp will return the full lines that contain sequences starting with "4bceb" that are at least 32 characters long, but may be longer as well. If, for example your ./test.txt file contents were
4bcebUUUUUUUUUUUUUUUUUUUUUUUU31
4bcebVVVVVVVVVVVVVVVVVVVVVVVVV32
4bcebWWWWWWWWWWWWWWWWWWWWWWWWWW33
sometext4bcebYYYYYYYYYYYYYYYYYYYYYYYYY32somemoretext
othertext 4bcebZZZZZZZZZZZZZZZZZZZZZZZZZ32 evenmoretext
your output would include all lines except the first one (in which the sequence is shorter than 32 characters). If you actually want to limit your results to lines that just contain sequences that are exactly 32 characters long, you can use the -w flag (for word-regexp) with grep, which would only return lines 2 and 5 in the above example.
Third, if you only want the match but not the surrounding text, the grep flag -o will do exactly this.
And finally, you don't need to pipe the find output into xargs, as grep can directly do what you want:
grep -rnPow / -e "4bceb\w{27}"
will recursively (-r) scan all files starting from / and return just the ones that contain matching words, along with the matches (as well as the line numbers they were found in, as result of the flag -n):
./test.txt:2:4bcebVVVVVVVVVVVVVVVVVVVVVVVVV32
./test.txt:5:4bcebZZZZZZZZZZZZZZZZZZZZZZZZZ32
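The effect of -w and -o can be checked on a small test file. This is only a sketch: it assumes GNU grep with -P support and bash, and the pad helper and the temporary file are made up for the example (the lines are generated so their lengths are exact).

```shell
#!/bin/bash
# pad N C prints the character C repeated N times (helper invented for this sketch).
pad() { printf '%*s' "$1" '' | tr ' ' "$2"; }

tmp=$(mktemp -d)
{
  echo "4bceb$(pad 26 U)"                          # 31 characters: too short
  echo "4bceb$(pad 27 V)"                          # exactly 32 characters
  echo "4bceb$(pad 28 W)"                          # 33 characters: rejected by -w
  echo "sometext4bceb$(pad 27 Y)somemoretext"      # embedded in a longer word
  echo "othertext 4bceb$(pad 27 Z) evenmoretext"   # a 32-character word of its own
} > "$tmp/test.txt"

# -n line numbers, -P Perl syntax, -o only the match, -w whole words only
result=$(grep -nPow -e '4bceb\w{27}' "$tmp/test.txt")
printf '%s\n' "$result"

rm -rf "$tmp"
```

Only lines 2 and 5 should be reported, each reduced to the 32-character match.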

How can I combine multiple text files, remove duplicate lines and split the remaining lines into several files of certain length?

I have a lot of relatively small files with about 350.000 lines of text.
For example:
File 1:
asdf
wetwert
ddghr
vbnd
...
sdfre
File 2:
erye
yren
asdf
jkdt
...
uory
As you can see line 3 of file 2 is a duplicate of line 1 in file 1.
I want a program / Notepad++ Plugin that can check and remove these duplicates in multiple files.
The next problem I have is that I want all lists to be combined into large 1.000.000 line files.
So, for example, I have these files:
648563 lines
375924 lines
487036 lines
I want them to result in these files:
1.000.000 lines
511.523 lines
And the last 2 files must consist of only unique lines.
How can I possibly do this? Can I use some programs for this? Or a combination of multiple Notepad++ Plugins?
I know GSplit can split files of 1.536.243 into files of 1.000.000 and 536.243 lines, but that is not enough, and it doesn't remove duplicates.
I do want to create my own Notepad++ plugin or program if needed, but I have no idea how and where to start.
Thanks in advance.
You have asked about Notepad++ and are thus using Windows. On the other hand, you said you want to create a program if needed, so I guess the main goal is to get the job done.
This answer uses Unix tools - on Windows, you can get those with Cygwin.
To run the commands, you have to type (or paste) them in the terminal / console.
cat file1 file2 file3 | sort -u | split -l1000000 - outfile_
cat reads the files and echoes them; normally, to the screen, but the pipe | gets the output of the command left to it and pipes it through to the command on the right.
sort obviously sorts them, and the switch -u tells it to remove duplicate lines.
The output is then piped to split which is being told to split after 1000000 lines by the switch -l1000000. The - (with spaces around) tells it to read its input not from a file but from "standard input"; the output of sort -u in this case. The last word, outfile_, can be changed by you, if you want.
Written like it is, this will result in files like outfile_aa, outfile_ab and so on - you can modify this with the last word in this command.
If you have all the files in one directory, and nothing else is in there, you can use * instead of listing all the files:
cat * | sort -u | split -l1000000 - outfile_
If the files might contain empty lines, you might want to remove them. Otherwise, they'll be sorted to the top and your first file will not have the full 1.000.000 values:
cat file1 file2 file3 | grep -v '^\s*$' | sort -u | split -l1000000 - outfile_
This will also remove lines that consist only of whitespace.
grep filters input using regular expressions. -v inverts the filter; normally, grep keeps only lines that match. Now, it keeps only lines that don't match. ^\s*$ matches all lines that consist of nothing else than 0 or more characters of whitespace (like spaces or tabs).
If you need to do this regularly, you can write a script so you don't have to remember the details:
#!/bin/sh
cat * | sort -u | split -l1000000 - outfile_
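Save this as a file (for example combine.sh) outside the data directory, because cat * would otherwise read the script itself (and any outfile_ files left from an earlier run) as input. Make it executable with chmod +x combine.sh and run it from inside the data directory, for example, if the script sits one level up, with
../combine.sh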
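To see the pipeline at work on a small scale, here is a sketch with two tiny made-up files and a chunk size of 2 lines instead of 1000000. The file names, the temporary directory and the use of [[:space:]] (a portable spelling of \s) are choices made for this example only.

```shell
#!/bin/sh
tmp=$(mktemp -d)
printf 'asdf\nwetwert\nddghr\n' > "$tmp/file1"
printf 'erye\nasdf\n\nyren\n'   > "$tmp/file2"   # contains a duplicate and an empty line

# Merge, drop blank lines, sort with duplicates removed, cut into 2-line chunks.
# LC_ALL=C keeps the sort order byte-based and therefore predictable.
cat "$tmp/file1" "$tmp/file2" |
  grep -v '^[[:space:]]*$' |
  LC_ALL=C sort -u |
  split -l 2 - "$tmp/outfile_"

# Show how many lines ended up in each output file.
wc -l "$tmp"/outfile_*
```

The five unique lines are spread over outfile_aa, outfile_ab and outfile_ac, with only the last file holding fewer than the requested number of lines.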

Why does grep match all the lines no matter what the pattern is?

I'm having a problem using grep.
I have a file http://pastebin.com/HxAcciCa that I want to check for certain patterns. When I try to search it, grep returns all the lines, provided that the pattern exists somewhere in the file.
To explain more this is the code that I'm running
grep -F "ENVIRO" "$file_pos" >> blah
No matter what else I try even if I provide a whole line as a pattern bash always returns all the lines.
These are variations of what I'm trying:
grep -F "E20" "$file_pos" >> blah
grep E20 "$file_pos" >> blah
grep C:\E20-II\ENVIRO\SSNHapACS480.dll "$file_pos" >> blah
grep -F C:\E20-II\ENVIRO\SSNHapACS480.dll "$file_pos" >> blah
Also for some strange reasons when adding the -x option to grep, it doesn't return any line despite the fact that the exact pattern exists.
I've searched the web and the bash documentation for the cause but couldn't find anything.
My final test was the following
grep -F -C 1 "E20" "$store_pos" >> blah #store_pos has the same value as $file_pos
I thought maybe it was printing the lines after the result but that was not the case.
I was using the blah file to see the output.
Also I'm using Linux mint rebecca.
Finally, although the title is quite similar, this question is not the same as Why does grep match all lines for the pattern "\'"
And finally I would like to say that I am new to bash.
I suspect the error might be due to the main file http://pastebin.com/HxAcciCa rather than the code?
From the comments, it appears that the file has carriage returns delimiting the lines, rather than the linefeeds that grep expects; as a result, grep sees the file as one huge line, that either matches or fails to match as a whole.
(Note: there are at least three different conventions about how to delimit the lines in a "plain text" file -- unix uses newline (\n), DOS/Windows uses carriage return followed by newline (\r\n), and pre-OSX versions of MacOS used just carriage return (\r).)
I'm not clear on how your file wound up in this format, but you can fix it easily with:
tr '\r' '\n' <badfile >goodfile
or on the fly with:
tr '\r' '\n' <badfile | grep ...
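The following sketch reproduces the symptom with a made-up three-line file that uses bare carriage returns, and then applies the tr fix. The file names and contents are invented for the example.

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Three "lines" separated by carriage returns only, as in old Mac OS files.
printf 'alpha\rE20 beta\rgamma\r' > "$tmp/badfile"

# grep sees a single line, so any hit selects the entire file content.
before=$(grep -c 'E20' "$tmp/badfile")

# Convert CR to LF, then search again: only the real matching line is left.
tr '\r' '\n' < "$tmp/badfile" > "$tmp/goodfile"
after=$(grep 'E20' "$tmp/goodfile")

printf 'matching lines before conversion: %s\n' "$before"
printf 'match after conversion: %s\n' "$after"
```

Note that this applies to files with bare \r line endings. For DOS/Windows files (\r\n), replacing \r with \n would insert an empty line after every line, so deleting the carriage returns with tr -d '\r' is the better fit there.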
Check the line endings in your input file: file, wc -l.
Check you are indeed using the correct grep: which grep.
Use > to redirect the output, or | more or | less to not be confused by earlier attempts you are appending to.
Edit: Looks like your file has the wrong line endings (old Mac OS (CR) perhaps). If you have dos2unix you can try to convert them to Unix style line endings (LF).
I don't have access to a PC at the moment, but what could possibly help you troubleshoot:
1. Use grep --color -F to see if it matches correctly.
2. After your statement, use | cat -A to see if there are any surprising control characters. Lines should end in $; any other characters like ^I or ^M can sometimes be a headache.
I suspect number 2, as it seems to be Windows output. In that case, cat filename | dos2unix | grep stmt should solve it.
Did you save the dos2unix output as another file?
Just double check the file, it should be similar to this:
[root#pro-mon9001 ~]# cat -A Test.txt
Windows^M$
Style^M$
Files^M$
Are^M$
Hard ^M$
To ^M$
Parse^M$
[root#pro-mon9001 ~]# dos2unix Test.txt
dos2unix: converting file Test.txt to Unix format ...
[root#pro-mon9001 ~]# cat -A Test.txt
Windows$
Style$
Files$
Are$
Hard$
To$
Parse$
Now it should parse properly - so just verify that it did convert the file properly
Good luck!

Unpredictable behavior in sed interpreters output from multiple expressions

Why does GNU sed sometimes handle substitution with piped output into another sed instance differently than when multiple expressions are used with the same one?
Specifically, for msys/mingw sessions, in the /etc/profile script I have a series of manipulations that "rearrange" the order of the environment variable PATH and removes duplicate entries.
Take note that while sed normally treats each line of input separately (and therefore can't easily substitute '\n' in the input stream), this sed statement does a substitution of ':' with '\n', so it still handles the entire input stream like one line (with '\n' characters in it). This behavior stays true for all sed expressions in the same instance of sed (basically until you redirect or pipe the output into another program).
Here's the obligatory specs:
Windows 7 Professional Service Pack 1
HP Pavilion dv7-6b78us
16 GB DDR3 RAM
MinGW-w64 (x86_64-w64-mingw32-gcc-4.7.1.2-release-win64-rubenvb) mounted on /mingw/
MSYS (20111123) mounted on / and on /usr/
$ uname -a="MINGW32_NT-6.1 CHRIV-L09 1.0.17(0.48/3/2) 2011-04-24 23:39 i686 Msys"
$ which sed="/bin/sed.exe" (it's part of MSYS)
$ sed --version="GNU sed version 4.2.1"
This is the contents of PATH before manipulation:
PATH='.:/usr/local/bin:/mingw/bin:/bin:/c/PHP:/c/Program Files (x86)/HP SimplePass 2011/x64:/c/Program Files (x86)/HP SimplePass 2011:/c/Windows/system32:/c/Windows:/c/Windows/System32/Wbem:/c/Windows/System32/WindowsPowerShell/v1.0:/c/si:/c/android-sdk:/c/android-sdk/tools:/c/android-sdk/platform-tools:/c/Program Files (x86)/WinMerge:/c/ntp/bin:/c/GnuWin32/bin:/c/Program Files/MySQL/MySQL Server5.5/bin:/c/Program Files (x86)/WinSCP:/c/Program Files (x86)/Overlook Fing 2.1/bin:/c/Program Files/7-zip:.:/c/Program Files/TortoiseGit/bin:/c/Program Files (x86)/Git/bin:/c/VS10/VC/bin/x86_amd64:/c/VS10/VC/bin/amd64:/c/VS10/VC/bin'
This is an excerpt of /etc/profile (where I have begun the PATH manipulation):
set | grep --color=never ^PATH= | sed -e "s#^PATH=##" -e "s#'##g" \
-e "s/:/\n/g" -e "s#\n\(/[^\n]*tortoisegit[^\n]*\)#\nZ95-\1#ig" \
-e "s#\n\(/[a-z]/win\)#\nZ90-\1#ig" -e "s#\n\(/[a-z]/p\)#\nZ70-\1#ig" \
-e "s#\.\n#A10-.\n#g" -e "s#\n\(/usr/local/bin\)#\nA15-\1#ig" \
-e "s#\n\(/bin\)#\nA20-\1#ig" -e "s#\n\(/mingw/bin\)#\nA25-\1#ig" \
-e "s#\n\(/[a-z]/vs10/vc/bin\)#\nA40-\1#ig"
The last sed expression in that line basically looks for lines that begin with "/c/VS10/VC/bin" and prepends 'A40-' to them, like this:
...
/c/si
A40-/c/VS10/VC/bin
A40-/c/VS10/VC/bin/amd64
A40-/c/VS10/VC/bin/x86_amd64
/c/GnuWin32/bin
...
I like my sed expressions to be flexible (path structures change), but I don't want it to match the lines that end with amd64 or x86_amd64 (those are going to have a different string prepended). So I change the last expression to:
-e "s#\n\(/[a-z]/vs10/vc/bin\)\n#\nA40-\1\n#ig"
This works:
...
/c/si
A40-/c/VS10/VC/bin
/c/VS10/VC/bin/amd64
/c/VS10/VC/bin/x86_amd64
/c/GnuWin32/bin
...
Then, (to match any "line" matching the pseudocode "/x/.../bin") I change the last expression to:
-e "s#\n\(/[a-z]/.*/bin\)\n#\nA40-\1\n#ig"
Which produces:
...
/c/si
/c/VS10/VC/bin
/c/VS10/VC/bin/amd64
/c/VS10/VC/bin/x86_amd64
/c/GnuWin32/bin
...
??? - sed didn't match any character ('.') any number of times ('*') in the middle of the line ???
But, if I pipe the output into a different instance of sed (and compensate for sed handling each "line" separately) like this:
| sed -e "s#^\(/[a-z]/.*/bin\)$#A40-\1#ig"
I get:
sed: -e expression #1, char 30: unterminated `s' command
??? How is that unterminated? It's got all three '#' characters after the s, has the modifiers 'i' and 'g' after the third '#', and the entire expression is in double quotes ('"'). Also, there are no escapes ('\') immediately preceding the delimiters, and the delimiter is not a part of either the search or the replacement. Let's try a different delimiter than '#', like '~':
I use:
| sed -e "s~^\(/[a-z]/.*/bin\)$~A40-\1~ig"
and, I get:
...
/c/si
A40-/c/VS10/VC/bin
/c/VS10/VC/bin/amd64
/c/VS10/VC/bin/x86_amd64
A40-/c/GnuWin32/bin
...
And, that is correct! The only thing I changed was the delimiter from '#' to '~' and it worked ???
This is not (even close to) the first time that sed has produced unexplainable results for me.
Why, oh, why, is sed NOT matching syntax in an expression in the same instance, but IS matching when piped into another instance of sed?
And, why, oh, why, do I have to use a different delimiter when I do this (in order not to get an "unterminated 's' command")?
And the real reason I'm asking: Is this a bug in sed, OR, is it correct behavior that I don't understand (and if so, can someone explain why this behavior is correct)? I want to know if I'm doing it wrong, or if I need a different/better tool (or both, they don't have to be mutually exclusive).
I'll mark a response as the answer if someone can either prove why this behavior is correct or prove why it is a bug. I'll gladly accept any advice about other tools or different methods of using sed, but those won't answer the question.
I'm going to have to get better at other text processors (like awk, tr, etc.) because sed is costing me too much time with its unexplainable results.
P.S. This is not the complete logic of my PATH manipulation. The complete logic also finishes prepending all the lines with values from 'A00-' to 'Z99-', then pipes that output into 'sort -u -f' and back into sed to remove those same prefixes on each line and to convert the lines ('\n') back into colons (':'). Then "export PATH='" is prepended to the single line and "'" is appended to it. Then that output is redirected into a temporary file. Next, that temporary file is sourced. And, finally, that temporary file is removed.
The /etc/profile script also displays the contents of PATH before and after sorting (in case it screwed up the path).
P.P.S. I'm sure there is a much better way to do this. It started as some very simple sed manipulations, and grew into the monster you see here. Even if there is a better way, I still need to know why sed is giving me these results.
sed -e "s#^\(/[a-z]/.*/bin\)$#A40-\1#ig"
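is unterminated because, inside double quotes, the shell expands $# (the number of positional parameters) before sed ever sees the expression, so the second delimiter disappears. Put your expressions in single quotes to avoid this.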
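A short sketch that makes the expansion visible. The set -- line is only there to make $# predictable (zero positional parameters), the sample paths are made up, and the I/i flag is assumed to be available as in GNU sed.

```shell
#!/bin/sh
set --   # no positional parameters, so $# expands to 0

# Double quotes: the shell replaces $# before sed would see the expression.
dq="s#^\(/[a-z]/.*/bin\)$#A40-\1#ig"
# Single quotes: the text is passed through unchanged.
sq='s#^\(/[a-z]/.*/bin\)$#A40-\1#ig'

printf '%s\n' "$dq" "$sq"

# The single-quoted expression works as intended, one path entry per line.
out=$(printf '/c/GnuWin32/bin\n/c/si\n' | sed -e "$sq")
printf '%s\n' "$out"
```

The double-quoted version is left with only two # delimiters, which is what sed reports as an unterminated s command. With ~ as the delimiter the problem goes away because $~ is not a shell parameter.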
The expression
-e "s#\n\(/[a-z]/.*/bin\)\n#\nA40-\1\n#ig"
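fails, or doesn't do what you expect, because . also matches the newlines embedded in the pattern space, so the greedy .* runs across several path entries, up to the last /bin that is followed by a newline. Check your whole output: the A40- is there, just much earlier than you expect. Change it to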
-e "s#\n\(/[a-z]/[^\n]*/bin\)\n#\nA40-\1\n#ig"
and it might be more what you expect. This may very well be the case with most of your issues with multi-line modifications.
You can also put the statements, one per line, into a standalone file and invoke sed with sed -f editscript. It might make maintenance of this a bit easier.
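The difference between .* and [^\n]* can be reproduced with a shortened, made-up PATH. This sketch assumes GNU sed, which accepts \n in the replacement and inside a bracket expression.

```shell
#!/bin/sh
# Made-up, shortened PATH for illustration.
path='.:/c/si:/c/VS10/VC/bin:/c/VS10/VC/bin/amd64:/c/GnuWin32/bin:/c/end'

# .* also crosses the embedded newlines, so one long match starts at /c/si.
greedy=$(printf '%s\n' "$path" |
  sed -e 's/:/\n/g' -e 's#\n\(/[a-z]/.*/bin\)\n#\nA40-\1\n#ig')

# [^\n]* stays within a single entry.
fixed=$(printf '%s\n' "$path" |
  sed -e 's/:/\n/g' -e 's#\n\(/[a-z]/[^\n]*/bin\)\n#\nA40-\1\n#ig')

printf 'with .*:\n%s\n\nwith [^\\n]*:\n%s\n' "$greedy" "$fixed"
```

One caveat of matching the newline on both sides: each match consumes its trailing newline, so of two directly adjacent matching entries only the first would get the prefix. Piping into a second sed that works line by line with ^ and $ avoids this.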

awk script to remove ASCII from file type

Here is a simple command
file * | awk '/ASCII text/ {gsub(/:/,"",$1); print $1}' | xargs chmod -x
I am not able to understand the use of awk in the command shown above.
How is it working?
There was a deleted answer which came pretty close to avoiding the problems with whitespace or colons in filenames and the output of file. I've voted to undelete the answer, but I'm going to go ahead and post some improvements to it and add some explanation.
file -0 * | awk -F '\0' '$2 ~ /ASCII text/ {printf "%s\0", $1}' | xargs -0 chmod -x
Since nulls aren't allowed in filenames, it's safe to use them as delimiters. Each step in this pipeline uses nulls. file outputs them, awk accepts them in input and outputs them and xargs accepts them in input. I've also made the match specific to the description field so it won't trigger a false positive in the perhaps unusual case of a file which is named something like "ASCII text" but in fact its contents are not.
As others have said, the AWK command you posted matches lines of output from the file command that include "ASCII text" somewhere in the line. Then every colon is deleted (since gsub() is a global substitution) from field one which is the colon-space-delimited filename. A potential problem occurs if the filename contains either a colon or a space (or both or multiples). The filename will get truncated and the chmod will fail or might even be falsely triggered on a file with a similar name (e.g. "foo bar" and "foo" both exist, "foo" is not an ASCII text file so you don't want it to be touched, but "foo bar" gets truncated to "foo" and oops!). The reason spaces are potential problems is that AWK, by default, does field splitting on spaces and tabs.
Breakdown of the AWK portion of the pipeline you posted:
/ASCII text/ { - for each line that matches the regular expression
gsub(/:/,"",$1); - for each colon (as a regular expression) in the first field, substitute an empty string
print $1} - print the thus modified first field
I'm guessing but it looks like it's extracting the part before the : in the output of the file command (i.e. the filename). The gsub part will remove the : in the filename and so something like foo.txt: ASCII text will become foo.txt ASCII text. Then, the print will print the first item in the space separated list (in this case, the filename foo.txt). All these files will be made unexecutable by the chmod.
This looks quite tedious. It's probably easier to just say awk -F: '{print $1}' after grepping instead of the whole substitution trick. Also, this will break if the filename has spaces in it.
It's using file to determine the type (contents) of each file, then selecting the ones that are ASCII text and removing everything from the first colon (which is assumed to be the separator between the filename and file type; this is fragile when file names have colons in them; as Noufel noted, it's also doing it the hard way), then using xargs to batch them up and clear the execute bits. (The usual reason for doing this is files transferred from Windows, which doesn't have execute bits so often all files end up with execute bits set as seen by Unixes.)
The breakage on spaces is fixable; xargs understands quoting. I would break on the last colon instead of the first, though, since file doesn't usually include colons in its ASCII text type strings.
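The truncation problem and the "break on the last colon" idea can be tried out without touching any real files. The sketch below feeds awk a few canned lines in the name: description layout that file prints; the file names are made up, and the second awk program is just one possible way to split at the last colon, not the command from the question.

```shell
#!/bin/sh
# Canned lines imitating the output of file (made-up names).
sample='notes.txt: ASCII text
foo bar: ASCII text
image.png: PNG image data'

# Original approach: $1 is the first whitespace-separated field,
# so "foo bar" is cut down to "foo".
original=$(printf '%s\n' "$sample" |
  awk '/ASCII text/ {gsub(/:/,"",$1); print $1}')

# Variant: split at the last colon and test only the description part.
lastcolon=$(printf '%s\n' "$sample" |
  awk 'match($0, /:[^:]*$/) && substr($0, RSTART + 1) ~ /ASCII text/ {
         print substr($0, 1, RSTART - 1)
       }')

printf 'original:\n%s\n\nlast colon:\n%s\n' "$original" "$lastcolon"
```

The variant keeps names with spaces or colons intact as long as the description itself contains no colon. Names containing newlines would still need the NUL-delimited approach.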