Bullet Proof Text - regex

I have no idea what is going on, but grep, awk, and sed have been rendered neutered in the face of a series of text files. Simply put, they will not work. I cannot pattern match over a range with awk using proven personal and public examples. I cannot get sed to use the p or i options, but s still works. Both awk and sed have the odd behavior of simply printing everything regardless of whether the pattern matches or not. And grep will find a word (via regex or string), but if I go to exclude the word (-v), it erases everything. And I mean everything.
I don't think pasting code will be useful, but I am amenable to this. I don't think pasting text will work either.
Is there some super secret setting that renders these programs daft? I've made sure everything is saved in UTF-8, and have run tr -d '\r\n' and its permutations over everything. This is a Linux box and I am beating my head against the table.
All of this is being done in a Linux environment with bash.
Any ideas?

@GordonDavisson seems to have nailed it. Your tr -d '\r\n' turned your file into one long line without a trailing newline, so you should expect grep -v <something that appears in the file> to output nothing [at best] since everything's on one line. And while some tools will do their best with it, you shouldn't even expect UNIX tools to be able to handle it at all, since it's not a valid text file without a trailing newline. Look:
$ cat file
the
quick
brown
dog
$ grep bro file
brown
$ grep -v bro file
the
quick
dog
$ tr -d '\r\n' < file > file2
$ cat file2
thequickbrowndog$
$ grep bro file2
thequickbrowndog
$ grep -v bro file2
$
Not sure what you wanted to achieve with that tr, so not sure what to advise you to do with the file now.
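For what it's worth, if the original intent was just to strip DOS carriage returns, deleting only \r (and leaving the newlines alone) keeps the file a valid text file. A minimal sketch with a throwaway sample file:

```shell
# Build a sample file with DOS (CR+LF) line endings.
printf 'the\r\nquick\r\nbrown\r\ndog\r\n' > file

# Delete only the carriage returns; the newlines survive.
tr -d '\r' < file > file2

grep -v bro file2    # prints: the, quick, dog (one per line)
```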

Related

Sed doesn't replace a pattern that is understood by gedit

I need to delete some content that is followed by 5 hyphens (that are in separate line) from 1000 files. Basically it looks like this:
SOME CONTENT
-----
SOME CONTENT TO BE DELETED WITH 5 HYPHENS ABOVE
I've tried to do that with this solution, but it didn't work for me:
this command — sed '/-----/,$ d' *.txt -i — can't be used because some of these texts have lines with more than 5 hyphens;
this command — sed '/^-----$/,$ d' *.txt -i — resulted in all the files being left unchanged.
So I figured out that it might be something about the "^" and "$" characters, but I am both a sed and RegEx newbie, to be honest, and I don't know what the problem is.
I've also found out that this RegEx — ^-{5}$(\s|\S)*$ — is good for capturing only those blocks which start with exactly 5 hyphens, but putting it into a sed command has no effect (both the hyphens and the text after them stay where they were).
There's something I don't understand about sed probably, because when I use the above expression with gedit's Find&Replace, it works flawlessly. But I don't want to open, change and save 1000 files manually.
I am asking this question kinda again, because the given solution (the above link) didn't help me.
The first command I've posted (sed /-----/,$ d' *.txt -i) also resulted in deleting the full content of some files, for instance a file that had 5 hyphens and then a line with a single space (and no more text) at the bottom of it:
SOME CONTENT
-----
single space
EDIT:
Yes, I forgot about ' here, but in the Terminal I used these commands with it.
Yes, these files end with \n or \r. Is there a solution for it?
I think you want this:
sed '/^-\{5\}/,$ d' *.txt -i
Note that { and } need escaping.
$ sed -n '/^-----/q;p' file
SOME CONTENT
or
$ sed -nE '/^-{5}/q;p' file
SOME CONTENT
Are you just trying to delete from ----- on its own line (which may end with \r) to the end of the file? That'd be:
awk '/^-----\r?$/{exit} {print}' file
The above will work using all awks in all shells in all UNIX systems.
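A quick sanity check of the exit-before-print idea on a throwaway sample (the file name here is just for the demo):

```shell
# Reproduce the question's layout in a scratch file.
printf '%s\n' 'SOME CONTENT' '-----' 'SOME CONTENT TO BE DELETED' > sample.txt

# Exit as soon as the hyphens-only line (optionally CR-terminated) is seen,
# before it gets printed.
awk '/^-----\r?$/{exit} {print}' sample.txt    # prints: SOME CONTENT
```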

Use of grep + sed based on a pattern file?

Here's the problem: I have ~35k files that might or might not contain one or more of the strings in a list of 300 lines, each line containing a regex.
If I grep -rnwl 'C:\out\' --include=*.txt -E --file='comp.log', I see there are a few thousand files that contain a match.
Now how do I get sed to delete each line in these files containing the strings in comp.log used before?
Edit: comp.log contains a simple regex on each line, but for the most part each string to be matched is unique.
This is an example of how it is structured:
server[0-9]\/files\/bobba fett.stw
[a-z]+ mochaccino
[2-9] CheeseCakes
...
Etc. Silly examples aside, it goes to show each line is unique save for a few variations, so it shouldn't affect what I really want: to see if any of these lines match the lines in the file being worked on. It's no different than 's/pattern/replacement/' except that I want to use the patterns in the file instead of inline.
OK, here's an update (S.O. gets impatient if I don't declare the question answered after a few days).
After MUCH fiddling with the @Kenavoz/@Fischer approach, I found a totally different solution, but first things first:
Creating a modified pattern list for sed to work with does work,
as does @werkritter's approach of dropping sed altogether (this one I find the most... err... "least convoluted" way around the problem).
I couldn't make @mklement's answer work under Windows/Cygwin (it did work under Ubuntu, so... not sure what that means. Figures.)
What ended up solving the problem in a more... long-term, reusable form was a wonderful program pointed out by a colleague, called PowerGrep. It really blows every other option out of the water. Unfortunately it's Windows-only AND it's not free. (Not even advertising here; the thing is not cheap, but it does solve the problem.)
So considering @werkritter's reply was not a "proper" answer and I can't just choose both @Lars Fischer's and @Kenavoz's answers as a solution (they complement each other), I am awarding @Kenavoz the tickmark for being first.
Final thoughts: I was hoping for a simpler, universal and free solution, but apparently there is not one.
You can try this:
sed -f <(sed 's/^/\//g;s/$/\/d/g' comp.log) file > outputfile
All the regexes in comp.log are formatted into sed addresses with a d command: /regex/d. This command deletes lines matching the patterns.
This inner sed is sent as a file (with process substitution) to the -f option of the outer sed, which is applied to file.
To delete just the strings matching the patterns (not the whole lines):
sed -f <(sed 's/^/s\//g;s/$/\/\/g/g' comp.log) file > outputfile
Update:
The command output is redirected to outputfile.
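To see what the inner sed actually generates, here's a small sketch with made-up pattern and data files. Note that patterns like [a-z]+ use ERE syntax, so the outer sed may also need -E (or -r) to interpret them as intended:

```shell
# Made-up pattern list in the question's format.
printf '%s\n' '[a-z]+ mochaccino' '[2-9] CheeseCakes' > comp.log

# The inner sed wraps each pattern into a /regex/d command:
sed 's/^/\//g;s/$/\/d/g' comp.log
# prints:
# /[a-z]+ mochaccino/d
# /[2-9] CheeseCakes/d

# Made-up data file; -E so "+" is treated as an ERE quantifier.
printf '%s\n' 'double mochaccino' 'keep this line' '3 CheeseCakes' > file
sed -E -f <(sed 's/^/\//g;s/$/\/d/g' comp.log) file    # prints: keep this line
```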
Some ideas, but not a complete solution, as it requires some adapting to your script (not shown in the question).
I would convert comp.log into a sed script containing the necessary deletes:
cat comp.log | sed -r "s+(.*)+/\1/ d;+" > comp.sed
That would make your example comp.sed look like:
/server[0-9]\/files\/bobba fett.stw/ d;
/[a-z]+ mochaccino/ d;
/[2-9] CheeseCakes/ d;
then I would apply the comp.sed script to each file reported by grep (With your -rnwl that would require some filtering to get the filename.):
sed -i.bak -f comp.sed $AFileReportedByGrep
If you have GNU sed, you can use -i for in-place replacement creating a .bak backup; otherwise, pipe to a temporary file.
Both Kenavoz's answer and Lars Fischer's answer use the same ingenious approach:
transform the list of input regexes into a list of sed match-and-delete commands, passed as a file acting as the script to sed via -f.
To complement these answers with a single command that puts it all together, assuming you have GNU sed and your shell is bash, ksh, or zsh (to support <(...)):
find 'c:/out' -name '*.txt' -exec sed -i -r -f <(sed 's#.*#/\\<&\\>/d#' comp.log) {} +
find 'c:/out' -name '*.txt' matches all *.txt files in the subtree of directory c:/out.
-exec ... + passes as many matching files as will fit on a single command line to the specified command, typically resulting only in a single invocation.
sed -i updates the input files in-place (conceptually speaking - there are caveats); append a suffix (e.g., -i.bak) to save backups of the original files with that suffix.
sed -r activates support for extended regular expressions, which is what the input regexes are.
sed -f reads the script to execute from the specified filename, which in this case, as explained in Kenavoz's answer, uses a process substitution (<(...)) to make the enclosed sed command's output act like a [transient] file.
The s/// sed command - which uses the alternative delimiter # to facilitate use of literal / - encloses each line from comp.log in /\<...\>/d to yield the desired deletion command; the enclosing of the input regex in \<...\> ensures matching as a word, as grep -w does.
This is the primary reason why GNU sed is required, because neither POSIX EREs (extended regular expressions) nor BSD/OSX sed support \< and \>.
However, you could make it work with BSD/OSX sed by replacing -r with -E, and \< / \> with [[:<:]] / [[:>:]].
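To see the word-boundary wrapping at work in isolation, here's a sketch of just the inner GNU-sed transformation on a made-up comp.log (the file name and patterns are illustrative):

```shell
# Made-up pattern list, matching the question's format.
printf '%s\n' 'server[0-9]\/files\/bobba fett.stw' '[a-z]+ mochaccino' > comp.log

# Wrap each regex in /\<...\>/d, using # as the s/// delimiter so the
# slashes inside the patterns need no extra escaping here.
sed 's#.*#/\\<&\\>/d#' comp.log
# prints:
# /\<server[0-9]\/files\/bobba fett.stw\>/d
# /\<[a-z]+ mochaccino\>/d
```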

Get all Commands without arguments from history (with Regex)

I have just started learning shell commands and how to script in bash.
Now I'd like to solve the task mentioned in the title.
What I get from the history command (without line numbers):
ls [options/arguments] | grep [options/arguments]
find [...] -exec sed [...]
du [...]; awk [...] file
And how my output should look like:
ls
grep
find
sed
du
awk
I already found a solution, but it doesn't really satisfy me. So far I declared three arrays and used readarray -t < <(...) twice, in order to save the content from my history and, after that, in combination with compgen -ac, to get all commands which I can possibly run. Then I compared the contents of both with loops, and saved the command every time it matched a line in the "history" array. A lot of effort for a simple exercise, I guess.
Another solution I thought of, is to do it with regex pattern matching.
A command usually starts at the beginning of the line, after a pipe, an execute or after a semicolon. And maybe more, I just don't know about yet.
So I need a regex which gives me only the next word after it matched one of these conditions. That's the command I've found and it seems to work:
grep -oP '(?<=|\s/)\w+'
Here it uses the pipe | as a condition. But I need to insert the others too. So I have put the pattern in double quotes, created an array with all conditions and tried it as recommended:
grep -oP "(?<=$condition\s/)\w+"
But no matter how I insert the variable, it fails. To keep it short, I couldn't figure out how the command works, especially not the regex part.
So, how can I solve it using regular expressions? Or with a better approach than mine?
Thank you in advance! :-)
This is simple and works quite well:
history -w /dev/stdout | cut -f1 -d ' '
You can use this awk with the fc command:
awk '{print $1}' <(fc -nl)
find
mkdir
find
touch
tty
printf
find
ls
fc -nl lists entries from history without the line numbers.
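As a sketch of the regex route the question asked about: with GNU grep's -P you can put the "a command starts here" conditions (line start, pipe, semicolon, &&) into one non-capturing group and use PCRE's \K to drop the delimiter from the match. The sample lines below are made up, and edge cases like find's -exec are deliberately not handled:

```shell
# Made-up history-style lines piped straight in for the demo.
printf '%s\n' 'ls -la | grep foo' 'find . -name "*.txt"' 'du -sh; awk "{print}" file' |
  grep -oP '(?:^|\||;|&&)\s*\K[a-z][\w.-]*'
# prints:
# ls
# grep
# find
# du
# awk
```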

Sed regex to find-replace version numbers

I'm new to sed, trying to write a script to find/replace text in a file. The file (test.txt) looks like this:
hello_world (1.2.0.123)
and I'm finding that this script (which I inherited):
sed -i 's/\(^\s*hello_world \)(.*)/\1hello_world (1.2.0.456)/' test.txt
is leading to:
hello_world hello_world (1.2.0.456)
when I need it to be
hello_world (1.2.0.456)
I'm not sure how to make the first part match only the parentheses, any assistance would be appreciated.
EDIT
The whitespace before the hello_world is important
The sed line is being auto-generated using variables etc. I'm looking for a way to make this regex work without changing that. The variables I have to play with are
variable1: hello_world
variable2: hello_world (1.2.0.456)
(hopefully it's obvious where these variables sit within the sed expression)
EDIT
I got this sorted in the end, answer below if anyone else is interested.
Got it
sed -i 's/\(^\s*\)hello_world (.*)/\1hello_world (1.2.0.456)/' test.txt
sed -i -e 's/^\([[:blank:]]*hello_world \).*/\1(1.2.0.456)/' YourFile
\1 is the content of the first \( \) group, so \1hello_world writes it twice in your sample.
Be careful with escaping: the behavior changes depending on whether extended regexes are enabled (-E/-r), and non-GNU sed often needs \( \) escaped for grouping patterns.
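A quick check of the backreference behavior with a throwaway test file (the name and versions just mirror the question's example):

```shell
# Recreate the question's input, leading whitespace included.
printf '    hello_world (1.2.0.123)\n' > test.txt

# \1 holds only the captured blanks + name, so nothing gets duplicated.
sed 's/^\([[:blank:]]*hello_world \).*/\1(1.2.0.456)/' test.txt
# prints: "    hello_world (1.2.0.456)"
```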

How do I find broken NMEA log sentences with grep?

My GPS logger occasionally leaves "unfinished" lines at the end of the log files. I think they're only at the end, but I want to check all lines just in case.
A sample complete sentence looks like:
$GPRMC,005727.000,A,3751.9418,S,14502.2569,E,0.00,339.17,210808,,,A*76
The line should start with a $ sign, and end with an * and a two character hex checksum. I don't care if the checksum is correct, just that it's present. It also needs to ignore "ADVER" sentences which don't have the checksum and are at the start of every file.
The following Python code might work:
import re
from path import path

nmea = re.compile(r"^\$.+\*[0-9A-F]{2}$")
for log in path("gpslogs").files("*.log"):
    for line in log.lines():
        if not nmea.match(line) and not "ADVER" in line:
            print "%s\n\t%s\n" % (log, line)
Is there a way to do that with grep or awk or something simple? I haven't really figured out how to get grep to do what I want.
Update: Thanks @Motti and @Paul, I was able to get the following to do almost what I wanted, but had to use single quotes and remove the trailing $ before it would work:
grep -nvE '^\$.*\*[0-9A-F]{2}' *.log | grep -v ADVER | grep -v ADPMB
Two further questions arise, how can I make it ignore blank lines? And can I combine the last two greps?
A minimum of testing shows that this should do it:
grep -Ev "^\$.*\*[0-9A-Fa-f]{2}$" a.txt | grep -v ADVER
-E use extended regexp
-v Show lines that do not match
^ starts with
.* anything
\* an asterisk
[0-9A-Fa-f] hexadecimal digit
{2} exactly two of the previous
$ end of line
| grep -v ADVER weed out the ADVER lines
HTH, Motti.
@Motti's answer doesn't ignore ADVER lines, but you can easily pipe the results of that grep to another:
grep -Ev "^\$.*\*[0-9A-Fa-f]{2}$" a.txt |grep -v ADVER
@Tom (rephrased): I had to remove the trailing $ for it to work.
Removing the $ means that the line may end with something else (e.g. the following will be accepted)
$GPRMC,005727.000,A,3751.9418,S,14502.2569,E,0.00,339.17,210808,,,A*76xxx
@Tom: And can I combine the last two greps?
grep -Ev "ADVER|ADPMB"
@Motti: Combining the greps isn't working; it's having no effect.
I understand that without the trailing $ something else may follow the checksum and still match, but it didn't work at all with it, so I had no choice...
GNU grep 2.5.3 and GNU bash 3.2.39(1), if that makes any difference.
And it looks like the log files are using DOS line-breaks (CR+LF). Does grep need a switch to handle that properly?
@Tom
GNU grep 2.5.3 and GNU bash 3.2.39(1) if that makes any difference.
And it looks like the log files are using DOS line-breaks (CR+LF). Does grep need a switch to handle that properly?
I'm using grep (GNU grep) 2.4.2 on Windows (for shame!) and it works for me (DOS line-breaks are naturally accepted). I don't really have access to other OSs at the moment, so I'm sorry but I won't be able to help you any further :o(
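For what it's worth, the CR+LF endings would explain the trailing-$ failure: the \r sits between the checksum and the end of the line, so [0-9A-Fa-f]{2}$ can't match. Note that GNU grep (without -P) does not understand the \r escape in an ERE, so the carriage return has to be embedded literally, e.g. via bash's $'...' quoting. A sketch on a made-up log, with the ADVER/ADPMB and blank-line filters folded into one second grep:

```shell
# Made-up log with DOS line endings: one complete sentence, an ADVER line,
# a blank line, and a broken sentence.
printf '$GPRMC,005727.000,A,3751.9418,S,14502.2569,E,0.00,339.17,210808,,,A*76\r\n$ADVER,3.2.1\r\n\r\n$GPBAD,unfinished\r\n' > sample.log

# $'...' turns \r into a real carriage return; \r? absorbs the DOS CR so the
# $ anchor works again. The second grep drops ADVER/ADPMB and blank lines.
grep -Ev $'^\\$.*\\*[0-9A-Fa-f]{2}\r?$' sample.log | grep -Ev $'ADVER|ADPMB|^\r?$'
# prints: $GPBAD,unfinished  (still carrying its trailing CR)
```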