Find non-ASCII codepoints in a file - regex

I am currently using this regex to find the non-ASCII code points in a file, no matter what encoding:
$ cat test.txt | hd | grep -P " [8-9a-f][\da-f]"
Is there a better, more concise, or less hacky method? I usually use grep -P "[^\x00-\x7f]" to find the offending characters but here I am looking for the offending code points.
Note that the current hacky method does have the nice side effect of showing the surrounding ASCII characters, which is very nice for context.

Using hd, this should be faster:
hd test.txt |grep -w '[89a-f][0-9a-f]'
(grep -P invokes libpcre and is slower. grep -w searches just "words" and will default to standard posix regex, which is nearly as fast as a -F plain text query. Removing the cat from the pipe also saves (trivial) effort.)
If you didn't want the context, you could give grep the -o flag. If you want the context called out more clearly, consider --color (or even --color=always if you're piping the output somewhere and don't mind the coloring control characters). You may also find grep's -n flag useful, which will give you line numbers.
I think you can use grep's -a flag to achieve what you're looking for in a single command (this forces everything to be read as text rather than the useless "Binary file test.txt matches" output), though you may not like what the output does to your terminal. Maybe pipe it into a file and then view that file with vim (which, unlike less, won't render control characters):
grep -aP '[^\x00-\x7f]' test.txt > found-highchars
view found-highchars
This may or may not be faster than piping through hd and grep.
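If hd is not installed, the same idea can be sketched with od, which POSIX specifies. This is only an illustration: the function name high_bytes and the sample file are made up here, and unlike the hd pipeline it prints just the offending bytes without the surrounding context.

```shell
# List every byte >= 0x80 in a file, one hex pair per line.
# od -An drops the offset column, -v keeps repeated lines, -tx1 prints one byte per pair.
high_bytes() {
    od -An -v -tx1 "$1" | grep -ow '[89a-f][0-9a-f]'
}

# Hypothetical sample: an e-acute encoded as UTF-8 (bytes c3 a9) between ASCII letters.
sample=$(mktemp)
printf 'abc\nd\303\251f\n' > "$sample"
high_bytes "$sample"
rm -f "$sample"
```

Because the offset column is gone, only real data bytes can match.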

Related

Why does grep match all the lines no matter what the pattern

I'm having a problem using grep.
I have a file http://pastebin.com/HxAcciCa that I want to check for certain patterns. When I search it, grep returns all the lines, provided that the pattern exists somewhere in the file.
To explain more this is the code that I'm running
grep -F "ENVIRO" "$file_pos" >> blah
No matter what else I try even if I provide a whole line as a pattern bash always returns all the lines.
These are variations of what I'm trying:
grep -F "E20" "$file_pos" >> blah
grep E20 "$file_pos" >> blah
grep C:\E20-II\ENVIRO\SSNHapACS480.dll "$file_pos" >> blah
grep -F C:\E20-II\ENVIRO\SSNHapACS480.dll "$file_pos" >> blah
Also for some strange reasons when adding the -x option to grep, it doesn't return any line despite the fact that the exact pattern exists.
I've searched the web and the bash documentation for the cause but couldn't find anything.
My final test was the following
grep -F -C 1 "E20" "$store_pos" >> blah #store_pos has the same value as $file_pos
I thought maybe it was printing the lines after the result but that was not the case.
I was using the blah file to see the output.
Also I'm using Linux mint rebecca.
Finally, although the naming is quite familiar, this question is not similar to Why does grep match all lines for the pattern "\'"
And finally I would like to say that I am new to bash.
I suspect the error might be due to the main file http://pastebin.com/HxAcciCa rather than the code?
From the comments, it appears that the file has carriage returns delimiting the lines, rather than the linefeeds that grep expects; as a result, grep sees the file as one huge line, that either matches or fails to match as a whole.
(Note: there are at least three different conventions about how to delimit the lines in a "plain text" file -- unix uses newline (\n), DOS/Windows uses carriage return followed by newline (\r\n), and pre-OSX versions of MacOS used just carriage return (\r).)
I'm not clear on how your file wound up in this format, but you can fix it easily with:
tr '\r' '\n' <badfile >goodfile
or on the fly with:
tr '\r' '\n' <badfile | grep ...
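To see the effect for yourself, here is a small sketch with a made-up three-entry sample (the file contents and variable names are only for illustration). It counts how many lines grep sees before and after the conversion:

```shell
# Build a sample whose "lines" are separated by bare carriage returns.
sample=$(mktemp)
printf 'alpha\rE20 match\rgamma\r' > "$sample"

lines_seen=$(grep -c '' "$sample")                      # lines grep sees in the raw file
lines_fixed=$(tr '\r' '\n' < "$sample" | grep -c '')    # lines after CR -> LF conversion
hits_fixed=$(tr '\r' '\n' < "$sample" | grep -c 'E20')  # matching lines after conversion
rm -f "$sample"

echo "raw: $lines_seen line(s); converted: $lines_fixed line(s), $hits_fixed matching"
```

With the raw file, any hit prints the one and only "line", which is the whole file.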
Check the line endings in your input file: file, wc -l.
Check you are indeed using the correct grep: which grep.
Use > to redirect the output, or | more or | less to not be confused by earlier attempts you are appending to.
Edit: Looks like your file has the wrong line endings (old Mac OS (CR) perhaps). If you have dos2unix you can try to convert them to Unix style line endings (LF).
I don't have access to a PC at the moment, but what could possibly help you troubleshoot:
1. Use grep --color -F to see if it matches correctly.
2. After your statement, use | cat -A to see if there are any surprising control characters. Lines should end in $; other characters such as ^I or ^M can sometimes be a headache.
I suspect number 2, as it seems to be Windows output, in which case cat filename | dos2unix | grep stmt should solve it.
Did you save the dos2unix output as another file?
Just double check the file, it should be similar to this:
[root#pro-mon9001 ~]# cat -A Test.txt
Windows^M$
Style^M$
Files^M$
Are^M$
Hard ^M$
To ^M$
Parse^M$
[root#pro-mon9001 ~]# dos2unix Test.txt
dos2unix: converting file Test.txt to Unix format ...
[root#pro-mon9001 ~]# cat -A Test.txt
Windows$
Style$
Files$
Are$
Hard$
To$
Parse$
Now it should parse properly - so just verify that it did convert the file properly
Good luck!
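If dos2unix is not available, a rough equivalent for Windows-style (CRLF) files can be sketched with grep and tr. The function names below are invented for this example, and note that tr -d '\r' deletes every carriage return, not only those directly before a linefeed, so it is not suitable for old Mac (CR-only) files.

```shell
cr=$(printf '\r')    # a literal carriage return, kept in a variable for portability

# Count the lines that end in CR (what cat -A shows as ^M$).
count_crlf_lines() {
    grep -c "${cr}\$" "$1"
}

# Write the file to stdout with all carriage returns removed.
strip_cr() {
    tr -d '\r' < "$1"
}
```

For example, count_crlf_lines Test.txt should print 0 once the file has been converted.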

Converting LaTeX pmatrix command to amsmath pmatrix environment using sed

I have an old LaTeX document (with a lot of formatting commands) that I want to convert to the more modern LaTeX (I want to do the update for several reasons, not the least of which is to reduce the coupling between content and formatting). At any rate, the document has a lot of calls to the deprecated command \pmatrix{ .... } which I would like to replace with the new amsmath command \begin{pmatrix} ... \end{pmatrix}. I have been trying to use sed to do this conversion but I have never used it before and I am having trouble.
Here is a MWE
LaTeX input string
\pmatrix{0&0\cr \frac{1}{2}&0\cr 0&0\cr}\pmatrix{1&1\cr 1&1\cr 1&1\cr}
with the expected output
\begin{pmatrix}0&0\\ \frac{1}{2}&0\\ 0&0\end{pmatrix}\begin{pmatrix}1&1\\ 1&1\\ 1&1\end{pmatrix}
The commands that I have been trying to use are variants of the following
sed 's/\\pmatrix{\(.*\cr[ ]*\)}/\\begin{pmatrix}\1 \\end{pmatrix}/g' <$WORKING_FILE >$OUTPUT_FILE
but the closest output that I have been able to achieve is
\begin{pmatrix}0 & 0 \\ 0 & 0 \\ 0 & 0 \end{pmatrix}
I am pretty sure that the problem is related to having two calls to pmatrix side by side, but I am not sure how to modify the regex to make this work.
I have searched google, but being so new to regex, I just got confused by all of the variations out there and which to use, and how to properly format such a thing.
The following might work for you:
sed -re 's/(\\pmatrix)\{([^}]*)}/\\begin{pmatrix}\2\\end{pmatrix}/g' -e 's/\\cr/\\\\/g' -e 's/\\\\\\end/\\end/g' inputfile
This works by:
substituting \pmatrix{...} with \begin{pmatrix}...\end{pmatrix}
substituting \cr with \\
handling \\\end to make it \end
EDIT: As per your update, you might be better off splitting the relevant parts using grep before piping to sed:
grep -oP '\\pmatrix.*?\\cr}' inputfile | sed -re 's/\\pmatrix\{(.*)}/\\begin{pmatrix}\1\\end{pmatrix}/g;s/\\cr/\\\\/g;s/\\\\\\end/\\end/g'
This might work for you (GNU sed):
sed -r 's/\\cr/\n/g;s/\\(pmatrix)\{([^\n]*)\n([^\n]*)\n([^\n]*)\n\}/\\begin{\1}\2\\\\ \3\\\\ \4\\end{\1}/g;s/\n/\\cr/g' file
Convert each \cr to a newline, run a global substitution that rewrites each three-row \pmatrix{...}, then convert any newlines that are left back to \cr.
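Another possible sketch avoids matching the nested braces altogether. It rests on an assumption taken from the MWE, not from LaTeX in general: every \pmatrix{...} body ends with \cr} and \cr} appears nowhere else in the file. The function name convert_pmatrix is made up for this example.

```shell
# Reads LaTeX on stdin, writes the converted text to stdout.
# Assumes each \pmatrix{...} ends with "\cr}" (optionally with spaces before the brace).
convert_pmatrix() {
    sed -e 's/\\pmatrix{/\\begin{pmatrix}/g' \
        -e 's/\\cr *}/\\end{pmatrix}/g' \
        -e 's/\\cr/\\\\/g'
}

printf '%s\n' '\pmatrix{0&0\cr \frac{1}{2}&0\cr 0&0\cr}\pmatrix{1&1\cr 1&1\cr 1&1\cr}' |
    convert_pmatrix
```

The order matters: the closing \cr} has to be rewritten before the remaining \cr are turned into \\. Matrices with any number of rows are handled, but a matrix whose last row lacks the trailing \cr would be left with an unmatched brace.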

Complex changes to a URL with sed

I am trying to parse an RSS feed on the Linux command line which involves formatting the raw output from the feed with sed.
I currently use this command:
feedstail -u http://www.heise.de/newsticker/heise-atom.xml -r -i 60 -f "{published}> {title} {link}" | sed 's/^\(.\{3\}\)\(.\{13\}\)\(.\{6\}\)\(.\{3\}\)\(.*\)/\1\3\5/'
This gives me a number of feed items per line that look like this:
Sat 20:33 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/newsticker/meldung/WhatsApp-Ausfall-Server-Probleme-blockieren-Messaging-Dienst-2121664.html/from/atom10?wt_mc=rss.ho.beitrag.atom
Notice the long URL at the end. I want to shorten this to better fit on the command line. Therefore, I want to change my sed command to produce the following:
Sat 20:33 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/-2121664
That means cutting everything out of the URL except a dash and that seven digit number preceding the ".html/blablabla" bit.
Currently my sed command only changes stuff in the date bit. It would have to leave the title and start of the URL alone and then cut stuff out of it until it reaches the seven digit number. It needs to preserve that and then cut everything after it out. Oh yeah, and we need to leave a dash right in front of that number too.
I have no idea how to do that and can't find the answer after hours of googling. Help?
EDIT:
This is the raw output of a line of feedstail -u http://www.heise.de/newsticker/heise-atom.xml -r -i 60 -f "{published}> {title} {link}", in case it helps:
Sat, 22 Feb 2014 20:33:00 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/newsticker/meldung/WhatsApp-Ausfall-Server-Probleme-blockieren-Messaging-Dienst-2121664.html/from/atom10?wt_mc=rss.ho.beitrag.atom
EDIT 2:
It seems I can only pipe that output into one command. Piping it through multiple ones seems to break things. I don't understand why ATM.
Unfortunately (for me), I could only think of solving this with extended regexp syntax (either -E or -r flag on different systems):
... | sed -E 's|(://[^/]+/).*(-[0-9]+)\.html/.*|\1\2|'
UPDATE: In basic regexp syntax, the best I can do is
... | sed 's|\(://[^/]*/\).*\(-[0-9][0-9]*\)\.html/.*|\1\2|'
The key to writing this sort of regular expression is to be very careful about what the boundaries of what you expect are, so as to avoid the random gunk that you want to get rid of causing you problems. Also, you should bear in mind that you can use characters other than / as part of a s operation's delimiters.
sed 's!\(http://www\.heise\.de/\)newsticker/meldung/[^./]*\(-[0-9][0-9]*\)\.html[^ ]*!\1\2!'
Be aware that getting the RE right can be quite tricky; assume you'll need to test it! (This is a key part of the “now you have two problems” quote; REs very easily become horrendous.)
Something like this maybe?
... | awk -F'[^0-9]*' '{print "http://www.heise.de/-"$2}'
This might work for you (GNU sed):
sed 's|\(//[^/]*/\).*\(-[0-9]\{7\}\).*|\1\2|' file
You can place the first sed command so:
feedstail -u http://www.heise.de/newsticker/heise-atom.xml -r -i 60 -f "{published}> {title} {link}" |
sed 's/^\(.\{3\}\)\(.\{13\}\)\(.\{6\}\)\(.\{3\}\)\(.*\)/\1\3\5/;s|\(//[^/]*/\).*\(-[0-9]\{7\}\).*|\1\2|'
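To try the combined expression without a live feed, it can be wrapped in a function and fed the raw sample line from the question. The function name shorten_feed_line is made up for this sketch; the two substitutions are the ones shown above.

```shell
# Reads raw feedstail lines on stdin; trims the date and shortens the URL.
shorten_feed_line() {
    sed -e 's/^\(.\{3\}\)\(.\{13\}\)\(.\{6\}\)\(.\{3\}\)\(.*\)/\1\3\5/' \
        -e 's|\(//[^/]*/\).*\(-[0-9]\{7\}\).*|\1\2|'
}

printf '%s\n' 'Sat, 22 Feb 2014 20:33:00 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/newsticker/meldung/WhatsApp-Ausfall-Server-Probleme-blockieren-Messaging-Dienst-2121664.html/from/atom10?wt_mc=rss.ho.beitrag.atom' |
    shorten_feed_line
```

The second expression relies on the article number being exactly seven digits, so it would need adjusting if heise ever changes that.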

Unpredictable behavior in sed interpreters output from multiple expressions

Why does GNU sed sometimes handle substitution with piped output into another sed instance differently than when multiple expressions are used with the same one?
Specifically, for msys/mingw sessions, in the /etc/profile script I have a series of manipulations that "rearrange" the order of the environment variable PATH and removes duplicate entries.
Take note that while normally sed treats each line of input separately (and therefore can't easily substitute '\n' in the input stream), this sed statement does a substitution of ':' with '\n', so it still handles the entire input stream like one line (with '\n' characters in it). This behavior stays true for all sed expressions in the same instance of sed (basically until you redirect or pipe the output into another program).
Here's the obligatory specs:
Windows 7 Professional Service Pack 1
HP Pavilion dv7-6b78us
16 GB DDR3 RAM
MinGW-w64 (x86_64-w64-mingw32-gcc-4.7.1.2-release-win64-rubenvb) mounted on /mingw/
MSYS (20111123) mounted on / and on /usr/
$ uname -a="MINGW32_NT-6.1 CHRIV-L09 1.0.17(0.48/3/2) 2011-04-24 23:39 i686 Msys"
$ which sed="/bin/sed.exe" (it's part of MSYS)
$ sed --version="GNU sed version 4.2.1"
This is the contents of PATH before manipulation:
PATH='.:/usr/local/bin:/mingw/bin:/bin:/c/PHP:/c/Program Files (x86)/HP SimplePass 2011/x64:/c/Program Files (x86)/HP SimplePass 2011:/c/Windows/system32:/c/Windows:/c/Windows/System32/Wbem:/c/Windows/System32/WindowsPowerShell/v1.0:/c/si:/c/android-sdk:/c/android-sdk/tools:/c/android-sdk/platform-tools:/c/Program Files (x86)/WinMerge:/c/ntp/bin:/c/GnuWin32/bin:/c/Program Files/MySQL/MySQL Server5.5/bin:/c/Program Files (x86)/WinSCP:/c/Program Files (x86)/Overlook Fing 2.1/bin:/c/Program Files/7-zip:.:/c/Program Files/TortoiseGit/bin:/c/Program Files (x86)/Git/bin:/c/VS10/VC/bin/x86_amd64:/c/VS10/VC/bin/amd64:/c/VS10/VC/bin'
This is an excerpt of /etc/profile (where I have begun the PATH manipulation):
set | grep --color=never ^PATH= | sed -e "s#^PATH=##" -e "s#'##g" \
-e "s/:/\n/g" -e "s#\n\(/[^\n]*tortoisegit[^\n]*\)#\nZ95-\1#ig" \
-e "s#\n\(/[a-z]/win\)#\nZ90-\1#ig" -e "s#\n\(/[a-z]/p\)#\nZ70-\1#ig" \
-e "s#\.\n#A10-.\n#g" -e "s#\n\(/usr/local/bin\)#\nA15-\1#ig" \
-e "s#\n\(/bin\)#\nA20-\1#ig" -e "s#\n\(/mingw/bin\)#\nA25-\1#ig" \
-e "s#\n\(/[a-z]/vs10/vc/bin\)#\nA40-\1#ig"
The last sed expression in that line basically looks for lines that begin with "/c/VS10/VC/bin" and prepends them with 'A40-' like this:
...
/c/si
A40-/c/VS10/VC/bin
A40-/c/VS10/VC/bin/amd64
A40-/c/VS10/VC/bin/x86_amd64
/c/GnuWin32/bin
...
I like my sed expressions to be flexible (path structures change), but I don't want it to match the lines that end with amd64 or x86_amd64 (those are going to have a different string prepended). So I change the last expression to:
-e "s#\n\(/[a-z]/vs10/vc/bin\)\n#\nA40-\1\n#ig"
This works:
...
/c/si
A40-/c/VS10/VC/bin
/c/VS10/VC/bin/amd64
/c/VS10/VC/bin/x86_amd64
/c/GnuWin32/bin
...
Then, (to match any "line" matching the pseudocode "/x/.../bin") I change the last expression to:
-e "s#\n\(/[a-z]/.*/bin\)\n#\nA40-\1\n#ig"
Which produces:
...
/c/si
/c/VS10/VC/bin
/c/VS10/VC/bin/amd64
/c/VS10/VC/bin/x86_amd64
/c/GnuWin32/bin
...
??? - sed didn't match any character ('.') any number of times ('*') in the middle of the line ???
But, if I pipe the output into a different instance of sed (and compensate for sed handling each "line" separately) like this:
| sed -e "s#^\(/[a-z]/.*/bin\)$#A40-\1#ig"
I get:
sed: -e expression #1, char 30: unterminated `s' command
??? How is that unterminated? It's got all three '#' characters after the s, has the modifiers 'i' and 'g' after the third '#', and the entire expression is in double quotes ('"'). Also, there are no escapes ('\') immediately preceding the delimiters, and the delimiter is not a part of either the search or the replacement. Let's try a different delimiter than '#', like '~':
I use:
| sed -e "s~^\(/[a-z]/.*/bin\)$~A40-\1~ig"
and, I get:
...
/c/si
A40-/c/VS10/VC/bin
/c/VS10/VC/bin/amd64
/c/VS10/VC/bin/x86_amd64
A40-/c/GnuWin32/bin
...
And, that is correct! The only thing I changed was the delimiter from '#' to '~' and it worked ???
This is not (even close to) the first time that sed has produced unexplainable results for me.
Why, oh, why, is sed NOT matching syntax in an expression in the same instance, but IS matching when piped into another instance of sed?
And, why, oh, why, do I have to use a different delimiter when I do this (in order not to get an "unterminated 's' command")?
And the real reason I'm asking: Is this a bug in sed, OR, is it correct behavior that I don't understand (and if so, can someone explain why this behavior is correct)? I want to know if I'm doing it wrong, or if I need a different/better tool (or both, they don't have to be mutually exclusive).
I'll mark a response as the answer if someone can either prove why this behavior is correct or prove why it is a bug. I'll gladly accept any advice about other tools or different methods of using sed, but those won't answer the question.
I'm going to have to get better at other text processors (like awk, tr, etc.) because sed is costing me too much time with its unexplainable results.
P.S. This is not the complete logic of my PATH manipulation. The complete logic also finishes prepending all the lines with values from 'A00-' to 'Z99-', then pipes that output into 'sort -u -f' and back into sed to remove those same prefixes on each line and to convert the lines ('\n') back into colons (':'). Then "export PATH='" is prepended to the single line and "'" is appended to it. Then that output is redirected into a temporary file. Next, that temporary file is sourced. And, finally, that temporary file is removed.
The /etc/profile script also displays the contents of PATH before and after sorting (in case it screwed up the path).
P.P.S. I'm sure there is a much better way to do this. It started as some very simple sed manipulations, and grew into the monster you see here. Even if there is a better way, I still need to know why sed is giving me these results.
sed -e "s#^\(/[a-z]/.*/bin\)$#A40-\1#ig"
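is unterminated because the shell expands "$#" inside double quotes to the number of positional parameters (usually 0), so sed only ever sees two of the three # delimiters. Put your expressions in single quotes to avoid this.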
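You can watch the shell do this by storing the expression in a variable both ways and printing it. The variable names are only for illustration, and the i flag on s is a GNU sed extension, as in the question.

```shell
# Same text, quoted two ways. Inside double quotes the shell replaces $#
# with the number of positional parameters before sed ever runs.
double="s#^\(/[a-z]/.*/bin\)$#A40-\1#ig"
single='s#^\(/[a-z]/.*/bin\)$#A40-\1#ig'

printf 'double-quoted: %s\n' "$double"
printf 'single-quoted: %s\n' "$single"

# The single-quoted form reaches sed intact and tags only the exact .../bin entry.
tagged=$(printf '%s\n' /c/VS10/VC/bin /c/VS10/VC/bin/amd64 | sed -e "$single")
printf '%s\n' "$tagged"
```

In the double-quoted version the "$#" is gone, taking the middle delimiter with it, which is exactly what the "unterminated `s' command" message is complaining about.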
The expression
-e "s#\n\(/[a-z]/.*/bin\)\n#\nA40-\1\n#ig"
fails, or doesn't do what you expect, because . matches the newline in a multi-line expression. Check your whole output, the A40- is at the very beginning. Change it to
-e "s#\n\(/[a-z]/[^\n]*/bin\)\n#\nA40-\1\n#ig"
and it might be more what you expect. This may very well be the case with most of your issues with multi-line modifications.
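A cut-down sketch of the difference, using a shortened, made-up PATH and GNU sed (needed for \n in the replacement, [^\n] in the bracket expression, and the i flag):

```shell
path='/c/si:/c/VS10/VC/bin:/c/VS10/VC/bin/amd64:/c/GnuWin32/bin:/c/end'

# ".*" may cross the embedded newlines, so one match swallows several entries
# and only the first of them receives the prefix.
greedy=$(printf '%s\n' "$path" |
    sed -e 's/:/\n/g' -e 's#\n\(/[a-z]/.*/bin\)\n#\nA40-\1\n#ig')

# "[^\n]*" cannot cross a newline, so each entry is matched on its own.
fixed=$(printf '%s\n' "$path" |
    sed -e 's/:/\n/g' -e 's#\n\(/[a-z]/[^\n]*/bin\)\n#\nA40-\1\n#ig')

printf 'greedy:\n%s\n\nfixed:\n%s\n' "$greedy" "$fixed"
```

One caveat with this style: each match consumes the newline after the entry, so of two matching entries that sit directly next to each other only the first would be tagged in a single pass.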
You can also put the statements, one per line, into a standalone file and invoke sed with sed -f editscript. It might make maintenance of this a bit easier.
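As a rough idea of what that could look like (the file name editscript.sed and the shortened PATH are just placeholders, and GNU sed is assumed):

```shell
# One sed command per line; lines starting with # are comments.
# The quoted 'EOF' keeps the shell from touching $ or backslashes in the script.
cat > editscript.sed <<'EOF'
# split the PATH entries onto separate "lines" within the pattern space
s/:/\n/g
# tag entries of the form /x/.../bin
s#\n\(/[a-z]/[^\n]*/bin\)\n#\nA40-\1\n#Ig
EOF

tagged=$(printf '%s\n' '/c/si:/c/VS10/VC/bin:/c/end' | sed -f editscript.sed)
printf '%s\n' "$tagged"
rm -f editscript.sed
```

A side benefit is that nothing in the script file passes through shell quoting, so the "$#" problem cannot occur there.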

Extracting string after matched pattern in Shell

How do I extract whatever string comes after a matched pattern in a shell script? I know how to do this in Perl, but I don't know how in shell scripting.
Following is the example,
Subject_01: This is a sample subject and this may vary
I have to extract whatever string that follows "Subject_01:"
Any help please.
It depends on your shell.
If you're using a POSIX-style Bourne shell (sh, dash) or bash or (I believe) pdksh, then you can do fancy stuff like this:
$ string="Subject_01: This is a sample subject and this may vary"
$ output="${string#*: }"
$ echo $output
This is a sample subject and this may vary
$
Note that this is pretty limited in terms of format. The line above requires that you have ONE space after your colon. If you have more, it will pad the beginning of $output.
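If the amount of whitespace after the colon varies, one way around that limitation is to strip in two steps, still using only parameter expansion. This is a sketch with made-up variable names; it should work in any POSIX-style shell.

```shell
string="Subject_01:    This is a sample subject and this may vary"

rest=${string#*:}                        # drop everything through the first colon
rest=${rest#"${rest%%[![:space:]]*}"}    # drop whatever leading whitespace remains

printf '%s\n' "$rest"
```

The inner expansion ${rest%%[![:space:]]*} isolates the run of leading whitespace, and the outer one removes exactly that prefix.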
If you're using some other shell, you may have to do something like this, with the cut command:
> setenv string "Subject_01: This is a sample subject and this may vary"
> setenv output "`echo '$string' | cut -d: -f2`"
> echo $output
This is a sample subject and this may vary
> setenv output "`echo '$string' | sed 's/^[^:]*: *//'`"
> echo $output
This is a sample subject and this may vary
>
The first example uses cut, which is very small and simple. The second example uses sed, which can do far more, but is a (very) little heavier in terms of CPU.
YMMV. There's probably a better way to handle this in csh (my second example uses tcsh), but I do most of my shell programming in Bourne.