How to remove non-ascii chars using sed

I want to remove non-ASCII characters from a file. I have already tried these regexes:
sed -e 's/[\d00-\d128]//g' # not working
cat /bin/mkdir | sed -e 's/[\x00-\x7F]//g' >/tmp/aa
but the output file still contains non-ASCII characters:
[root@asssdsada ~]$ hexdump /tmp/aa |more
00 01 02 03 04 05 06 07 - 08 09 0A 0B 0C 0D 0E 0F 0123456789ABCDEF
00000000 45 4C 46 B0 F0 73 38 C0 - C0 BC BC FF FF 61 61 61 ELF..s8......aaa
00000010 A0 A0 50 E5 74 64 50 57 - 50 57 50 57 D4 D4 51 E5 ..P.tdPWPWPW..Q.
00000020 74 64 6C 69 62 36 34 6C - 64 6C 69 6E 75 78 78 38 tdlib64ldlinuxx8
00000030 36 36 34 73 6F 32 47 4E - 55 42 C8 C0 80 70 69 42 664so2GNUB...piB
00000040 44 47 BA E3 92 43 45 D5 - EC 46 E4 DE D8 71 58 B9 DG...CE..F...qX.
00000050 8D F1 EA D3 EF 4B 86 FC - A9 DA 79 ED 63 B5 51 92 .....K....y.c.Q.
00000060 BA 6C FC D1 69 78 30 ED - 74 F1 73 95 CC 85 D2 46 .l..ix0.t.s....F
00000070 A5 B4 6C 67 DA 4A E9 9A - 4B 58 77 A4 37 80 C0 4F ..lg.J..KXw.7..O
00000080 F3 E9 B2 77 65 97 74 F9 - A2 C0 F2 CC 4A 9C 58 A1 ...we.t.....J.X.

This doesn't seem to work with sed. Perhaps tr will do?
tr -d '\200-\377'
Or with the complement:
tr -cd '\000-\177'
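As a rough check of the tr approach, the sketch below pushes a few bytes through the complement form. The function name strip_non_ascii is only illustrative, and LC_ALL=C is an assumption added here so that tr works on single bytes whatever the current locale is.

```shell
# Delete every byte outside the 7-bit ASCII range (octal 000-177).
# LC_ALL=C is an assumption: it makes tr treat the input as single bytes.
strip_non_ascii() {
    LC_ALL=C tr -cd '\000-\177'
}

# 'caf\303\251' is "café" encoded as UTF-8; the two high bytes are dropped.
printf 'caf\303\251 ok\n' | strip_non_ascii
```

For a file, the same function can be used as strip_non_ascii < /bin/mkdir > /tmp/aa.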

Did you try
cat /bin/mkdir | tr -cd "[:print:]"
I think that solves the problem.
If only the text content interests you, you can also use
cat /bin/mkdir | strings
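One thing to keep in mind with the [:print:] class is that newline is not a printable character, so the command above also joins all lines together. A minimal sketch that keeps line breaks, assuming the C locale and using an illustrative function name:

```shell
# Keep printable characters plus newlines; everything else is deleted.
# LC_ALL=C is an assumption so that "printable" means printable ASCII.
keep_printable_lines() {
    LC_ALL=C tr -cd '[:print:]\n'
}

# \001 is a control byte and \200 is a non-ASCII byte; both are removed.
printf 'a\001b\nc\200d\n' | keep_printable_lines
```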

Do you know what encoding the file is currently using? If so, you can use iconv to convert it. It's a utility to convert from one character encoding to another. So if the original file is in UTF-8 and you want to convert to ASCII you can use the following:
iconv -f utf8 -t ascii <inputfile>
The file command on the input file might tell you the current encoding.
Interestingly, there's a command called enca which will do its best to determine the character encoding being used if you know the language of the contents of the file.
This other question might be the answer.

The solutions offered here did not work for me. Maybe my problem was different, but I needed to strip the ANSI color codes and other escape characters from otherwise pure ASCII text.
The following worked for me, however:
Stripping Escape Codes from ASCII Text
sed -E 's/\x1b\[[0-9]*;?[0-9]+m//g'
In context (BASH):
$ printf "\e[32;1mhello\e[0m\n"
hello
$ printf "\e[32;1mhello\e[0m\n" | cat -vet
^[[32;1mhello^[[0m$
$ printf "\e[32;1mhello\e[0m\n" | sed -E 's/\x1b\[[0-9]*;?[0-9]+m//g' | cat -vet
hello$
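The pattern above handles at most two numeric parameters. As a hedged variation, the sketch below accepts any number of semicolon-separated parameters and builds the ESC byte with printf rather than relying on \x1b support in sed. The names esc and strip_sgr are made up for this example, and only color/style sequences ending in "m" are covered.

```shell
# Build a literal ESC byte portably instead of relying on \x1b in sed.
esc=$(printf '\033')

# Remove sequences of the form ESC [ <digits and semicolons> m.
strip_sgr() {
    sed "s/${esc}\[[0-9;]*m//g"
}

# Three parameters (1;32;40), which the two-parameter pattern would miss.
printf '\033[1;32;40mhello\033[0m\n' | strip_sgr
```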

Try sed with the -i option, e.g.
sed -i 's/[\d128-\d255]//g' MYFILE.txt
It will remove all non-ASCII characters from the file, editing it in place.
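Whether a byte range like this works tends to depend on the locale. The sketch below is one way to try it without touching a file; it assumes GNU sed (for the \xHH escapes) and forces the C locale, neither of which is stated in the answer above.

```shell
# Assumes GNU sed, which understands \xHH escapes inside bracket expressions.
# LC_ALL=C makes sed treat each byte as one character.
strip_high_bytes() {
    LC_ALL=C sed 's/[\x80-\xFF]//g'
}

# \351 is the single byte 0xE9 and is removed from the output.
printf 'caf\351 ok\n' | strip_high_bytes
```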

Related

How to find angle bracket literals with grep

In the following example, the second match "<_dl_start_user>" was unexpected:
$ objdump -D /lib64/ld-linux-x86-64.so.2|grep -A5 '<_start>:'
0000003ba0400b30 <_start>:
3ba0400b30: 48 89 e7 mov %rsp,%rdi
3ba0400b33: e8 28 06 00 00 callq 3ba0401160 <_dl_start>
0000003ba0400b38 <_dl_start_user>:
3ba0400b38: 49 89 c4 mov %rax,%r12
How can I match exactly '<_start>:'?

You are matching <_start>: exactly. You're also seeing 5 lines of trailing context after the match because you specified -A5.
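To see the effect of the context option in isolation, here is a small sketch on a made-up four-line listing (the listing variable is hypothetical and only shaped like the objdump output):

```shell
# Hypothetical two-label listing, shaped like the objdump output above.
listing='<_start>:
mov %rsp,%rdi
<_dl_start_user>:
mov %rax,%r12'

# Without -A, only the matching line itself is printed.
printf '%s\n' "$listing" | grep '<_start>:'

# With -A2, two lines of trailing context follow, including the second label.
printf '%s\n' "$listing" | grep -A2 '<_start>:'
```

The second label shows up only as context, not as a match.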

grep -v pattern and also remove 1 line before and 4 lines after [duplicate]

I would like to grep a pattern and remove the line of the matching pattern and also 1 line before and 4 lines after the context. I tried:
grep -v -A 4 -B 1
Thanks in advance!
Example:
Rule: r1
Owner: Process explorer.exe Pid 1544
0x01ec350f 8b 45 a8 0f b6 00 8d 4d a8 ff 14 85 c8 7f ed 01 .E.....M........
0x01ec351f 84 c0 75 ec 8b 4d fc e8 ba f5 fe ff f7 85 b0 fd ..u..M..........
0x01ec352f ff ff 00 00 01 00 75 13 33 c0 50 50 50 68 48 28 ......u.3.PPPhH(
0x01ec353f eb 01 33 d2 8b cb e8 b0 57 ff ff f7 05 8c 9b ed ..3.....W.......
I would like to grep "explorer.exe" and remove the line and also 1 line before and 4 lines after.

awk
this awk one-liner would help:
awk 'NR==FNR{if(/explorer[.]exe/)d[++i]=NR;next}
{for(x=1;x<=i;x++)if(FNR>=d[x]-1&&FNR<=d[x]+4)next}7' file file
see this example:
kent$ cat f
foo
foo2
Rule: r1
Owner: Process explorer.exe Pid 1544
remove1
remove2
remove3
remove4
bar
bar2
kent$ awk 'NR==FNR{if(/explorer[.]exe/)d[++i]=NR;next}{for(x=1;x<=i;x++)if(FNR>=d[x]-1&&FNR<=d[x]+4)next}7' f f
foo
foo2
bar
bar2
vim
If vim is also an option for you, it could be a lot easier:
:g/Pattern/norm! k6dd
Note that the vim solution will misbehave on the first match if your pattern is on the first line of the file.

SED grabbing special characters

I’m trying to fix an encoding error in an archived html page. My problem is that sed is behaving strangely, as it doesn't catch special characters in the data. I tried both with and without -r switch.
My data is the following:
Budapesti ??p?t?©szeti Filmnapok k??l??nkiad??s
The sed command:
sed -i.bak 's|Budapesti.*|REPLACE|g' index.html
and the result I get without recode:
REPLACE�t?�szeti Filmnapok k??l??nkiad??s
The result I'm expecting is:
REPLACE
It seems to be related to the encoding somehow. If I do recode iso-8859-2 index.html first, sed works fine and gets me the expected output.
Here are the hex bytes for the i ??p?t?Šs part before recode:
69 20 3F 3F 70 3F AD 74 3F A9 73
and after recode:
69 20 3F 3F 70 3F C2 AD 74 3F C5 A0 73
BTW, this is what I get without recode:
REPLACEt?Šs
52 45 50 4C 41 43 45 AD 74 3F A9 73
I'm using the latest gsed (GNU sed) 4.2.2.

LANG=C.ISO-8859-2 sed -i.bak 's|Budapesti.*|REPLACE|g' index.html
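A likely explanation, which the answer does not spell out, is that in a UTF-8 locale sed's "." does not match bytes that fail to form a valid character, so ".*" stops at the first stray byte. The sketch below reproduces the bytes from the hex dump and uses plain LC_ALL=C, which is an assumption of this example rather than the exact setting from the answer; the helper name sample is also made up.

```shell
# \255 and \251 are the stray bytes 0xAD and 0xA9 from the hex dump above.
sample() {
    printf 'Budapesti ??p?\255t?\251szeti Filmnapok\n'
}

# In the C locale every byte is one character, so ".*" runs to end of line.
sample | LC_ALL=C sed 's|Budapesti.*|REPLACE|g'
```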
Cygwin terminal not displaying certain characters?

Finding an integer after a leading t=

I have a string like this:
33 00 4b 46 ff ff 03 10 30 t=25562
I am only interested in the five digits at the very end, after the t=.
How can I extract this number with a regular expression?
I tried grep t=..... but the output also included the t= at the beginning, which I would like to drop.
After finding that five-digit number, I would like to divide it by 1000, so in the case above the result would be 25.562. Is this possible with grep and regular expressions?
Thanks for your help.

Using awk
echo '33 00 4b 46 ff ff 03 10 30 t=25562' | awk -F= '{print $2/1000}'
Output:
25.562
EDIT
As pointed out by @anubhava in a comment, the above assumes that = does not appear anywhere before t=. If that's not the case,
echo '33 00 4b 46 ff ff 03 10 30 t=25562' | awk -F' t=' '{print $2/1000}'

This should be OK. It only takes the value from the t= at the end of the line. The OP also mentioned that t= may appear as extra text in the middle of the line, but only the one at the end is wanted.
echo '33 00 4b 46 ff ff t=22 03 10 30 t=25562' | awk '{split($NF,a,"=");print a[2]/1000}'
25.562
Another variation
echo '33 00 4b 46 ff ff t=22 03 10 30 t=25562' | awk 'END {print $0/1000}' RS==
25.562
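Since the question asked for a regular expression, here is one more sketch that anchors the match to the end of the line with sed and leaves the division to awk. The function name extract_t is only illustrative.

```shell
# Capture the digits after the last "t=" at end of line, then divide by 1000.
extract_t() {
    sed -n 's/.*t=\([0-9][0-9]*\)$/\1/p' | awk '{ print $1 / 1000 }'
}

# The earlier "t=22" is ignored because it is not at the end of the line.
echo '33 00 4b 46 ff ff t=22 03 10 30 t=25562' | extract_t
```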

Convert hex stream to GIF

How do I create a .gif file from the following HEX stream:
0d 0a 0d 0a 47 49 46 38 39 61 01 00 01 00 80 ff 00 ff ff ff 00 00 00 2c 00 00 00 00 01 00 01 00 00 02 02 44 01 00 3b
3b is the GIF file terminator
I'm trying to do it following the guide at http://en.wikipedia.org/wiki/Graphics_Interchange_Format#Example_GIF_file
I'd like to implement this in either Perl or C/C++. Any language will do though.
Many thanks in advance,
Thanks guys for all the replies. I removed the leading '0d 0a 0d 0a'...
Here's what I have so far:
#!/usr/bin/perl
use strict;
use warnings;
open(IN,"<test.txt");
open(OUT,">test.gif");
my @lines=<IN>;
foreach my $line (@lines){
$line=~s/\n//g;
my @bytes=split(/ /,$line);
foreach my $byte (@bytes){
print OUT $byte;
}
}
close(OUT);
close(IN);

You can do it in the shell, using GNU echo:
$ /bin/echo -n -e "\x47\x49\x46\x38\x39\x61\x01\x00\x01\x00\x80\xff\x00\xff\xff\xff\x00\x00\x00\x2c\x00\x00\x00\x00\x01\x00\x01\x00\x00\x02\x02\x44\x01\x00\x3b" > foo.gif
$ identify foo.gif
foo.gif GIF 1x1 1x1+0+0 8-bit PseudoClass 2c 36B 0.000u 0:00.000
You can also use the xxd command which will "make a hexdump or do the reverse". Annoyingly, however, it seems very picky about the input format. Here's an example using xxd:
$ cat > mygif.hex <<END
0000000: 4749 4638 3961 0100 0100 80ff 00ff ffff
0000010: 0000 002c 0000 0000 0100 0100 0002 0244
0000020: 0100 3b0a
END
$ xxd -r < mygif.hex > mygif.gif
gvim has an interface to xxd. Use the "Tools → Convert To Hex" menu option (keyboard: :%!xxd) and then "Tools → Convert Back" (:%!xxd -r).
Emacs also has a built-in hex editor, which is accessed by M-x hexl-mode (see Editing Binary Files in the manual). It's also a little bit annoying, because you have to type C-M-x (i.e. Ctrl-Meta-X) before entering a character by its hex code.
Of course, it is very easy to write a simple C program to do the conversion:
#include <stdio.h>
int main(int argc, char **argv) {
unsigned int c;
while (1 == scanf("%x", &c))
putchar(c);
return 0;
}
usage:
$ gcc -Wall unhexify.c -o unhexify
$ echo "47 49 46 38 39 61 01 00 01 00 80 ff
00 ff ff ff 00 00 00 2c 00 00 00 00
01 00 01 00 00 02 02 44 01 00 3b" | ./unhexify > mygif.gif
Also: many answers here in this code golf question.

Open a new file for writing (in binary mode) with the .gif extension.
Read each pair of hex characters.
Convert the hex to a byte (char) value.
Write the byte to the opened file.
When finished, close the file.
If the hex data represents a GIF image, the file should contain it.
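The steps above can be sketched in plain shell as follows. This is only one possible reading of them: the function name unhex is made up, the input is assumed to be whitespace-separated two-digit hex pairs, and the shell redirect plays the role of opening and closing the output file.

```shell
# Convert whitespace-separated hex pairs on stdin to raw bytes on stdout.
# Each pair becomes a three-digit octal escape, which printf emits as a byte.
unhex() {
    for pair in $(cat); do
        printf "\\$(printf '%03o' "0x$pair")"
    done
}

# First six bytes of the stream from the question (the GIF signature).
echo '47 49 46 38 39 61' | unhex
echo
```

With the hex stream saved in test.txt, usage would be unhex < test.txt > test.gif.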

perl -ne'
BEGIN { binmode STDOUT }
s/\s//g;
print pack "H*", $_;
' file.hex > file.gif
Perl 5.14:
perl -ne'
BEGIN { binmode STDOUT }
print pack "H*", s/\s//gr;
' file.hex > file.gif
(-n and print can be replaced with -p and $_ = if you want to golf, i.e. shorten the program.)

You may want to read the documentation for unpack (or hex) and pack. You may also find the perlio documentation useful for creating a raw file handle (so perl doesn't try to help you with things like encodings or line endings).
