remove non latin-1 characters in a text file [duplicate]

I want to remove all the non-ASCII characters from a file in place.
I found one solution with tr, but I guess I need to write back that file after modification.
I need to do it in place with relatively good performance.
Any suggestions?

A perl oneliner would do: perl -i.bak -pe 's/[^[:ascii:]]//g' <your file>
-i says that the file is going to be edited in place, and that a backup is going to be saved with the extension .bak.

# -i (inplace)
sed -i 's/[\d128-\d255]//g' FILENAME

I tried all the solutions and nothing worked. The following, however, does:
tr -cd '\11\12\15\40-\176'
Which I found here:
https://alvinalexander.com/blog/post/linux-unix/how-remove-non-printable-ascii-characters-file-unix
My problem needed it in a series of piped programs, not directly from a file, so modify as needed.
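
Since tr only reads standard input and cannot edit a file in place, one way to apply the command above to a file is to write to a temporary file and then copy the result back. This is only a minimal sketch, assuming mktemp is available; the function name strip_non_ascii is made up for illustration.

```shell
# strip_non_ascii FILE  (hypothetical helper name, not from the answer above)
# Keeps tab, LF, CR and printable ASCII (octal 40-176); deletes every other byte.
strip_non_ascii() {
    file=$1
    tmp=$(mktemp) || return 1
    # tr cannot edit in place, so the filtered bytes go to a temp file first;
    # LC_ALL=C makes tr treat the input as single bytes
    LC_ALL=C tr -cd '\11\12\15\40-\176' < "$file" > "$tmp" &&
        cat "$tmp" > "$file"    # copy back so FILE keeps its inode and permissions
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

Usage would be strip_non_ascii FILENAME. Unlike perl -i.bak, this sketch keeps no backup of the original.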

sed -i 's/[^[:print:]]//' FILENAME
Also, this acts like dos2unix

Try tr instead of sed
tr -cd '[:print:]' < file.txt

# -i (inplace)
LANG=C sed -i -E "s|[\d128-\d255]||g" /path/to/file(s)
The LANG=C part's role is to avoid an "Invalid collation character" error.
Based on Ivan's answer and Patrick's comment.
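
To illustrate why the locale matters: in a UTF-8 locale sed sees "é" as one multibyte character, whereas in the C locale every byte is its own character, so a byte range is a valid bracket expression. The sketch below is an assumption-laden variant of the command above: it uses GNU sed's \xHH escapes instead of \dNNN (both are GNU extensions) and LC_ALL instead of LANG, because LC_ALL overrides the other locale variables. The sample text is invented.

```shell
# "é" is the two bytes 0xC3 0xA9 in UTF-8; octal escapes keep this script pure ASCII
sample=$(mktemp)
printf 'caf\303\251 au lait\n' > "$sample"

# In the C locale each byte is one character, so the range 0x80-0xFF is accepted
LC_ALL=C sed -i 's/[\x80-\xFF]//g' "$sample"

cleaned=$(cat "$sample")
rm -f "$sample"
printf '%s\n' "$cleaned"    # prints: caf au lait
```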

This worked for me:
sed -i 's/[^[:print:]]//g'

I'm using a very minimal busybox system, in which there is no support for ranges in tr or POSIX character classes, so I have to do it the crappy old-fashioned way. Here's the solution with sed, stripping ALL non-printable non-ASCII characters from the file:
sed -i 's/[^a-zA-Z 0-9`~!@#$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' FILE

As an alternative to sed or perl, you may consider using ed(1) and POSIX character classes.
Note: ed(1) reads the entire file into memory to edit it in place, so for really large files you should use sed -i ... or perl -i ...
# see:
# - http://wiki.bash-hackers.org/doku.php?id=howto:edit-ed
# - http://en.wikipedia.org/wiki/Regular_expression#POSIX_character_classes
# test
echo $'aaa \177 bbb \200 \214 ccc \254 ddd\r\n' > testfile
ed -s testfile <<< $',l'
ed -s testfile <<< $'H\ng/[^[:graph:][:space:][:cntrl:]]/s///g\nwq'
ed -s testfile <<< $',l'

awk '{ sub("[^a-zA-Z0-9\"!@#$%^&*|_\[](){}", ""); print }' MYinputfile.txt > pipe_out_to_CONVERTED_FILE.txt

I appreciate the tips I found on this site.
But, on my Windows 10, I had to use double quotes for this to work ...
sed -i "s/[\d128-\d255]//g" FILENAME
Noticed these things ...
For FILENAME, the entire path\name needs to be quoted.
This didn't work -- %TEMP%\"FILENAME"
This did -- "%TEMP%\FILENAME"
sed leaves behind temp files in the current directory, named sed*

Related

General solutions to replace string regex preceded and followed by '\n'

I have a file in CentOS which looks like following
[root@localhost nn]# cat -A excel.log
real1$
0.5^I0.5^I0.5^I1^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I^I0.5^I0.5^I0.5^I1^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I^I0.5^I0.5^I0.5^I1^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I$
real2$
0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I^I0.5^I0.5^I0.5^I1^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I^I0.5^I0.5^I0.5^I1^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I$
real3$
0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I1^I0.5^I0.5^I0.5^I0.5^I^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I1^I0.5^I0.5^I0.5^I0.5^I$
real4$
0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I1^I0.5^I1^I0.5^I0.5^I0.5^I1^I0.5^I0.5^I0.5^I0.5^I$
real5$
0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I1^I0.5^I0.5^I0.5^I0.5^I^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I1^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I0.5^I1^I0.5^I0.5^I0.5^I1^I0.5^I0.5^I0.5^I0.5^I$
real6$
I would like to replace \nreal[2-6]\n with \t\t\t and have tried the following without success:
sed -i 's/\nreal[2-6]\n/\t\t\t/g' file
It seems that sed has difficulty dealing with line breaks. Any idea how to apply this regex in CentOS?
Much appreciated!

If you want to consider perl then use:
perl -i -0777 -pe 's/\n(?:51[23]real|real[2-6])(?:\n|\z)/\t\t\t/g' file
If you want to avoid last real\d+ line to be replaced with \t\t\t then use:
perl -i -0777 -pe 's/\n(?:51[23]real|real[2-6])\n(?!\z)/\t\t\t/g' file
(?!\z) is a negative lookahead that fails the match when the end of the input is just ahead of us.

With GNU sed, you need to use the -z option:
sed -i -z 's/\nreal[2-6]\n/\t\t\t/g' file
#      ^^
Since you also want to handle specific alternations, you need to enable the POSIX ERE syntax with either the -r or the -E option:
sed -i -Ez 's/\n(51[23]real|real[2-6])\n/\t\t\t/g' file
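
To see the effect of -z without touching a real file, here is a small sketch on a made-up sample shaped like excel.log. It assumes GNU sed and GNU cat (for cat -A); the helper name join_real_rows and the sample data are invented for illustration.

```shell
# join_real_rows FILE  (hypothetical helper name): print FILE with every
# "\nrealN\n" separator (N = 2..6) replaced by three tabs.
join_real_rows() {
    # -z makes GNU sed split records on NUL instead of newline, so a text
    # file without NUL bytes is handled as one record and \n can be matched
    sed -z 's/\nreal[2-6]\n/\t\t\t/g' "$1"
}

# Tiny stand-in for excel.log
sample=$(mktemp)
printf 'real1\n0.5\t1\t\nreal2\n0.5\t0.5\t\nreal3\n' > "$sample"
join_real_rows "$sample" | cat -A    # tabs show up as ^I, newlines as $
rm -f "$sample"
```

Note that the trailing "real3" line is replaced as well, because it is also surrounded by newlines. The caveat noted above for ed(1) applies here too: with -z the whole file is held in memory as one record.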

Removing bullet point characters from text file with sed

I have a large text file in which some lines start with a bullet point (•). I'd like to remove those. I've tried
sed 's/\u2022//g' filename.txt
but that doesn't match the bullets. I've also tried pasting the bullet into my sed command, but also with no success.
E: The output of
sed --version
is
sed (GNU sed) 4.2.2
E2: If it helps figure out how to capture the bullet characters, they were originally added in Access.
E3: As suggested in the comments,
echo -n '•' | hexdump -C
returns
00000000 95 |.|
00000001

I suggest with GNU sed:
sed 's/\x95//g' file

This is a working command for me:
# Force paste the bullet into the command line
sed 's/^•//g' filename.txt
If it doesn't work, try escaping with echo:
sed 's/^'"$(echo -ne '\u2022')"'//g' filename.txt
As PesaThe suggests, you can also use printf for escaping:
sed 's/^'"$(printf '\u2022')"'//g' filename.txt

It looks like sed doesn't understand \u sequences.
According to the user manual it should be compatible with POSIX.2 BRE, which I thought would work, but it doesn't.
You can try matching the hexadecimal byte sequence (which I got using hexdump -C).
sed 's/^\xe2\x80\xa2//g' filename.txt
Or, alternatively, you could force bash to parse it. Just add a $ before the string.
sed $'s/\u2022//g' filename.txt
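
The answers in this thread match two different encodings of the bullet: the single byte 0x95 (Windows-1252, which is what the hexdump in the question shows) and the three bytes e2 80 a2 (UTF-8). A hedged sketch that strips a leading bullet in either encoding, assuming GNU sed; the function name strip_bullets is hypothetical.

```shell
# strip_bullets  (hypothetical helper name): filter stdin to stdout, removing
# a bullet at the start of each line in either of two encodings.
strip_bullets() {
    # C locale so sed compares raw bytes; \xHH is a GNU sed extension
    LC_ALL=C sed -e 's/^\xe2\x80\xa2//' \
                 -e 's/^\x95//'
    # first expression: UTF-8 bullet (U+2022); second: Windows-1252 bullet
}
```

It could be used as strip_bullets < filename.txt > cleaned.txt, where cleaned.txt is just an example output name.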

Extract few matching strings from matching lines in file using sed

I have a file with strings similar to this:
abcd u'current_count': u'2', u'total_count': u'3', u'order_id': u'90'
I have to find current_count and total_count for each line of the file. I am trying the command below, but it's not working. Please help.
grep current_count file | sed "s/.*\('current_count': u'\d+'\).*/\1/"
It is outputting the whole line but I want something like this:
'current_count': u'3', 'total_count': u'3'

It's printing the whole line because the pattern in the s command doesn't match, so no substitution happens.
sed regexes don't support \d for digits, or x+ for xx*. GNU sed has a -r option to enable extended-regex support so + will be a meta-character, but \d still doesn't work. GNU sed also allows \+ as a meta-character in basic regex mode, but that's not POSIX standard.
So anyway, this will work:
echo -e "foo\nabcd u'current_count': u'2', u'total_count': u'3', u'order_id': u'90'" |
sed -nr "s/.*('current_count': u'[0-9]+').*/\1/p"
# output: 'current_count': u'2'
Notice that I skip the grep by using sed -n s///p. I could also have used /current_count/ as an address:
sed -r -e '/current_count/!d' -e "s/.*('current_count': u'[0-9]+').*/\1/"
Or with just grep printing only the matching part of the pattern, instead of the whole line:
grep -E -o "'current_count': u'[[:digit:]]+'"
(or egrep instead of grep -E). I forget if grep -o is POSIX-required behaviour.
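
The question asked for both fields on one line. Extending the sed -n s///p approach with a second capture group might look like the sketch below. It assumes current_count always comes before total_count on the line; the function name extract_counts is made up.

```shell
# extract_counts  (hypothetical helper name): read lines on stdin and print
# both fields for every line that contains them, in this order.
extract_counts() {
    # -n plus the p flag prints only lines where the substitution succeeded;
    # -E enables + and unescaped ( ) groups (same as -r in GNU sed)
    sed -nE "s/.*('current_count': u'[0-9]+').*('total_count': u'[0-9]+').*/\1, \2/p"
}

printf '%s\n' "foo" \
    "abcd u'current_count': u'2', u'total_count': u'3', u'order_id': u'90'" |
    extract_counts
# output: 'current_count': u'2', 'total_count': u'3'
```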

For me this looks like some sort of serialized Python data. Basically I would try to find out the origin of that data and parse it properly.
However, while being hackish, sed can also be used here:
sed "s/.*current_count': [a-z]'\([0-9]\+\).*/\1/" input.txt
sed "s/.*total_count': [a-z]'\([0-9]\+\).*/\1/" input.txt

sed - Commenting a line matching a specific string AND that is not already commented out

I have the following test file
AAA
BBB
CCC
Using the following sed I can comment out the BBB line.
# sed -e '/BBB/s/^/#/g' -i file
I'd like to comment out the line only if it does not already have a # at the beginning.
# sed -e '/^#/! /BBB/s/^/#/g' file
sed: -e expression #1, char 7: unknown command: `/'
Any ideas how I can achieve this?

Assuming you don't have any lines with multiple #s this would work:
sed -e '/BBB/ s/^#*/#/' -i file
Note: you don't need /g since you are doing at most one substitution per line.

Another solution with the & special character which references the whole matched portion of the pattern space. It's a bit simpler/cleaner than capturing and referencing a regexp group.
sed -i 's/^[^#]*BBB/#&/' file

I find this solution to work the best.
sed -i '/^[^#]/ s/\(^.*BBB.*$\)/#\ \1/' file
It doesn't matter how many "#" symbols there are, it will never add another one. If the pattern you're searching for does not include a "#" it will add it to the beginning of the line, and it will also add a trailing space.
If you don't want a trailing space
sed -i '/^[^#]/ s/\(^.*BBB.*$\)/#\1/' file

Assuming the BBB is at the beginning of a line, I ended up using an even simpler expression:
sed -e '/^BBB/s/^/#/' -i file
One more note for the future me. Do not overlook the -i. Because this won't work: sed -e "..." same_file > same_file.

sed -i '/![^#]/ s/\(^.*BBB.*$\)/#\ \1/' file
This doesn't work for me with the keyword *.sudo; no lines get commented at all...
Only the syntax below works:
sed -e '/sudo/ s/^#*/#/' file

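Actually, you don't need the exclamation mark (!): a caret inside square brackets already negates the set, so [^#] matches any character other than a hash. Anchor it with ^ so that the test applies to the first character of the line; unanchored, /[^#]/ matches any line that contains a non-hash character anywhere, which includes lines that are already commented out. This example worked for me:
sed -i '/^[^#]/ s/\(^.*BBB.*$\)/#\ \1/' file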

Comment out every line containing "BBB" that hasn't been commented out yet.
sed -i '/BBB/s/^#\?/#/' file

If BBB is at the beginning of the line:
sed 's/^BBB/#&/' -i file
If BBB is in the middle of the line:
sed 's/^[^#]*BBB/#&/' -i file

I'd usually supply sed with -i.bak to backup the file prior to making changes to the original copy:
sed -i.bak '/BBB/ s/^#*/#/' file
This way when done, I have both file and file.bak and I can decide to delete file.bak only after I'm confident.

If you want to comment out not only exact matches for 'BBB' but also lines that have 'BBB' somewhere in the middle, you can go with following solution:
sed -E '/^([^#].*)?BBB/ s/^/#/'
This won't change any strings that are already commented out.
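
A quick way to check that a command like this is idempotent is to run it twice on a scratch file and confirm that the second run changes nothing. A minimal sketch, assuming GNU sed for -i and -E; the helper name comment_bbb and the sample lines are invented.

```shell
# comment_bbb FILE  (hypothetical helper name): comment out lines that
# contain BBB unless they already start with '#'.
comment_bbb() {
    sed -i -E '/^([^#].*)?BBB/ s/^/#/' "$1"
}

f=$(mktemp)
printf 'AAA\nBBB\n#BBB\nxx BBB\nCCC\n' > "$f"

comment_bbb "$f"
comment_bbb "$f"    # second run must leave the file unchanged

result=$(cat "$f")
rm -f "$f"
printf '%s\n' "$result"
# prints:
# AAA
# #BBB
# #BBB
# #xx BBB
# CCC
```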

How do I remove colons from a list of MAC addresses?

I'm having a hard time trying to remove the colons in a list of MAC addresses.
My file:
00:21:5A:28:62:BF
00:24:81:0A:04:44
Expected Output:
00215A2862BF
0024810A0444

Given your tags, you want to accomplish this in a shell:
cat file | sed s/://g
edit: you don't really need the cat either if you are reading from a file:
sed s/://g file

perl -pe "s/://g" yourfile

echo "00:21:5A:28:62:BF" | sed -e 's/://g'
00215A2862BF

tr -d ':' < file
will probably work too, though I don't have a command line handy to check the syntax.
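
For what it's worth, the tr syntax can be checked without a file by feeding it the two addresses from the question:

```shell
# -d deletes every occurrence of the listed characters from stdin
printf '00:21:5A:28:62:BF\n00:24:81:0A:04:44\n' | tr -d ':'
# output:
# 00215A2862BF
# 0024810A0444
```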
