Removing bullet point characters from text file with sed - regex

I have a large text file in which some lines start with a bullet point (•). I'd like to remove those. I've tried
sed 's/\u2022//g' filename.txt
but that doesn't match the bullets. I've also tried pasting the bullet into my sed command, but also with no success.
E: The output of
sed --version
is
sed (GNU sed) 4.2.2
E2: If it helps figure out how to capture the bullet characters, they were originally added in Access.
E3: As suggested in the comments,
echo -n '•' | hexdump -C
returns
00000000 95 |.|
00000001

I suggest with GNU sed:
sed 's/\x95//g' file
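A minimal sketch of that approach, assuming GNU sed (for the \x escape) and input where the bullet is the single byte 0x95 reported by hexdump in the question. The sample text and variable names are made up for illustration. Setting LC_ALL=C is an extra precaution so that sed treats the input as raw bytes rather than as invalid UTF-8:

```shell
# Build a two-line sample whose first line starts with the single byte 0x95
# (octal 225), the bullet byte that hexdump reported.
sample=$(printf '\225 first item\nplain line\n')

# LC_ALL=C makes sed match byte by byte; \x95 is a GNU sed escape.
cleaned=$(printf '%s\n' "$sample" | LC_ALL=C sed 's/^\x95//')
printf '%s\n' "$cleaned"
```

Only the bullet byte is removed; the space that followed it stays in place.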

This is a working command for me:
# Force paste the bullet into the command line
sed 's/^•//g' filename.txt
If it doesn't work, try escaping with echo:
sed 's/^'"$(echo -ne '\u2022')"'//g' filename.txt
As PesaThe suggests, you can also use printf for escaping:
sed 's/^'"$(printf '\u2022')"'//g' filename.txt

It looks like sed doesn't understand \u sequences.
According to the user manual, it should be compatible with POSIX.2 BRE, which I think should work, but it doesn't.
You can try matching the hexadecimal byte sequence instead (which I got using hexdump -C).
sed 's/^\xe2\x80\xa2//g' filename.txt
Or, alternatively, you could force bash to parse it. Just add a $ before the string.
sed $'s/\u2022//g' filename.txt
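Since the bullet may be stored either as UTF-8 (bytes e2 80 a2) or as the single byte 95 shown in the question, one option is to strip both forms at the start of a line. This is only a sketch: strip_bullets is a hypothetical helper name, and the \x escapes assume GNU sed.

```shell
# Hypothetical helper: remove a leading bullet in either encoding.
# LC_ALL=C makes sed work on raw bytes (\x escapes assume GNU sed).
strip_bullets() {
    LC_ALL=C sed -e 's/^\xe2\x80\xa2//' -e 's/^\x95//'
}

# Sample input: a UTF-8 bullet (octal 342 200 242) and a single-byte one (octal 225).
utf8_result=$(printf '\342\200\242 apples\n' | strip_bullets)
cp1252_result=$(printf '\225 pears\n' | strip_bullets)
printf '%s\n%s\n' "$utf8_result" "$cp1252_result"
```

Running hexdump -C on a line of the real file first shows which of the two byte sequences is actually present.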

Related

sed remove lines that starts with a specific pattern

I'm trying to use the sed command with a regex pattern that works fine with grep, but it isn't matching anything with sed.
I have a text file and want to delete each line that starts with (wow or waw).
This is the command I'm using, but it's not working:
sed -i '/^w\(o\|a\)w/d' text.txt
I tried using the same pattern with grep and it works fine:
grep '^w\(o\|a\)w' text.txt
Is anything wrong with the regex in the sed command?
With GNU sed, you can use
sed -i '/^w[oa]w/d' file
With FreeBSD sed, use
sed -i '' '/^w[oa]w/d' file
Here, [oa] is a bracket expression matching either o or a.
See an online sed demo:
sed '/^w[oa]w/d' <<< "wow 1
waw 2
wiw 3"
Output: wiw 3.
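If you would rather keep the alternation from the question, a possible variant is to switch to extended regular expressions with -E, where ( | ) need no backslashes. This sketch assumes a sed that accepts -E (current GNU and BSD versions do) and reuses the sample lines from the demo:

```shell
# Sample input from the demo: only "wiw 3" should survive.
input=$(printf 'wow 1\nwaw 2\nwiw 3\n')

# -E enables extended regex syntax, so (o|a) is an unescaped alternation.
kept=$(printf '%s\n' "$input" | sed -E '/^w(o|a)w/d')
printf '%s\n' "$kept"
```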

remove non latin-1 characters in a text file [duplicate]

I want to remove all the non-ASCII characters from a file in place.
I found one solution with tr, but I guess I need to write back that file after modification.
I need to do it in place with relatively good performance.
Any suggestions?
A perl oneliner would do: perl -i.bak -pe 's/[^[:ascii:]]//g' <your file>
-i says that the file is going to be edited inplace, and the backup is going to be saved with extension .bak.
# -i (inplace)
sed -i 's/[\d128-\d255]//g' FILENAME
I tried all the solutions and nothing worked. The following, however, does:
tr -cd '\11\12\15\40-\176'
Which I found here:
https://alvinalexander.com/blog/post/linux-unix/how-remove-non-printable-ascii-characters-file-unix
My problem needed it in a series of piped programs, not directly from a file, so modify as needed.
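Because tr only filters a stream, using it on a file means writing to a temporary file and moving that over the original. A rough sketch, with a throwaway sample file created by mktemp standing in for the real one:

```shell
# Hypothetical sample file containing "café" encoded as UTF-8 (bytes C3 A9).
file=$(mktemp)
printf 'caf\303\251 ok\n' > "$file"

# tr cannot edit in place, so write to a temporary file and move it back.
# The set keeps tab, newline, carriage return and printable ASCII only.
tmp=$(mktemp)
LC_ALL=C tr -cd '\11\12\15\40-\176' < "$file" > "$tmp" && mv "$tmp" "$file"

result=$(cat "$file")
printf '%s\n' "$result"
```

The && ensures the original is only replaced when tr succeeded.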
sed -i 's/[^[:print:]]//' FILENAME
Also, this acts like dos2unix
Try tr instead of sed
tr -cd '[:print:]' < file.txt
# -i (inplace)
LANG=C sed -i -E "s|[\d128-\d255]||g" /path/to/file(s)
The LANG=C part's role is to avoid an Invalid collation character error.
Based on Ivan's answer and Patrick's comment.
This worked for me:
sed -i 's/[^[:print:]]//g'
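One caveat worth illustrating: [:print:] depends on the locale, so in a UTF-8 locale accented letters usually count as printable and are kept. Forcing the C locale restricts the class to printable ASCII. A small sketch on made-up input, reading from a pipe rather than editing in place:

```shell
# "café" in UTF-8; the é is the two bytes C3 A9 (octal 303 251).
# In the C locale neither byte is printable, so both are removed.
ascii_only=$(printf 'caf\303\251\n' | LC_ALL=C sed 's/[^[:print:]]//g')
printf '%s\n' "$ascii_only"
```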
I'm using a very minimal busybox system, in which there is no support for ranges in tr or POSIX character classes, so I have to do it the crappy old-fashioned way. Here's the solution with sed, stripping ALL non-printable non-ASCII characters from the file:
sed -i 's/[^a-zA-Z 0-9`~!##$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' FILE
As an alternative to sed or perl you may consider using ed(1) and POSIX character classes.
Note: ed(1) reads the entire file into memory to edit it in-place, so for really large files you should use sed -i ..., perl -i ...
# see:
# - http://wiki.bash-hackers.org/doku.php?id=howto:edit-ed
# - http://en.wikipedia.org/wiki/Regular_expression#POSIX_character_classes
# test
echo $'aaa \177 bbb \200 \214 ccc \254 ddd\r\n' > testfile
ed -s testfile <<< $',l'
ed -s testfile <<< $'H\ng/[^[:graph:][:space:][:cntrl:]]/s///g\nwq'
ed -s testfile <<< $',l'
awk '{ sub("[^a-zA-Z0-9\"!##$%^&*|_\[](){}", ""); print }' MYinputfile.txt > pipe_out_to_CONVERTED_FILE.txt
I appreciate the tips I found on this site.
But, on my Windows 10, I had to use double quotes for this to work ...
sed -i "s/[\d128-\d255]//g" FILENAME
Noticed these things ...
For FILENAME the entire path\name needs to be quoted
This didn't work -- %TEMP%\"FILENAME"
This did -- "%TEMP%\FILENAME"
sed leaves behind temp files in the current directory, named sed*

Get specific Text between Specific Tags

At the top of my HTML files, I have...
<H2>City</H2>
<P>Liverpool</P>
or
<H2>City</H2>
<P>Dublin</P>
I want to output the text between the tags straight after <H2>City</H2> instances. So in the examples above which are separate files, I want to print out Liverpool and in the second example, Dublin.
Looking at this thread, I tried:
sed -e 's/City\(.*\)\/P/\1/'
which I hope would get me half way there... but that just prints out the entire file. Any ideas?
awk to the rescue! You need multi-char RS support though (gawk has it)
$ awk -F'[<>]' -v RS='<H2>City</H2>' 'NF{print $3}' file
another approach can be
$ awk 'c&&c--{sub(/<[^>]*>/,""); print} /<H2>City<\/H2>/{c=1}' file
find the next record after City and trim the angle brackets...
Try using the following regex:
(?s)(?<=City<\/H2>\n<P>).*?(?=<\/P>)
see regex demo / explanation
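With sed, that regex cannot be used as-is: sed only supports POSIX basic and extended regular expressions, so (?s) and lookarounds are not available.
I checked, and \s does not help here either. Match the newline character \n explicitly instead. Because sed reads one line at a time, first append the following line to the pattern space with N:
sed -n '/<H2>City<\/H2>/{N;s/.*\n<P>\(.*\)<\/P>/\1/p;}' file
There is no need for lookbehind here; it is overkill.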
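With sed, you can use the n command to read the next line after your pattern, then remove the tags to output the content. Grouping the commands in braces restricts the substitution to the line that follows the match, so other <P> lines in the file are not printed:
sed -n '/<H2>City<\/H2>/{n;s/ *<\/*P> *//gp;}' file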
I think this should work on your Mac:
echo -e "<H2>City</H2>\n<P>Dublin</P>" |awk -F"[<>]" '/City/{getline;print $3}'
Dublin
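A variant of the getline idea that anchors on the full <H2>City</H2> tag and stops after the first hit, so other paragraphs are ignored. It is a sketch on made-up sample HTML and assumes the <P> line immediately follows the heading:

```shell
# Made-up sample: an unrelated paragraph comes before the City heading.
html=$(printf '<H2>Intro</H2>\n<P>Skip me</P>\n<H2>City</H2>\n<P>Dublin</P>\n')

# On the City heading, read the next line; splitting on < and > puts
# the text between <P> and </P> in field 3. exit stops after the first hit.
city=$(printf '%s\n' "$html" |
    awk -F'[<>]' '/<H2>City<\/H2>/ { if ((getline) > 0) print $3; exit }')
printf '%s\n' "$city"
```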

Extract few matching strings from matching lines in file using sed

I have a file with strings similar to this:
abcd u'current_count': u'2', u'total_count': u'3', u'order_id': u'90'
I have to find current_count and total_count for each line of the file. I am trying the command below, but it's not working. Please help.
grep current_count file | sed "s/.*\('current_count': u'\d+'\).*/\1/"
It is outputting the whole line but I want something like this:
'current_count': u'3', 'total_count': u'3'
It's printing the whole line because the pattern in the s command doesn't match, so no substitution happens.
sed regexes don't support \d for digits, or x+ for xx*. GNU sed has a -r option to enable extended-regex support so + will be a meta-character, but \d still doesn't work. GNU sed also allows \+ as a meta-character in basic regex mode, but that's not POSIX standard.
So anyway, this will work:
echo -e "foo\nabcd u'current_count': u'2', u'total_count': u'3', u'order_id': u'90'" |
sed -nr "s/.*('current_count': u'[0-9]+').*/\1/p"
# output: 'current_count': u'2'
Notice that I skip the grep by using sed -n s///p. I could also have used /current_count/ as an address:
sed -r -e '/current_count/!d' -e "s/.*('current_count': u'[0-9]+').*/\1/"
Or with just grep printing only the matching part of the pattern, instead of the whole line:
grep -E -o "'current_count': u'[[:digit:]]+'"
(or egrep instead of grep -E). Note that grep -o is not specified by POSIX, although GNU and BSD grep both support it.
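To get both counts on one line, as the question asks, one possibility is two capture groups in a single substitution. This sketch assumes the two fields always appear next to each other in this order, and uses -E (accepted by current GNU and BSD sed; -r is the older GNU spelling):

```shell
# Sample line from the question.
line="abcd u'current_count': u'2', u'total_count': u'3', u'order_id': u'90'"

# Group 1 captures the current_count pair, group 2 the total_count pair;
# the u prefix in front of each key is matched but left out of the groups.
counts=$(printf '%s\n' "$line" |
    sed -nE "s/.*u('current_count': u'[0-9]+'), u('total_count': u'[0-9]+').*/\1, \2/p")
printf '%s\n' "$counts"
```

Lines that do not contain both fields print nothing, because of -n together with the p flag.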
For me this looks like some sort of serialized Python data. Basically I would try to find out the origin of that data and parse it properly.
However, while being hackish, sed can also be used here:
sed "s/.*current_count': [a-z]'\([0-9]\+\).*/\1/" input.txt
sed "s/.*total_count': [a-z]'\([0-9]\+\).*/\1/" input.txt

sed: mix explicit and regex phrases

I'm trying to write a sed command to remove a specific string followed by two digits. So far I have:
sed -e 's/bizzbuzz\([0-9][0-9]\)//' file.txt
but I can't seem to get the syntax right. Any suggestions?
sed -re 's/bizzbuzz[0-9]{2}//' file.txt
and
sed -re 's/\bbizzbuzz[0-9]{2}\b//' file.txt
if the searched string has word boundaries
sed -e 's/bizzbuzz[0-9]\{2\}//' file.txt
if you don't have GNU sed
Your current approach seems like it should work fine:
$ echo 'FOO bizzbuzz56 BAR' | sed -e 's/bizzbuzz\([0-9][0-9]\)//'
FOO  BAR
As said in the other answer, the syntax seems to be fine (with unnecessary parentheses).
But maybe you want to replace all the strings found in each line? In that case, you should add a 'g' at the end of the 's' command:
sed -e 's/bizzbuzz\([0-9][0-9]\)//g' file.txt
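To see the effect of the g flag together with the \{2\} interval, here is a small sketch on a made-up line containing three candidates, one of which has only a single digit and so should be left alone:

```shell
# Made-up input: two matches (56 and 12) and one non-match (a single digit).
line='a bizzbuzz56 b bizzbuzz7 c bizzbuzz12 d'

# \{2\} is the basic-regex interval for "exactly two"; g replaces every match.
out=$(printf '%s\n' "$line" | sed 's/bizzbuzz[0-9]\{2\}//g')
printf '%s\n' "$out"
```

Each removed match leaves the two surrounding spaces behind, which is why doubled spaces appear in the result.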