Delete any special character using Sed - regex

I have yet another list of subdomains. I want to remove any wildcard subdomain that includes one of these special characters:
()!&$#*+?
Mostly the junk characters appear as a prefix, but they can also appear in the middle. Here is some sample data:
(www.imgur.com
***************diet.blogspot.com
*-1.gbc.criteo.com
------------------------------------------------------------i.imgur.com
This has been quite an inconvenience while scanning through the list. As always, I'm trying sed to fix it:
sed -i "/[!()#$&?+]/d" foo.txt ###Didn't work
sed -i "/[\!\(\)\#\$\&\?\+]/d" ###Escaping char didn't work
After running the commands above, the list is still unchanged and the file is still in its original state. My next idea is to chain a series of sed commands to remove the lines one character at a time:
cat foo.txt | sed -e "/!/d" -e "/#/d" -e "/\*/d" -e "/\$/d" -e "/(/d" -e "/)/d" -e "/+/d" -e "/\'/d" -e "/&/d" >> foo2.txt
cat foo.txt | sed -e "/\!/d" | sed -e "/\#/d" | sed -e "/\*/d" | sed -e "/\$/d" | sed -e "/\+/d" | sed -e "/\'/d" | sed -e "/\&/d" >> foo2.txt
If escaping every special character doesn't work, my logic must be wrong. I also tried adding /g, with no better luck.
As a side note: I don't want - to be deleted, as some valid subdomains contain a - character:
line-apps.com
line-apps-beta.com
line-apps-rc.com
line-apps-dev.com
Any help would be cherished.

Using sed
$ sed '/[[:punct:]]/d' input_file
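This deletes every line that contains a punctuation character. Be aware that [[:punct:]] also includes . and -, so on a list of domain names it would delete every line; it would help if you provided sample data.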

To do what you're trying to do in your answer (which adds [ and ] and more to the set of characters in your question) would be:
sed '/[][!?+,#$&*() ]/d'
or just:
grep -v '[][!?+,#$&*() ]'
Per POSIX, to include ] in a bracket expression it must be the first character; otherwise it indicates the end of the bracket expression.
Consider printing lines you want instead of deleting lines you do not want, though, e.g.:
grep '^[[:alnum:]_.-]*$' file
to print lines that only contain letters, numbers, underscores, dashes, and/or periods.
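As a minimal sketch of both approaches, here they are applied to the sample lines from the question. The dashed line is shortened and the variable names are made up for illustration. Single quotes stop the shell from interpreting $ and !, and the stricter whitelist additionally assumes that a valid name must start with a letter or digit.

```shell
# Sample lines from the question; the dashed line is shortened.
sample='(www.imgur.com
***************diet.blogspot.com
*-1.gbc.criteo.com
---i.imgur.com
line-apps.com
line-apps-beta.com'

# Blacklist: delete every line containing one of ()!&$#*+?
# Inside a bracket expression these characters are literal, so no escaping is needed.
blacklisted=$(printf '%s\n' "$sample" | sed '/[()!&$#*+?]/d')

# Whitelist (assumption: a valid name starts with a letter or digit).
whitelisted=$(printf '%s\n' "$sample" | grep '^[[:alnum:]][[:alnum:]_.-]*$')

printf 'blacklist keeps:\n%s\n\nwhitelist keeps:\n%s\n' "$blacklisted" "$whitelisted"
```

The blacklist still keeps the dashed line because - is not in the set, which is one reason the whitelist tends to be the more robust choice.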

Related

Extract text between two markers and replace a character

I want to change
>lcl|ORF183:9482:8118 unnamed protein product
into
>ORF183:9482-8118
Keep everything after | and before 'white space', plus replacing second : to -
So far I'm doing it with the following code:
sed -e '/^>/s/ .*//' -e '/^>/s/|/ /' -e '/^>/s/lcl //' -e '/^>/s/\(.*\):/\1-/'
but wish to do it in a simpler one-line code.
This could work:
sed -e 's/\(^.*|\)\(.*\):\(.*\):\(.*\)[[:space:]]\(unnamed.*$\)/>\2:\3-\4/'
Here's some improvements based on code you've tried
$ sed -e '/^>/s/ .*//' -e '/^>/s/lcl|//' -e '/^>/s/:/-/2' ip.txt
>ORF183:9482-8118
-e '/^>/s/|/ /' -e '/^>/s/lcl //' can be simplified to -e '/^>/s/lcl|//'
use s/>[^|]*|/>/ if you wish to match any text between > and |
sed lets you specify which occurrence of the match to replace; s/:/-/2 replaces the 2nd : with -
If your sed implementation allows grouping, you can group all the commands (separated by ;) inside {} for a particular address
$ sed '/^>/{s/ .*//; s/lcl|//; s/:/-/2}' ip.txt
>ORF183:9482-8118
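The points above can be combined into one small filter. This is only a sketch: the function name clean_header and the sequence line MKV are hypothetical, and the commands are written one per line so the block does not depend on how a given sed treats ; inside {}.

```shell
# Rewrite FASTA-style header lines; leave all other lines untouched.
clean_header() {
  sed '/^>/{
    s/ .*//
    s/>[^|]*|/>/
    s/:/-/2
  }' "$@"
}

# First line is the header from the question, second is a made-up sequence line.
result=$(printf '%s\n' '>lcl|ORF183:9482:8118 unnamed protein product' 'MKV' | clean_header)
printf '%s\n' "$result"
```

The three commands run in order on each header: drop everything from the first space, drop the text between > and |, then turn the second : into -.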
Please visit https://stackoverflow.com/tags/sed/info for learning resources and other goodies

sed: struggling with substitution and regex for ^*=

I am running a Linux bash script. From stdout lines like /gpx/trk/name=MyTrack1, I want to keep only the end of the line after =.
I am struggling to understand why the following sed command is not working as I expect:
echo "/gpx/trk/name=MyTrack1" | sed -e "s/^*=//"
(I also tried)
echo "/gpx/trk/name=MyTrack1" | sed -e "s/^*\=//"
The return is always /gpx/trk/name=MyTrack1 and not MyTrack1
An even simpler way if this is the only structure you are concerned about:
echo "/gpx/trk/name=MyTrack1" | cut -d = -f 2
Simply try:
echo "/gpx/trk/name=MyTrack1" | sed 's/.*=//'
Solution 2nd: With another sed.
echo "/gpx/trk/name=MyTrack1" | sed 's/\(.*=\)\(.*\)/\2/'
Explanation, as requested by the OP:
s: tells sed to perform a substitution.
\(.*=\): the first capture group. It holds everything from the start of the line up to and including the last =, so /gpx/trk/name= ends up in group 1.
\(.*\): the second capture group. It holds everything after the first group's match, which here is MyTrack1.
/\2/: replaces the whole matched line with the contents of group 2, i.e. MyTrack1.
Solution 3rd: Or with awk considering that your Input_file is same as shown samples.
echo "/gpx/trk/name=MyTrack1" | awk -F'=' '{print $2}'
Solution 4th: With awk's match.
echo "/gpx/trk/name=MyTrack1" | awk 'match($0,/=.*$/){print substr($0,RSTART+1,RLENGTH-1)}'
$ echo "/gpx/trk/name=MyTrack1" | sed -e "s/^.*=//"
MyTrack1
The regular expression ^.*= matches anything up to and including the last = in the string.
Your regular expression ^*= would match the literal string *= at the start of a string, e.g.
$ echo "*=/gpx/trk/name=MyTrack1" | sed -e "s/^*=//"
/gpx/trk/name=MyTrack1
The * character in a regular expression usually modifies the immediately previous expression so that zero or more of it may be matched. When * occurs at the start of an expression on the other hand, it matches the character *.
Not to take you off the sed track, but this is easy with Bash alone:
$ echo "$s"
/gpx/trk/name=MyTrack1
$ echo "${s##*=}"
MyTrack1
The ##*= pattern removes the maximal pattern from the beginning of the string to the last =:
$ s="1=2=3=the rest"
$ echo "${s##*=}"
the rest
The equivalent in sed would be:
$ echo "$s" | sed -E 's/^.*=(.*)/\1/'
the rest
Where #*= would remove the minimal pattern:
$ echo "${s#*=}"
2=3=the rest
And in sed:
$ echo "$s" | sed -E 's/^[^=]*=(.*)/\1/'
2=3=the rest
Note the difference in * in Bash string functions vs a sed regex:
The * in Bash (in this context) is a glob: on its own it matches any string of characters, including the empty string
The * in a regex applies to the previous pattern, so for 'any string of characters' you need .*
Bash has extensive string manipulation functions. You can read about Bash string patterns in BashFAQ.
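The prefix operators # and ## have suffix counterparts % and %%. As a small sketch (the variable names key, value and leaf are only illustrative), the same line can be split into its parts without calling sed at all:

```shell
s='/gpx/trk/name=MyTrack1'

key=${s%%=*}     # remove the longest suffix that starts with "=": /gpx/trk/name
value=${s#*=}    # remove the shortest prefix that ends with "=": MyTrack1
leaf=${key##*/}  # remove the longest prefix that ends with "/":  name

printf '%s -> %s\n' "$leaf" "$value"
```

These expansions are part of POSIX sh as well, so they should behave the same outside Bash.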

Extract few matching strings from matching lines in file using sed

I have a file with strings similar to this:
abcd u'current_count': u'2', u'total_count': u'3', u'order_id': u'90'
I have to find current_count and total_count for each line of the file. I am trying the command below but it's not working. Please help.
grep current_count file | sed "s/.*\('current_count': u'\d+'\).*/\1/"
It is outputting the whole line but I want something like this:
'current_count': u'3', 'total_count': u'3'
It's printing the whole line because the pattern in the s command doesn't match, so no substitution happens.
sed regexes don't support \d for digits, or x+ for xx*. GNU sed has a -r option to enable extended-regex support so + will be a meta-character, but \d still doesn't work. GNU sed also allows \+ as a meta-character in basic regex mode, but that's not POSIX standard.
So anyway, this will work:
echo -e "foo\nabcd u'current_count': u'2', u'total_count': u'3', u'order_id': u'90'" |
sed -nr "s/.*('current_count': u'[0-9]+').*/\1/p"
# output: 'current_count': u'2'
Notice that I skip the grep by using sed -n s///p. I could also have used /current_count/ as an address:
sed -r -e '/current_count/!d' -e "s/.*('current_count': u'[0-9]+').*/\1/"
Or with just grep printing only the matching part of the pattern, instead of the whole line:
grep -E -o "'current_count': u'[[:digit:]]+'"
(or egrep instead of grep -E). I forget if grep -o is POSIX-required behaviour.
For me this looks like some sort of serialized Python data. Basically I would try to find out the origin of that data and parse it properly.
However, while hackish, sed can also be used here:
sed "s/.*current_count': [a-z]'\([0-9]\+\).*/\1/" input.txt
sed "s/.*total_count': [a-z]'\([0-9]\+\).*/\1/" input.txt
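For the single-line output the question asked for, both fields can be captured in one substitution. This is a sketch that assumes current_count always appears before total_count on the line and that your sed accepts -E (GNU and BSD sed both do); the function name extract_counts is made up.

```shell
# Print "'current_count': u'N', 'total_count': u'M'" for each line that has both fields.
extract_counts() {
  sed -nE "s/.*('current_count': u'[0-9]+').*('total_count': u'[0-9]+').*/\1, \2/p" "$@"
}

counts=$(printf '%s\n' \
  "foo" \
  "abcd u'current_count': u'2', u'total_count': u'3', u'order_id': u'90'" |
  extract_counts)
printf '%s\n' "$counts"
```

Lines without both fields print nothing because of -n combined with the p flag, so no separate grep is needed.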

Sed : print all lines after match

I got my search results using sed:
zcat file* | sed -e 's/.*text=\(.*\)status=[^/]*/\1/' | cut -f 1 - | grep "pattern"
But it only shows the part that I cut. How can I print all lines after a match ?
I'm using zcat so I cannot use awk.
Thanks.
Edited :
This is my log file :
[01/09/2015 00:00:47] INFO=54646486432154646 from=steve idfrom=55516654455457 to=jone idto=5552045646464 guid=100021623456461451463 num=6 text=hi my number is 0 811 22 1/12 status=new survstatus=new
My aim is to find all users that spam my site with their telephone numbers (using grep "pattern") then print all the lines to get all the information about each spam. The problem is there may be matches in INFO or id, so I use sed to get the text first.
Printing all lines after a match in sed:
$ sed -ne '/pattern/,$ p'
# alternatively, if you don't want to print the match:
$ sed -e '1,/pattern/ d'
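To make the difference between the two range forms concrete, here is a small sketch on made-up input (the lines and variable names are only illustrative):

```shell
input='one
two MATCH
three
four'

# From the first matching line to the end of input, match included.
from_match=$(printf '%s\n' "$input" | sed -n '/MATCH/,$p')

# Delete from line 1 through the first matching line, so the match is excluded.
after_match=$(printf '%s\n' "$input" | sed '1,/MATCH/d')

printf 'including match:\n%s\n\nexcluding match:\n%s\n' "$from_match" "$after_match"
```

One caveat with the second form: the end address is only tested from line 2 onward, so a match on line 1 itself does not close the range.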
Filtering lines when pattern matches between "text=" and "status=" can be done with a simple grep, no need for sed and cut:
$ grep 'text=.*pattern.* status='
You can use awk
awk '/pattern/,EOF'
n.b. don't be fooled: EOF is just an uninitialized variable, and by default 0 (false). So that condition cannot be satisfied until the end of file.
Perhaps this could be combined with all the previous answers using awk as well.
Maybe this is what you actually want? Find lines matching "pattern" and extract the field after text= up through just before status=?
zcat file* | sed -e '/pattern/s/.*text=\(.*\)status=[^/]*/\1/'
You are not revealing what pattern actually is -- if it's a variable, you cannot use single quotes around it.
Notice that \(.*\)status=[^/]* would match up through survstatus=new in your example. That is probably not what you want? There doesn't seem to be a status= followed by a slash anywhere -- you really should explain in more detail what you are actually trying to accomplish.
Your question title says "all line after a match" so perhaps you want everything after text=? Then that's simply
sed 's/.*text=//'
i.e. replace up through text= with nothing, and keep the rest. (I trust you can figure out how to change the surrounding script into zcat file* | sed '/pattern/s/.*text=//' ... oops, maybe my trust failed.)
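If the pattern lives in a shell variable, the sed script needs double quotes so the variable is expanded. A minimal sketch, assuming a hypothetical search term and two shortened, made-up log lines; the term must not contain / or unescaped regex metacharacters:

```shell
# Hypothetical search term.
pattern='811'

# Only the first line has the term between text= and status=;
# the second has it in INFO, which should not count.
log='[01/09/2015 00:00:47] INFO=1 from=steve text=hi my number is 0 811 22 1/12 status=new survstatus=new
[01/09/2015 00:01:02] INFO=811000 from=bob text=hello there status=new survstatus=new'

# Print whole lines whose text field contains the term.
hits=$(printf '%s\n' "$log" | sed -n "/text=.*${pattern}.* status=/p")
printf '%s\n' "$hits"
```

Because the address is anchored between text= and status=, the full log line is printed without a separate sed and cut step.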
The seldom used branch command will do this for you. Until you match, use n for next then branch to beginning. After match, use n to skip the matching line, then a loop copying the remaining lines.
cat file | sed -n -e ':start; /pattern/b match;n; b start; :match n; :copy; p; n ; b copy'
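Some sed implementations read a label up to the end of the line, so the same idea can be written with one command per line. This is a sketch rather than a drop-in replacement: MATCH stands in for your pattern, and the function name after_match is made up.

```shell
# Print every line after the first line matching MATCH (the match itself is skipped).
after_match() {
  sed -n '
    /MATCH/ {
      :copy
      n
      p
      b copy
    }
  ' "$@"
}

rest=$(printf '%s\n' 'one' 'two MATCH' 'three' 'four' | after_match)
printf '%s\n' "$rest"
```

Once the first match is seen, the loop keeps reading and printing lines until n runs out of input, at which point sed quits.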
Your pipeline is:
zcat file* | sed -e 's/.*text=\(.*\)status=[^/]*/\1/' | cut -f 1 - | grep "pattern"
Instead, change the last two segments of your pipeline:
zcat file* | sed -e 's/.*text=\(.*\)status=[^/]*/\1/' | awk '$1 ~ "pattern" {print $0}'

sed for removing trailing zeroes - regex - nongreedy

I have a file which has few lines as below
ABCD|100.19000|90.100|1000.000010|SOMETHING
BCD|10.100|90.1|100.019900|SOMETHING
Now, after applying sed on this, I would like the output to be as below (To use it for further processing)
ABCD|100.19|90.1|1000.00001|SOMETHING
BCD|10.1|90.1|100.0199|SOMETHING
i.e. I would like all the trailing zeros (the ones before the |) to be removed from the result.
I tried the following: (regtest is the file containing the original data as shown above)
cat regtest | sed 's/|\([0-9]*\)\.\([0-9]*\)0*|/|\1\.\2|/g'
Did not work as I think it's greedy.
cat regtest | sed 's/|\([0-9]*\)\.\([0-9]*\)0|/|\1\.\2|/g'
Will work. But, I will have to apply this sed command repeatedly on the same file to remove the zeros one after another. Does not make sense.
How can I go about it? Thanks!
$ echo "ABCD100|100.19000|90.100|1000.000010|STH" | \
sed -r -e 's/\|/||/g' -e 's/(\|[0-9.]+[1-9])0+\|/\1|/g' -e 's/\|\|/|/g'
ABCD100|100.19|90.1|1000.00001|STH
If you want to depend on the | following the zeroes to be removed
cat regtest | sed -r 's/(00*)(\|)/\2/g'
If you want to remove zeroes not trailed by a . or a digit
cat regtest | sed -r 's/(00*)([^.0-9])/\2/g'
(Note I'm using the 00* instead of 0+ to avoid unique features of GNU sed not available in other versions)
Edit: answer to comment request for removing trailing zeroes only between a decimal point and a pipe:
cat regtest | sed -r 's/(\.[1-9])*(00*)(\|)/\1\3/g'
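The expression above makes the decimal part optional ((\.[1-9])*), so it also strips zeros from integer fields such as 100|. A sketch that requires a decimal point followed by at least one non-zero digit, assuming a sed that accepts -E; the function name strip_zeros is made up:

```shell
# Remove trailing zeros from decimal fields that are followed by a pipe.
# Group 1 keeps the dot and the digits up to the last non-zero digit.
strip_zeros() {
  sed -E 's/(\.[0-9]*[1-9])0+\|/\1|/g' "$@"
}

cleaned=$(printf '%s\n' \
  'ABCD|100.19000|90.100|1000.000010|SOMETHING' \
  'BCD|10.100|90.1|100.019900|SOMETHING' |
  strip_zeros)
printf '%s\n' "$cleaned"
```

Fields with an all-zero fraction such as 5.000 are left unchanged by this sketch, and integer fields are never touched.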
Using Perl's extended regular expressions
perl -pe 's{\.\d*?\K0*(\||$)}{$1}g'
This removes zeroes that occur between (a dot and optionally some digits) and (a pipe or the end of the line).