Deleting strings with sed regex

I would like to do a string replacement on the command line. I can do this in Python, but it would be easier for my workflow if I just do this in Unix. Currently I'm trying to get this to work with sed.
I am trying to delete any information surrounded by single quotes. Inside the quotes, I have varying combinations of letters, numbers, spaces, dashes, square brackets, underscores, and semicolons.
Here's an example...
(214016:0.13461,814430:0.04526)'o__stuff; f__[morestuff-123]':0.03063
In Python, I can do this...
import re
line = "(214016:0.13461,814430:0.04526)'o__stuff; f__[morestuff-123]':0.03063"
print(re.sub(r"\'[ \w;\-\[\]]+\'", "", line))
Which correctly prints...
(214016:0.13461,814430:0.04526):0.03063
I'm now trying to do this with sed, which hasn't worked out for me so far. I've been trying to work with this tutorial, which has been helpful. Here's what I've got...
sed "s/\'[-[:alnum:] ;\[\]]+\'//g" file.txt
This doesn't work. Any thoughts on what is wrong?
Thanks for any help!

This might work for you (GNU sed):
sed 's/'\''[^'\'']*'\''//g' file
N.B. the sequence '\'' is a shell idiom for a literal single quote inside a single-quoted string: it closes the quotes, adds an escaped ', and reopens them.
sed "s/'[^']*'//g" file
works too.

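You need to put the dash first or last in the bracket expression; a dash between two characters defines a character range, even when one of them is a backslash.
Similarly, to match a literal right square bracket, put it first (after any negation). In traditional regex, a backslash is just a literal backslash inside a bracket expression, and you disambiguate by putting any special characters (dash, square brackets) first or last.
Two more things trip up the original command. Without -E, sed uses basic regular expressions, where + is a literal plus sign (use -E, or \+ in GNU sed). And in GNU sed, \' is an anchor for the end of the pattern space, so the single quotes should not be backslash-escaped. The bracket expression also needs an underscore, since [:alnum:] does not include it.
Oh, and lose the Useless Use of cat;
sed -E "s/'[][[:alnum:] ;_-]+'//g" file.txt
Do you really need to replace multiple occurrences per line? If not, the /g flag is superfluous (but mostly harmless).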
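A quick way to check the bracket-expression rules above is to run both approaches against the sample line from the question. This is only a sketch: it assumes a sed that accepts -E for extended regular expressions (GNU and BSD sed do), and the variable names are made up for illustration.

```shell
# Sample line from the question.
line="(214016:0.13461,814430:0.04526)'o__stuff; f__[morestuff-123]':0.03063"

# Negated class: delete a quote, everything up to the next quote, and that quote.
by_negation=$(printf '%s\n' "$line" | sed "s/'[^']*'//g")

# Explicit class: ] first, - last, no backslashes inside the brackets.
by_class=$(printf '%s\n' "$line" | sed -E "s/'[][[:alnum:] ;_-]+'//g")

printf '%s\n%s\n' "$by_negation" "$by_class"
```

Both commands should leave the same line with the quoted part removed. The negated class is the more forgiving of the two, since it does not need to list every character that may appear between the quotes.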

Related

Using Variables with Regex that contain a space (\s) and sed

I'm trying to create a sort script using literal string variables, regex, and sed in bash. I cannot seem to match literal strings that contain spaces when using variables, although I can match them when using the regex directly. So:
#!/bin/bash
group1="IRISHFHD"
group2="REGIONAL FHD"
sed -i '/group-title="'${group1}/',+1d' JWLINE.m3u
sed -i '/group-title="'${group2}/',+1d' JWLINE.m3u
I've tried adding \s into the group variable, but it doesn't work.
John
The problem has nothing to do with regex, it's all down to how the shell treats variables' values. When a variable is expanded without double-quotes around it (i.e. ${group2}), the shell will split it into "words" based on whitespace. It'll also try to expand any words that contain shell wildcards into lists of matching files, and several regex metacharacters look like shell wildcards, which can cause serious chaos.
In this example:
sed -i '/group-title="'${group2}/',+1d' JWLINE.m3u
It's a little more complicated, because the variable reference is in between two single-quoted sections. In this case, the part before the variable reference gets attached to the first "word" in the variable, and the part after gets attached to the last word. Essentially, it expands into the equivalent of this:
sed -i '/group-title="REGIONAL' 'FHD/,+1d' JWLINE.m3u
                               ^ That's a space between arguments
Anyway, since it gets split on the whitespace, sed gets two partial arguments instead of one whole one, and it doesn't work at all.
Solution: as in almost all situations, you should have double-quotes around the variable reference to prevent weird effects like this. There are a few options for this. You could just add double-quotes around the variable part:
sed -i '/group-title="'"${group2}"/',+1d' JWLINE.m3u
...but IMO this is confusing; some of those quotes are syntactic (i.e. parsed by the shell), and one is literal (passed to sed as part of the regex), and it's not obvious which are which. I'd prefer to just use double-quotes around the whole thing, and escape the double-quote that's supposed to be literal:
sed -i "/group-title=\"${group2}/,+1d" JWLINE.m3u
                     ^^ Escape makes this " a literal part of the argument.
(In double-quotes, you'd also need to escape any dollar signs, backslashes, or backticks that were supposed to be literal parts of the argument. But in this case, there aren't any of those.)
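The splitting described above can be observed without touching any file. The sketch below assumes a POSIX-style shell such as bash with the default IFS; count_args is a hypothetical helper that only reports how many arguments it received.

```shell
group2="REGIONAL FHD"

# Hypothetical helper: report how many arguments the shell passed.
count_args() { echo "$#"; }

# Unquoted expansion: the value is split at the space, so two arguments arrive.
unquoted=$(count_args '/group-title="'${group2}/',+1d')

# Double-quoted expansion: the whole sed script stays one argument.
quoted=$(count_args "/group-title=\"${group2}/,+1d")

echo "unquoted: $unquoted, quoted: $quoted"
```

With the unquoted form sed would see two arguments, the first an incomplete script and the second treated as a file name, which is why the original command fails.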

reuse last matched character of regex in sed

Many of you with a certain leaning towards proper formatting will know the pain of having a lot of space characters instead of a tab character at the beginning of indented lines after another person edited a file and added lines. I seem to be unable to teach my colleagues how to use vim's integrated line pasting function, so I'm searching for some simple ways to automatically correct lines beginning with a certain pattern. ;)
I'm using a regex to find the corresponding lines, but I can't work out how to "reuse" the last matched character in sed when using "find and replace". The regex matching the lines is
'^\ *[A-Z]'
I would like to replace those space characters, but keep the uppercase letter. My idea would be something like
sed 's|^\ *[A-Z]|\t$|g'
or so, but I guess that would replace the whole line with a single tab character since $ usually matches the line ending?
Is there a simple way to reuse parts of the matched regex in sed?
How about simply not including the first non-space character in the match in the first place?
This matches all spaces at the beginning of a line:
^ *
Edit (quote from the comments):
obviously I don't want to replace spaces in front of other characters than uppercase letters
A look-ahead could do that, but unfortunately sed does not support them. But you can use the next best thing, an address that determines which lines sed operates on (an address that uses a delimiter other than / must start with a backslash):
sed '\|^ *[A-Z]| s|^ *|\t|'
Of course a back-reference would do it as well:
sed 's|^ *\([A-Z]\)|\t\1|'
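To see the back-reference at work, here is a small sketch on made-up input. It assumes nothing beyond a POSIX sed: the tab is passed in through a shell variable instead of \t, because \t in the replacement is a GNU extension, and LC_ALL=C keeps [A-Z] limited to uppercase ASCII letters. The sample text and variable names are hypothetical.

```shell
tab=$(printf '\t')   # a literal tab, so the sketch does not rely on \t support in sed

# Hypothetical sample: only the first line starts with an uppercase letter.
fixed=$(printf '    Hello\n    world\n' |
  LC_ALL=C sed "s|^ *\([A-Z]\)|${tab}\1|")

printf '%s\n' "$fixed"
```

The first line should come out with a single leading tab, while the second keeps its four spaces because its first non-space character is lowercase.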

Escaping backslashes where they appear in a file when they don't always appear non-escaped

I've got a peculiar situation, I'm trying to import a CSV file into Weka and I've run into problems with Weka's apparently extremely poor ability to handle strings in a sanitary manner.
I'm already using sed to remove all non-ASCII characters, but now I've run into a problem dealing with backslashes. The input I have contains escaped backslashes in some fields and non-escaped backslashes (which Weka cannot handle correctly) in others.
What I need is a regular expression that will find backslashes that are not preceded or followed by a backslash and add a second backslash. I'm having a real hard time making the syntax work and was wondering if someone could help me out.
Try the following - sed 's/\\\\/##_#/g; s/\\/\\\\/g; s/##_#/\\\\/g'
It replaces escaped backslashes with a token first, then escapes the remaining single backslashes, and finally changes the tokens back to escaped backslashes.
Select a token that's not going to exist in the file.
echo 'asdfj\lasdf\\asldf\oweur\\lasjd;lf\\lasjfl\asdfsdf' | \
sed 's/\\\\/##_#/g; s/\\/\\\\/g; s/##_#/\\\\/g'
Results:
asdfj\\lasdf\\asldf\\oweur\\lasjd;lf\\lasjfl\\asdfsdf
Another option - sed 's/\([^\\]\)\(\\\)\([^\\]\)/\1\\\\\3/g'
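The two suggestions do not appear to be equivalent on every input. As far as I can tell, the one-pass expression consumes the character after each backslash, so a second lone backslash that follows immediately is skipped, and a backslash at the very start or end of a line has no neighbour to match. The sketch below compares both on a made-up sample; the function names are hypothetical wrappers around the commands from the answer.

```shell
# Token approach from the answer, wrapped in a hypothetical helper.
escape_with_token() {
  sed 's/\\\\/##_#/g; s/\\/\\\\/g; s/##_#/\\\\/g'
}

# One-pass alternative from the answer.
escape_one_pass() {
  sed 's/\([^\\]\)\(\\\)\([^\\]\)/\1\\\\\3/g'
}

sample='a\b\c'   # two lone backslashes separated by a single character
token_result=$(printf '%s\n' "$sample" | escape_with_token)
one_pass_result=$(printf '%s\n' "$sample" | escape_one_pass)

printf '%s\n%s\n' "$token_result" "$one_pass_result"
```

On this sample the token approach doubles both backslashes, while the one-pass version doubles only the first, so the token approach looks like the safer default.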

How to read this command to remove all blanks at the end of a line

I happened across this page full of super useful and rather cryptic vim tips at http://rayninfo.co.uk/vimtips.html. I've tried a few of these and I understand what is happening enough to be able to parse it correctly in my head so that I can possibly recreate it later. One I'm having a hard time getting my head wrapped around though are the following two commands to remove all spaces from the end of every line
:%s=  *$== : delete end of line blanks
:%s= \+$== : Same thing
I'm interpreting %s as string replacement on every line in the file, but after that I am getting lost in what looks like some gnarly variation of :s and regex. I'm used to seeing and using :s/regex/replacement. But the above is super confusing.
What do the above commands mean in English, step by step?
The regex delimiters don't have to be slashes, they can be other characters as well. This is handy if your search or replacement strings contain slashes. In this case I don't know why they use equal signs instead of slashes, but you can pretend that the equals are slashes:
:%s/  *$//
:%s/ \+$//
Does that make sense? The first one searches for a space followed by zero or more spaces, and the second one searches for one or more spaces. Each one is anchored at the end of the line with $. And then the replacement string is empty, so the spaces are deleted.
I understand your confusion, actually. If you look at :help :s you have to scroll down a few pages before you find this note:
*E146*
Instead of the '/' which surrounds the pattern and replacement string, you
can use any other character, but not an alphanumeric character, '\', '"' or
'|'. This is useful if you want to include a '/' in the search pattern or
replacement string. Example:
:s+/+//+
I do not know vim syntax, but it looks to me like these are sed-style substitution operators. In sed, the / (in s/REGEX/REPLACEMENT/) can be uniformly replaced with any other single character. Here it appears to be =. So if you mentally replace = with /, you'll get
:%s/  *$//
:%s/ \+$//
which should make more sense to you.
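The same delimiter freedom can be tried outside vim with sed on the command line. This is a small sketch on made-up input; the variable names are only for illustration.

```shell
# Same substitution written with two different delimiters.
with_slashes=$(printf 'foo   \nbar baz \n' | sed 's/ *$//')
with_equals=$(printf 'foo   \nbar baz \n' | sed 's= *$==')

printf '%s\n' "$with_equals"
```

Both variables should hold the same text with the trailing spaces removed and the space inside "bar baz" left alone.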

Match single character between Start string and End string

I can't seem to understand regular expression at all. How can I match a character which resides between a START and END string. For Example
#START-EDIT
#ValueA=0
#ValueB=1
#END-EDIT
I want to match any # which is between the #START-EDIT and #END-EDIT.
Specifically I want to use sed to replace the matches # values with nothing (delete them) on various files which may or may not have multiple START-EDIT and END-EDIT sections.
^#START-EDIT.*(#) *. *#END-EDIT$
sed is line based. You can easily search and replace with a regex within one line, but a single regex that spans several lines takes more work. Address ranges or AWK might do the trick.
If the whole section is on one line, the following substitution, which blanks everything from #START-EDIT to #END-EDIT on that line, could be what you are looking for
sed -e 's/^#START-EDIT.*#END-EDIT$//' myInput.txt
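For the multi-line layout shown in the question, a sed address range is one possible route. The sketch below is not the answer's method. It assumes the markers sit alone on their lines exactly as #START-EDIT and #END-EDIT, and that the marker lines themselves should keep their #. The function name is hypothetical.

```shell
# Hypothetical helper: strip the leading # from lines between the markers.
uncomment_edit_blocks() {
  sed '/^#START-EDIT$/,/^#END-EDIT$/{
    /^#START-EDIT$/b
    /^#END-EDIT$/b
    s/^#//
  }' "$@"
}

# Made-up sample with one commented line outside the block.
result=$(printf '#keep\n#START-EDIT\n#ValueA=0\n#ValueB=1\n#END-EDIT\n#tail\n' |
  uncomment_edit_blocks)

printf '%s\n' "$result"
```

The range selects every block from a start marker to the next end marker, so files with several START-EDIT/END-EDIT sections are handled in one pass. The two b commands skip the marker lines before the substitution runs.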