How can I remove hashes from inside a string?


I want to transform a line that looks like this:
any text #any text# ===#text#text#text#===#
into:
any text #any text# ===#texttexttext===#
As you can see above, I want to remove every # between the opening ===# and the closing ===#.
There can be any number of # characters to remove.
Can I do this with sed?

Give this a try:
sed 'h;s/[^=]*=*=#\(.*\)/\1/;s/\([^=]\)#/\1/g;x;s/\([^=]*=\+#\).*/\1/;G;s/\n//g' inputfile
It splits the line in two at the first "=#", then deletes all "#" that aren't preceded by an "=", then recombines the lines.
Let me know if there are specific cases where it fails.
Edit:
This version, which is increasingly fragile, works for your new example as well as the original:
sed 'h;s/[^=]*=[^=]*=*=#\(.*\)$/\1/;s/\([^=]\)#/\1/g;x;s/\([^=]*=[^=]*=\+#\).*/\1/;G;s/\n//g' inputfile

By default sed uses POSIX basic regular expressions (BRE; GNU sed adds a few extensions), which lack many features of newer regex engines, such as lookaround, which would be very handy here.
I'd say you would first have to match ===#\(.\+\)===# (note that in GNU BRE, capturing groups and the + quantifier are written with backslashes, and lazy quantifiers are not supported). Then remove any # found in the captured group (a literal search/replace would be enough), and put the result back into the string. But I'm not a Unix guy, so I don't know if/how that could be done in sed.
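The capture-then-clean idea described above can be sketched in Python, whose re module accepts a function as the replacement. This is only an illustration of the approach, not a sed solution; the helper name strip_inner_hashes is made up for the example, and it assumes the region is delimited by ===# on both sides.

```python
import re

# Lazy body so that several delimited regions on one line are handled separately.
_DELIMITED = re.compile(r"(===#)(.*?)(===#)")


def strip_inner_hashes(line):
    """Hypothetical helper: drop every '#' between a pair of '===#' markers."""

    def clean(match):
        opening, body, closing = match.groups()
        # A literal search/replace is enough inside the captured body.
        return opening + body.replace("#", "") + closing

    return _DELIMITED.sub(clean, line)


if __name__ == "__main__":
    print(strip_inner_hashes("any text #any text# ===#text#text#text#===#"))
```

Hashes outside the delimited region are left untouched because only the captured body is rewritten.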

Related

Regex to extract first 3 words from a string

I am trying to remove all the words except the first 3 from a string (using TextPad).
Example value: This is the string for testing.
I want to keep just the first 3 words, This is the, from the above string and remove all other words.
I figured out the regex to match the 3 words (\w+\s+){3} but I need to match all other words except the first 3 words and remove other words. Can someone help me with it?

Exactly how depends on the flavor, but to eliminate everything except the first three words, you can use:
^((?:\S+\s+){2}\S+).*
which matches the whole line but captures only the first three words, in capturing group 1. For your replacement string, use a reference to capturing group 1. In C# it might look like:
resultString = Regex.Replace(subjectString, @"^((?:\S+\s+){2}\S+).*", "${1}", RegexOptions.Multiline);
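For comparison, here is a rough Python equivalent of the same replacement. The function name keep_first_three_words is an assumption made for this sketch; the pattern is the one given above.

```python
import re

# Same pattern as above; MULTILINE makes ^ match at the start of every line.
FIRST_THREE = re.compile(r"^((?:\S+\s+){2}\S+).*", re.MULTILINE)


def keep_first_three_words(text):
    """Hypothetical helper: keep only the first three words of each line."""
    return FIRST_THREE.sub(r"\1", text)


if __name__ == "__main__":
    print(keep_first_three_words("This is the string for testing."))
```

Because \s also matches newlines, a line with fewer than three words can borrow words from the following line; replace \s with [ \t] if that matters.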

EDIT: Added the start-of-line anchor to each regex, and added TextPad specific flags.
If you want to eliminate the first three words, and capture the rest,
^(?:\w+\s+){3}([^\n\r]+)$
?: changes the first three words to a non-capturing group, and captures everything after it.
Is this what you're looking for? I'm not totally clear on your question, or your goal.
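As suggested, here's the opposite. Capture the first three words only, and discard the rest:
^((?:\w+\s+){3})(?:[^\n\r]+)$
The outer group wraps the whole repetition, so group 1 holds all three words. With the quantifier outside the capturing parentheses, as in (\w+\s+){3}, group 1 would hold only the last repetition.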
As far as replacing that captured group, what do you want it replaced with? To replace each word individually, you'd have to capture each word individually:
^(\w+)\s+(\w+)\s+(\w+)\s+(?:[^\n\r]+)$
And then, for instance, you could replace each with its first letter capitalized:
Replace with: \u$1 \u$2 \u$3
Result is This Is The
In TextPad, lowercase \u in the replacement means change only the next letter. Uppercase \U changes everything after it (until the next capitalization flag).
Try it:
http://fiddle.re/f3hgv
(press on [Java] or whatever language is most relevant. Note that \u is not supported by RegexPlanet.)

Coming from a duplicate question, I'll post a solution which works for "traditional" regex implementations which do not support the Perl extensions \s, \W, etc. Newcomers who are not familiar even with the fact that there are different dialects (aka flavors) of regular expressions are advised to read e.g. Why are there so many different regular expression dialects?
If you have POSIX class support, you can use [[:alnum:]_] for \w, [^[:alnum:]_] for \W, [[:space:]] for \s, etc. But if we suppose that whitespace will always be a space and you want to extract the first three tokens between spaces, you don't really need even that.
[^ ]+[ ]+[^ ]+[ ]+[^ ]+
matches three tokens separated by runs of spaces. (I put the spaces in brackets to make them stand out, and easy to extend if you want to include other characters than just a single regular ASCII space in the token separator set. For example, if your regex dialect accepts \t for tab, or you are able to paste a regular tab in its place, you could extend this to
[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+
In most shells, you can type a literal tab with ctrl+v tab, i.e. prefix it with an escape code, which is often typed by holding down the ctrl key and typing v.)
To actually use this, you might want to do
grep -Eo '[^ ]+[ ]+[^ ]+[ ]+[^ ]+' file
where the single quotes are necessary to protect the regex from the shell (double quotes would work here too, but protect less; the alternative is to backslash every character in the regex that the shell treats as a metacharacter), or perhaps
sed -r 's/([^ ]+[ ]+[^ ]+[ ]+[^ ]+).*/\1/' file
to replace every line with just the captured expression (the parentheses make a capturing group, which you can refer back to with \1 in the replacement part in the s command in sed). The -r option selects a slightly more featureful regex dialect than the bare-bones traditional sed; if your sed doesn't have it, try -E, or put a backslash before each parenthesis and plus sign.
Because of the way regular expressions work, the first three is easy because a regular expression engine will always return the first possible match on a line. If you want three tokens starting from the second, you have to put in a skip expression. Adapting the sed script above, that would be
sed -r 's/[^ ]+[ ]+([^ ]+[ ]+[^ ]+[ ]+[^ ]+).*/\1/'
where you'll notice how I put in a token+non-token group before the capture. (This is not really possible with grep -o unless you have grep -P in which case the full gamut of Perl extensions is available to you anyway.)
If your regex dialect supports {m,n} repetition, you can of course refactor the regex to use that. If you need a large number of repetitions, it's certainly both more readable and more maintainable. Just make sure you don't add parentheses where you break up the backreference order (the first left parenthesis creates the first group \1, the second \2, etc.)
sed -r 's/([^ ]+([ ]+[^ ]+){2}).*/\1/' file
Notice how the second parenthesized group is necessary to specify the scope of the {2} repetition (we want to repeat more than just the single character immediately before the left curly brace). The OP's attempt had an error where the repetition was specified outside of the last parenthesis; then, the back reference \1 (or whatever it's called in your dialect -- TextMate seems to use $1, just like Perl) will refer to the last single match of the capturing parentheses, because the repetition is not part of the capture, being outside the capturing parentheses.
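The difference between repeating a capturing group and capturing a repetition is easy to check in any engine with backreferences. Below is a small Python sketch; the helper group_one is a name invented for the illustration.

```python
import re


def group_one(pattern, text):
    """Hypothetical helper: return capture group 1 of a match at the start of text."""
    match = re.match(pattern, text)
    return match.group(1) if match else None


SAMPLE = "This is the string for testing."

# Quantifier outside the parentheses: group 1 is overwritten on every
# repetition, so only the last word (plus trailing space) survives.
last_only = group_one(r"(\w+\s+){3}", SAMPLE)

# Quantifier inside an outer capturing group: group 1 spans all repetitions.
all_three = group_one(r"((?:\w+\s+){3})", SAMPLE)

if __name__ == "__main__":
    print(repr(last_only))
    print(repr(all_three))
```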

Is there a smarter way to keep the indentation when replacing characters with a regex?

I want to replace the asterisks in a Markdown list with hyphens.
Example:
* 1.0
    * 1.1
    * 1.2
* 2
    * 2.1
    * 2.2
Currently I have a separate regex pattern for up to three levels of indentation set up in Keyboard Maestro for Mac.
I wonder if there isn't a smarter way to do this that addresses all kinds of indentation.

In many regular expression search and replace systems, you can refer to a parenthesized group in the regular expression in the replacement, using \1, \2, etc. to refer to each successive group. So for example, in sed you could do:
sed -e 's/\(^[\t ]*\)\*/\1-/'
I'm not sure if Keyboard Maestro gives you that option. It mentions that it uses ICU regular expressions; if it also uses their replacement options, then you can use $1, $2 etc. to refer to the captured groups.
If not, all is not lost. You can use a lookbehind assertion to match the whitespace before the asterisk without including that whitespace in the match (ICU requires lookbehind patterns to be bounded, so there the * may need to become a bounded repetition such as {0,20}); then just use a single dash as your replacement:
Search for: (?<=^[\t ]*)\*
Replace with: -

You can use submatching groups and reference them in the replacing string like this:
Regular expression matching your lines with list items: ([\t ]*)\*(.*)
The string used for replacement: \1-\2
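The capture-and-reinsert approach from these answers can be tried out in Python as a quick check. This is a sketch, not Keyboard Maestro syntax; the function name asterisks_to_hyphens is assumed for the example.

```python
import re

# Group 1 captures the leading indentation (any mix of tabs and spaces);
# MULTILINE lets ^ anchor at the start of every line.
BULLET = re.compile(r"^([\t ]*)\*", re.MULTILINE)


def asterisks_to_hyphens(markdown):
    """Hypothetical helper: swap leading '*' bullets for '-' at any indent depth."""
    return BULLET.sub(r"\1-", markdown)


if __name__ == "__main__":
    print(asterisks_to_hyphens("* 1.0\n    * 1.1\n    * 1.2\n* 2"))
```

Only an asterisk that is the first non-blank character on its line is replaced, so asterisks inside the item text are left alone.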

How do I write a SED regex to extract a string delimited by another string?

I am using GNU sed version 4.2.1 and I am trying to write a non-greedy sed regex to extract a string that is delimited by two other strings. This is easy when the delimiting strings are single-character:
s:{\([^}]*\)}:\1:g
In that example the string is delimited by '{' on the left and '}' on the right.
If the delimiting strings are multiple characters, say '{{{' and '}}}' I can adjust the above expression like this:
s:{{{\([^}}}]*\)}}}:\1:g
so the centre expression matches anything not containing the '}}}' closing string. But this only works if the match string does not contain '}' at all. Something like:
{{{cannot match {this broken} example}}}
will not work but
{{{can match this example}}}
does work. Of course
s:{{{\(.*\)}}}:\1:g
always works but is greedy so isn't suitable where multiple patterns occur on the same line.
I understand [^a] to mean anything except a and [^ab] to mean anything except a or b so, despite it appearing to work, I don't think [^}}}] is the correct way to exclude that sequence of 3 consecutive characters.
So how do I write a regex for sed that matches a string delimited by two other strings?

You are correct that [^}}}] doesn't work. A negated character class matches anything that is not one of the characters inside it. Repeating characters doesn't change the logic. So what you wrote is the same as [^}]. (It is easy to see why this works when there are no braces inside the expression).
In Perl and compatible regular expressions, you can use ? to make a * or + non-greedy:
s:{{{(.*?)}}}:$1:g
This will always match the first }}} after the opening {{{.
However, this is not possible in sed. In fact, I don't think there is any way of doing this match in sed. The only other way to do this is to use advanced features like look-ahead, which sed also does not have.
You can easily use Perl in a sed-like fashion with the -pe options, which cause it to take a single line of code from the command line (-e) and automatically loop over each line and print the result (-p).
perl -pe 's:{{{(.*?)}}}:$1:g'
The -i option for in-place editing of files is also useful, but make sure your regex is correct first!
For more information see perlrun.
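The same lazy quantifier exists in Python's re module, so the behaviour is easy to verify there. A minimal sketch, with unwrap_triple_braces as an invented name:

```python
import re

# Braces are escaped so they are always read as literals; .*? stops at the
# first closing }}} that follows each opening {{{.
TRIPLE = re.compile(r"\{\{\{(.*?)\}\}\}")


def unwrap_triple_braces(line):
    """Hypothetical helper: strip {{{ and }}} around each delimited region."""
    return TRIPLE.sub(r"\1", line)


if __name__ == "__main__":
    print(unwrap_triple_braces("{{{cannot match {this broken} example}}} {{{second}}}"))
```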

With sed you could do something like:
sed -e :a -e 's/\(.*\){{{\(.*\)}}}/\1\2/ ; ta'
With:
{{{can match this example}}} {{{can match this 2nd example}}}
This gives:
can match this example can match this 2nd example
It is not lazy matching, but by replacing from right to left we can make use of sed's greediness.
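To see why the loop works, the same greedy pattern can be applied repeatedly in Python, removing one pair of delimiters per pass, starting with the rightmost. This is a sketch of the technique only; strip_right_to_left is a hypothetical name.

```python
import re

# Greedy on purpose: the leading (.*) swallows as much as possible, so the
# {{{ that finally matches is the LAST one on the line.
GREEDY = re.compile(r"(.*)\{\{\{(.*)\}\}\}")


def strip_right_to_left(line):
    """Hypothetical helper mimicking sed's ':a; s/.../; ta' loop."""
    while True:
        # count=1 performs a single substitution per pass, like sed's s command.
        line, replaced = GREEDY.subn(r"\1\2", line, count=1)
        if replaced == 0:
            return line


if __name__ == "__main__":
    print(strip_right_to_left(
        "{{{can match this example}}} {{{can match this 2nd example}}}"
    ))
```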

more robust regular expression lookaround

This is the input string: $table_prefix = 'wp5t3s1tc_'; which is part of a larger config file.
I want to match anything between the ''
The expression I have working is (?<=\$table_prefix(\s{2}=\s\'))(.*)?(?=\') which is not great because of the brittle way the lookaround works with the whitespace character either side of the =. If the config file changes with multiple spaces either side of the = then the expression won't work.
I am thinking it should look more like (?<=\$table_prefix(\s*\=\s*\'))(.*)?(?=\') but that of course does not work.
Could someone briefly explain a more elegant way of doing this match?

Here's a possible solution using grep. It is not very elegant, but it should be robust if you are concerned about variable spaces around the =.
Since variable-length lookbehind assertions are not allowed in grep -P (PCRE), AFAIK, the only thing I can think of is to perform the extraction in two stages:
grep -oP '(?<=\$table_prefix).*(?='"'"')' file_name | grep -oP '(?<='"'"').*'
I'm basically capturing all the spaces around the = first, along with 'wp5t3s1tc_, and then extracting everything after the '. The weird '"'"' is to escape the single quote character.
Or you could use sed instead of the second grep:
grep -oP '(?<=\$table_prefix).*(?='"'"')' file_name | sed 's/ *= *'"'"'//'

You don't need lookaround at all, as long as you are guaranteed that the ' character won't appear in the sequence you are trying to match. You can use a greedy match on a negated character class, [^']*, which matches the longest run of characters that does not contain '.
To extract only the text inside the single quotes, use a named group (or an unnamed group if your engine does not support named ones; in that case, you access the group by its index instead of by name).
This regular expression does what you seek:
\$table_prefix\s*=\s*'(?<match>[^']*)';
Check with http://rubular.com/
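As a quick check outside Rubular, here is the same idea in Python. Note that Python spells named groups (?P<name>...) rather than (?<name>...); the function name extract_table_prefix is made up for this sketch.

```python
import re

# \s* on both sides of '=' tolerates any amount of whitespace, and [^']*
# cannot run past the closing quote.
TABLE_PREFIX = re.compile(r"\$table_prefix\s*=\s*'(?P<match>[^']*)';")


def extract_table_prefix(config_text):
    """Hypothetical helper: return the quoted value, or None if not found."""
    found = TABLE_PREFIX.search(config_text)
    return found.group("match") if found else None


if __name__ == "__main__":
    print(extract_table_prefix("$table_prefix  = 'wp5t3s1tc_';"))
```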

sed: Can my pattern contain an "is not" character? How do I say "is not X"?

How do I say "is not" a certain character in sed?

[^x]
This is a character class that accepts any character except x.

For those not satisfied with the selected answer as per johnny's comment.
'su[^x]' will match 'sum' and 'sun' but not 'su'.
You can tell sed to not match lines with x using the syntax below:
sed '/x/! s/su//' file
See kkeller's answer for another example.
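The two meanings of "is not x" discussed here can be contrasted in a short Python sketch. Both function names are assumptions made for the illustration; the second one mirrors the sed command above by editing only lines that contain no x.

```python
import re


def has_su_then_non_x(text):
    """[^x] must consume a character, so a bare 'su' cannot match."""
    return re.search(r"su[^x]", text) is not None


def remove_su_unless_x(lines):
    """Rough analogue of sed '/x/! s/su//': edit only lines without an 'x'."""
    return [
        line if "x" in line else line.replace("su", "", 1)
        for line in lines
    ]


if __name__ == "__main__":
    print([has_su_then_non_x(word) for word in ("sum", "sun", "su", "sux")])
    print(remove_su_unless_x(["sum", "sux", "su"]))
```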

There are two possible interpretations of your question. Like others have already pointed out, [^x] matches a single character which is not x. But an empty string also isn't x, so perhaps you are looking for [^x]\|^$.
Neither of these answers extend to multi-character sequences, which is usually what people are looking for. You could painstakingly build something like
[^s]\|s\($\|[^t]\|t\($\|[^r]\)\)
to compose a regular expression which doesn't match str, but a much more straightforward solution in sed is to delete any line which does match str, then keep the rest;
sed '/str/d' file
Perl 5 introduced a much richer regex engine, whose syntax has since become the de facto standard in Java, PHP, Python, etc. Because Perl helpfully supports a subset of sed syntax, you could probably convert a simple sed script to Perl to get to use a useful feature from this extended regex dialect, such as negative assertions:
perl -pe 's/(?:(?!str).)+/not/' file
will replace a string which is not str with not. The (?:...) is a non-capturing group (unlike in many sed dialects, an unescaped parenthesis is a metacharacter in Perl) and (?!str) is a negative assertion; the text immediately after this position in the string mustn't be str in order for the regex to match. The + repeats this pattern until it fails to match. Notice how the assertion needs to be true at every position in the match, so we match one character at a time with . (newbies often get this wrong, and erroneously only assert at e.g. the beginning of a longer pattern, which could however match str somewhere within, leading to a "leak").
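The difference between asserting at every position and asserting only once can be seen in Python, which supports the same negative lookahead syntax. This is a sketch; both function names are invented for the example.

```python
import re

# The assertion is re-checked before every single character that is consumed.
PER_CHARACTER = re.compile(r"(?:(?!str).)+")

# The assertion is checked once, at the start; .+ may then run across 'str'.
LEAKY = re.compile(r"(?!str).+")


def replace_first_run(text):
    """Hypothetical helper: replace the first run that contains no 'str'."""
    return PER_CHARACTER.sub("not", text, count=1)


def replace_first_run_leaky(text):
    """Same idea with the assertion only at the start, for comparison."""
    return LEAKY.sub("not", text, count=1)


if __name__ == "__main__":
    print(replace_first_run("abcstrdef"))
    print(replace_first_run_leaky("abcstrdef"))
```

The per-character version stops just before str, while the leaky version swallows the whole line.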

From my own experience, and the below post supports this, sed doesn't support normal regex negation using "^". I don't think sed has a direct negation method...but if you check the below post, you'll see some workarounds.
Sed regex and substring negation

In addition to all the provided answers, you can negate a POSIX character class in sed using the notation [^[:CLASS:]]; for example, [^[:blank:]] will match any character that is not a space or a tab.
