I am a regex supernoob (just reading my first articles about them), and at the same time working towards stronger use of vim. I would like to use a regex to search for all instances of a colon : that are not followed by a space and insert one space between those colons and any character after them.
If I start with:
foo:bar
I would like to end with
foo: bar
I got as far as %s/:[a-z] but now I don't know what to do for the next part of the %s statement.
Also, how do I change the :[a-z] statement to make sure it catches anything that is not a space?
:%s/:\(\S\)/: \1/g
\S matches any character that is not whitespace, but you need to remember what that non-whitespace character is. This is what the \(\) does. You can then refer to it using \1 in the replacement.
So you match a :, some non-whitespace character and then replace it with a :, a space, and the captured character.
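If it helps to see the same capture-and-reinsert idea outside Vim, here is a rough sketch in Python. It uses Python's re module, not Vim's engine, so groups are written ( ) instead of \( \), and the function name is made up for the example.

```python
import re

def space_after_colon(text):
    # Capture the non-whitespace character after the colon so that it
    # can be put back with \1 in the replacement.
    return re.sub(r":(\S)", r": \1", text)

print(space_after_colon("foo:bar"))
```

A colon that is already followed by a space is left alone, because \S cannot match the space.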
Changing this to only modify the text when there's only one : is fairly straightforward. As others have suggested, using some of the zero-width assertions will be useful.
:%s/:\@<!:[^:[:space:]]\@=/: /g
:\@<! is a negative lookbehind: it succeeds wherever the preceding character is not a :, including at the start of the line. This is an important characteristic of the negative lookahead/lookbehind assertions. They do not require that there actually be a character, just that there isn't a :.
: matches the required colon.
[^:[:space:]] introduces a couple more regex concepts.
The outer [] is a collection. A collection is used to match any of the characters listed inside. However, a leading ^ negates that match. So, [abc123] will match a, b, c, 1, 2, or 3, but [^abc123] matches anything but those characters.
[:space:] is a character class. Character classes can only be used inside a collection. [:space:] means, unsurprisingly, any whitespace. In most implementations, it relates directly to the result of the C library's isspace function.
Tying that all together, the collection means "match any character that is not a : or whitespace".
\@= is the positive lookahead assertion. It applies to the previous atom (in this case the collection) and means that the collection is required for the pattern to be a successful match, but will not be part of the text that is replaced.
So, whenever the pattern matches, we just replace the : with itself and a space.
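As a cross-check, the lookbehind/lookahead combination can be sketched in Python. This is an approximation, not the Vim command itself: Python writes the assertions as (?<!...) and (?=...), and \s stands in for [:space:]. The names below are hypothetical.

```python
import re

# (?<!:)      negative lookbehind: no colon immediately before
# :           the colon itself
# (?=[^:\s])  positive lookahead: next char is neither a colon nor whitespace
SINGLE_COLON = re.compile(r"(?<!:):(?=[^:\s])")

def space_single_colons(text):
    # Only the colon is part of the match, so only it is replaced.
    return SINGLE_COLON.sub(": ", text)

print(space_single_colons("foo:bar std::cout"))
```

Doubled colons such as std::cout are left untouched, since each colon in the pair fails one of the two assertions.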
You want to use a zero-width negative lookahead assertion, which is a fancy way of saying look for a character that's not a space, but don't include it in the match:
:%s/: \@!/: /g
The \@! is the negative lookahead.
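A small Python sketch of the same negative lookahead, assuming Python's (?!...) syntax in place of Vim's \@! (the function name is invented for the example). It also shows one edge case worth knowing about: "not followed by a space" is true at the end of the text too.

```python
import re

def pad_colons(text):
    # (?! ) succeeds when the next character is not a space, which
    # includes the case where there is no next character at all.
    return re.sub(r":(?! )", ": ", text)

print(repr(pad_colons("foo:bar")))
print(repr(pad_colons("end:")))
```

If trailing colons should stay as they are, a positive lookahead for a non-whitespace character is the safer choice.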
An interesting feature of Vim regex is the presence of \zs and \ze. Other engines might have them too, but they're not very common.
The purpose of \zs is to mark the start of the match, and \ze the end of it. For example:
ab\zsc
matches c, only if before you have ab. Similarly:
a\zebc
matches a only if you have bc after it. You can mix both:
a\zsb\zec
matches b only if in between a and c. You can also create zero-width matches, which are ideal for what you're trying to do:
:%s/:\zs\ze\S/ /g
Your search has no size, only a position. And then you substitute that position with " ". By the way, \S means any character except whitespace.
:\zs\ze\S matches the position between a colon and something not a space.
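Python's re module has no \zs or \ze, but the "match a position, not text" idea can be approximated with a lookbehind and a lookahead. This is a sketch under that assumption, with a made-up function name.

```python
import re

def insert_space(text):
    # (?<=:) and (?=\S) consume nothing, so the match is the empty
    # position between a colon and a non-whitespace character.
    return re.sub(r"(?<=:)(?=\S)", " ", text)

print(insert_space("foo:bar baz: qux"))
```

Because nothing is consumed, the replacement is just the space itself; neither the colon nor the following character needs to be repeated.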
You probably want to use :[^ ] to match everything except spaces. As mentioned by Matt, this will cause your replace to replace the extra character.
There are several ways to avoid this, here are 2 that I find useful.
1) Surround the last part of the search term with parentheses \(\); this allows you to reference that part of the search in your replace term with a \1.
Your final replace string should look like this:
%s/:\([^ ]\)/: \1/g
2) End the search term early with \ze. This means that the entire search term must be met for a match, but only the part before \ze will be highlighted or replaced.
Your final replace string should look like this:
%s/:\ze[^ ]/: /g
My organization has an in-house language, with syntax like:
cmo/create/mo1///tri
createpts/brick/xyz/2,2,2/0.,0.,0./1.,1.,1./1,1,1
I am writing a Vim syntax file, and would like to capture the first instance of a word enclosed by two characters (in this case, /), without capturing the characters themselves.
I.e., the regex would capture, from the lines above,
create
brick
My solution so far is to use this pattern:
[,/=" "].\{-}[,/=" "]
But from /this/and/this/and/this, it will capture /this/and/this/and/this/.
As you can see, the issue is two-fold: (i) my current solution is greedy, and (ii) captures the / characters as well, when I just want the words enclosed by these.
Thanks!
One possible solution:
^[^\/]\+\/\zs[^\/]\+\ze\/
^ anchor the search to the BOL,
[^\/]\+ one or more non-slash characters, as many as possible,
\/ a slash,
\zs start the match here,
[^\/]\+ one or more non-slash characters, as many as possible,
\ze end the match here,
\/ a slash.
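To try the pattern outside a Vim syntax file, here is a rough Python equivalent. Python has no \zs/\ze, so this sketch uses a capture group for the part between the first two slashes; the names are hypothetical.

```python
import re

# Command name, a slash, then the first slash-delimited word (captured),
# then the slash that closes it.
FIRST_FIELD = re.compile(r"^[^/]+/([^/]+)/")

def first_field(line):
    m = FIRST_FIELD.match(line)
    return m.group(1) if m else None

for line in ["cmo/create/mo1///tri",
             "createpts/brick/xyz/2,2,2/0.,0.,0./1.,1.,1./1,1,1"]:
    print(first_field(line))
```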
Not quite sure how to go about this, but basically what I want to do is match a character, say a for example. In this case all of the following would not contain matches (i.e. I don't want to match them):
aa
aaa
fooaaxyz
Whereas the following would:
a (obviously)
fooaxyz (this would only match the letter a part)
My knowledge of RegEx is not great, so I am not even sure if this is possible. Basically what I want to do is match any single a that has any other non a character around it (except for the start and end of the string).
^[^\sa]*\Ka(?=[^\sa]*$)
\K discards the previously matched characters, and the lookahead asserts whether a match is possible or not. So the above matches only the letter a which satisfies the conditions.
OR
a{2,}(*SKIP)(*F)|a
You may use a combination of a lookbehind and a lookahead:
(?<!a)a(?!a)
Details
(?<!a) - a negative lookbehind that fails the match if, immediately to the left of the current location, there is an a char
a - an a char
(?!a) - a negative lookahead that fails the match if, immediately to the right of the current location, there is an a char.
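The lookaround pair works unchanged in Python's re module, so here is a quick way to check it against the strings from the question. This is just a test sketch; the variable names are invented.

```python
import re

# An "a" with no "a" directly before it and no "a" directly after it.
LONE_A = re.compile(r"(?<!a)a(?!a)")

for s in ["a", "fooaxyz", "aa", "aaa", "fooaaxyz"]:
    print(s, LONE_A.findall(s))
```

Only the first two strings produce a match; every a in the other three has another a next to it.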
You need two things:
a negated character class: [^a] (all except "a")
anchors (^ and $) to ensure that the limits of the string are reached (in other words, that the pattern matches the whole string and not only a substring):
Result:
^[^a]*a[^a]*$
Once you know there is only one "a", you can extract, replace, or remove it in whatever way suits the language you use.
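A minimal Python sketch of this whole-string check, assuming re.fullmatch as a stand-in for the ^ and $ anchors (the function name is made up). Note that this answers a slightly different question from the lookaround approach: it accepts a string only if it contains exactly one a in total.

```python
import re

def has_exactly_one_a(s):
    # fullmatch requires the pattern to cover the whole string,
    # which plays the role of the ^ and $ anchors.
    return re.fullmatch(r"[^a]*a[^a]*", s) is not None

for s in ["a", "fooaxyz", "aa", "abca", "xyz"]:
    print(s, has_exactly_one_a(s))
```

So "abca" is rejected here even though each of its two a characters is isolated.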
Maybe it's a trivial question, but I have a problem with it. I have the following string:
,a1a,1a1,11,,aaa,,,a,84.34,"",ssd
I want to achieve the following result using a regex:
"","a1a","1a1",11,"","aaa","","","a",84.34,"","ssd"
So I want everything between the commas to be surrounded by quotes, except integers and floating-point numbers. How can I do this using a regex?
(*SKIP)(*F) Magic
This is a great task for preg_replace, because PCRE (the regex engine used by PHP) has a beautiful feature to skip certain content.
You can do it in one step with this lovely regex:
((?<=^|,)\d+(?:\.\d+)?(?:(?=,)|$)(*SKIP)(*F)|(?<=^|,)[^,]*(?:(?=,)|$))
Explanation
The outside parentheses capture everything to Group 1.
There are two parts to the regex, on each side of the | OR
The left side of the alternation | uses \d+(?:\.\d+)? to match these floats and integers you don't want. We use the lookbehind (?<=^|,) to make sure there is a comma behind (or the beginning of the string), and the (?:(?=,)|$) to check that what follows is a comma or the end of the string. After matching, we deliberately fail, after which the engine skips to the next position in the string.
The right side uses [^,]* to match anything that is not a comma, including an empty string, and we know it is the right content because it was not matched by the expression on the left. Again, we use lookarounds to check our position.
The replacement string '"\1"' embeds our match into double quotes.
How to use it in code:
$regex = "~((?<=^|,)\d+(?:\.\d+)?(?:(?=,)|$)(*SKIP)(*F)|(?<=^|,)[^,]*(?:(?=,)|$))~";
$replaced = preg_replace($regex,'"\1"',$string);
Here's another variant:
$regex = '/(?<![^,])(?!"[^"]*")(?![-+]?[0-9]*\.?[0-9]+\b)[^,]*+(?![^,])/';
$result = preg_replace($regex, '"$0"', $subject);
In somewhat more readable form:
(?<![^,])
(?!
"[^"]*"
|
[-+]?[0-9]*\.?[0-9]+\b
)
[^,]*+
(?![^,])
The major points of interest are:
The negative lookbehind (?<![^,]) to match the leading delimiter (or absence thereof). You can read it as if there's a character before this position, it must not be non-comma. It isn't always possible to use this idiom, but I like it because it feels less clumsy than the more common (?<=^|,), and it doesn't waste a capturing group like the (^|,) idiom.
The negative lookahead (?![^,]) similarly acts as the ending anchor.
In the lookahead to prevent it matching already-quoted fields, I'm assuming I don't have to worry about escaped quotes. Those are easy enough to deal with, but first you need to know whether it uses backslashes ("a\"b\"c") or quotes ("a""b""c") to escape them.
The negative lookahead to prevent it matching a number uses a regex from RegexBuddy's library, and it's the loosest of several such regexes. If you need something more precise, it's available.
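For anyone who wants to experiment with this variant outside PHP, here is a rough Python port. It is a sketch with two stated assumptions: the possessive [^,]*+ is written as a plain [^,]* (older versions of Python's re module have no possessive quantifiers, and the trailing lookahead forces the same match here), and the names are invented for the example.

```python
import re

FIELD = re.compile(
    r'''
    (?<![^,])                    # start of string or just after a comma
    (?!"[^"]*")                  # skip fields that are already quoted
    (?![-+]?[0-9]*\.?[0-9]+\b)   # skip fields that begin with a number
    [^,]*                        # the field itself (may be empty)
    (?![^,])                     # end of string or just before a comma
    ''',
    re.VERBOSE,
)

def quote_fields(line):
    # \g<0> is the whole match, like $0 in the PHP replacement.
    return FIELD.sub(r'"\g<0>"', line)

print(quote_fields(',a1a,1a1,11,,aaa,,,a,84.34,"",ssd'))
```

The same caveats as above apply: escaped quotes inside fields are not handled, and the number test is deliberately loose.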
I'm writing a perl script and part of it requires that I match all occurrences of a certain pattern in a string. Naturally, a regular expression seems like it would be powerful enough, but I just can't get it right for this particular string.
A hypothetical example of the type of text the regex might be applied to would be:
1cat;2dog;!3monkey;!4horse;
As you can see, several data entries (1cat, 2dog, etc.) are present in the line, delimited by semicolons. The beginning of the line contains no semicolon, but the end does. I want to be able to match all the stuff which hasn't been not'ed by the !. In the above example, 1cat and 2dog would be matched and returned in list context, while 3monkey and 4horse would not.
What I have tried to do so far is use negative lookbehinds to notice only the entries without a !. Something like this:
m/(?<!\!)(\w+)\;/g
However, this doesn't work because, for every !'ed entry, the regex just matches what comes after the first character, up to the semicolon. In the example, 1cat and 2dog are captured, but then so are monkey and horse.
I feel like this is easily doable, but I'm new to regular expressions and I can't think of anything else.
Throw a word boundary (\b) in there and you should be good:
(?<!!)\b(\w+);
As you could tell your negative lookbehind was working, but it would still match everything after the next character (horse from !4horse). A word boundary is a zero-width assertion, kind of like a conditional that doesn't match anything (like anchors ^ and $). It asserts for this: (^\w|\w\W|\W\w|\w$). In other words, anytime a word character ([a-zA-Z0-9_]) is next to the beginning/end of string or a non-word character.
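The effect of the word boundary is easy to verify. Here is a small sketch in Python rather than Perl (the pattern syntax is the same for these constructs; the variable names are invented), comparing the original attempt with the fixed version.

```python
import re

text = "1cat;2dog;!3monkey;!4horse;"

# Original attempt: the lookbehind alone lets a match start one
# character into a negated entry.
without_boundary = re.findall(r"(?<!!)(\w+);", text)

# With \b the match may only start at the beginning of a word.
with_boundary = re.findall(r"(?<!!)\b(\w+);", text)

print(without_boundary)
print(with_boundary)
```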
The regex is as follows:
if ($ftxt =~ m|/([^=]+)="(.+)"|o)
{
.....
}
This regex looks different from many other regexes. What confuses me is the "|": most regexes use "/" instead of "|". The group ([^=]+) also confuses me. I know [^=] means "the start of the string" or "=", but what does it mean to repeat '^' one or more times? How should this be read?
You can use different delimiters instead of /. For instance you could use:
m#/([^=]+)="(.+)"#o
Or
m~/([^=]+)="(.+)"~o
The advantage here of using something different than / is that you don't have to escape slashes, because otherwise, you'd have to use:
m/\/([^=]+)="(.+)"/o
Here the slash inside the pattern is escaped as \/ (it could also be written as [/]).
([^=]+) is a capture group, and inside, you have [^=]+. [^=] is a negated class and will match any character which is not a =.
^ behaves differently at the beginning of a character class and is not the same as ^ outside a character class which means 'beginning of line'.
As for the last part o, this is a flag which I haven't met so far so a little search brought me to this post, I quote:
The /o modifier is in the perlop documentation instead of the perlre documentation since it is a quote-like modifier rather than a regex modifier. That has always seemed odd to me, but that's how it is.
Before Perl 5.6, Perl would recompile the regex even if the variable had not changed. You don't need to do that anymore. You could use /o to compile the regex once despite further changes to the variable, but as the other answers noted, qr// is better for that.
Some regexp implementations allow you to use other special characters besides / as the delimiter. This is useful if you need to use that special character inside the regular expression itself, since you don't have to escape it. (In and of itself / is not a special character in regexp syntax, but it needs escaping if it's used in the regexp literal syntax of the host language.) The docs on Perl's quote operators mention this.
This is tutorial-level stuff: square brackets ([abc]) denote a character class - it means "any of the characters inside the brackets". (In my example, it means "either a or b or c".) Inside them, the ^ special character has a different meaning: it inverts the character class. So, [^=] means "any character except =", and [^=]+ means "one or more characters that aren't =".
Quoting the docs on Perl's RE syntax:
You can specify a character class, by enclosing a list of characters in [] , which will match any character from the list. If the first character after the "[" is "^", the class matches any character not in the list.
It is meant to match equation like expressions, to capture the key and values separately. Imagine you have a statement like height="30px", and you want to capture the height attribute name, as well as its value 30px.
So you have m|/([^=]+)="(.+)"|.
The key is supposed to be everything before the = is encountered, so [^=]+ matches it. The ^ is a negation metacharacter when used as the first character inside [] brackets, so [^=] matches any character except =, which is what you want. The leading / matches a literal slash; because the delimiters here are |, it does not need to be escaped. If the text you are matching has no slash before the key, drop it. The parentheses around [^=]+ form the capture group, ([^=]+).
Next comes the = sign, which you don't care about, then the quotes that contain the value, which you capture with "(.+)". The .+ greedily matches every remaining character, including the final ". The regex then finds that it can't match its own closing ", so it backtracks and gives up the last " that (.+) had consumed, which leaves the string within the quotes in the group. Now you are ready to access the key and value through $1 and $2. Cool, isn't it?
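The capture and the greedy backtracking can be seen in a short sketch. This is Python rather than Perl, so there are no delimiters and the groups are read with groups() instead of $1 and $2; the sample strings are made up, with a leading slash added so that the pattern as written can match.

```python
import re

PAIR = re.compile(r'/([^=]+)="(.+)"')

# Simple case: one key and one quoted value.
key, value = PAIR.search('/height="30px"').groups()
print(key, value)

# Greedy .+ runs to the end and backtracks only to the LAST quote,
# so with two pairs on a line the second group swallows both.
greedy = PAIR.search('/a="1" b="2"').group(2)
print(greedy)
```

If several pairs can share a line, a tighter value pattern such as [^"]+ is a common way to stop at the first closing quote.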