First of all, let me please clarify that I know absolutely nothing about regular expressions, but I need to write a "Tagger Script" for MusicBrainz Picard so that it doesn't mess with the way I format certain aspects of my tracks' titles.
Here's what I need to do:
- Find all sub-strings inside parentheses
- Then, for those matches that meet a given criterion, and those matches only, change the parentheses to brackets
For example, consider this string:
DJ Fresh - Louder (Sian Evans) (Flux Pavilion & Doctor P Remix)
It needs to be changed like so:
DJ Fresh - Louder (Sian Evans) [Flux Pavilion & Doctor P Remix]
The condition is that if the string within the parentheses contains the sub-string "dj" or "mix" or "version" or "inch", etc... then the parentheses surrounding it need to be changed to brackets.
So, the question is:
Is it possible to create a single regular expression that can perform this operation?
Thank you very much in advance.
Assuming there are no nested brackets, you can use the following regex to search for the text:
(?i)\((?=[^()]*(?:dj|mix|version|inch))([^()]+)\)
Note that the regex is case-insensitive, due to (?i) in front - make it case-sensitive by removing it.
Check the syntax of your language to see if you can use r prefix, e.g. r'literal_string', to specify literal string.
And use the following as replacement:
[$1]
You can include more keywords by adding them to the (?:dj|mix|version|inch) part, separating each keyword with |. If a keyword contains (, ), [, ], |, ., +, ?, *, ^, $, \, {, or }, you need to escape those characters (I'm 99% sure the list is exhaustive). An easier way to think about it: if the keyword contains only spaces and alphanumeric characters, you can add it to the regex without side effects (but note that the number of spaces must match exactly).
Dissecting the regex:
(?i): Case-insensitive mode
\(: ( is a special character in regex, so it needs to be escaped by prepending \.
(?=[^()]*(?:dj|mix|version|inch)): Positive look-ahead (?=pattern):
[^()]*: I need to check that the keyword is inside the current bracket, not outside it or in some other bracket, so I use a negated character class [^characters] to avoid matching ( or ) and spilling outside the current bracket. The no-nesting assumption also comes into play here.
(?:dj|mix|version|inch): A list of keywords, in a non-capturing group (?:pattern). | means alternation.
([^()]+): The assumption about no nested brackets makes it easier to match all the characters inside the bracket. The text is captured for later replacement, since (pattern) is a capturing group, as opposed to (?:pattern).
\): ) is a special character in regex, so it needs to be escaped by prepending \.
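As a quick way to try the pattern outside Picard, here is a minimal sketch using Python's re module (chosen only for illustration). The function name bracketize is hypothetical, and Python writes the backreference as \1 where the answer's replacement uses $1.

```python
import re

# Pattern from the answer; extend the keyword alternation as needed.
PATTERN = re.compile(r"(?i)\((?=[^()]*(?:dj|mix|version|inch))([^()]+)\)")

def bracketize(title: str) -> str:
    """Swap (...) for [...] when the parenthesized text contains a keyword."""
    # Python spells the backreference \1 rather than $1.
    return PATTERN.sub(r"[\1]", title)

title = "DJ Fresh - Louder (Sian Evans) (Flux Pavilion & Doctor P Remix)"
result = bracketize(title)
print(result)
```

Only the second group changes, because "Remix" contains "mix" while "Sian Evans" contains none of the keywords.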
I can't figure this out. I want to capture the string inside the square brackets, with or without characters in it.
[5123512], [412351, 1235123, 5125123], [12312-AA] and []
I want to convert the square brackets into double quotes:
[5123512] ==> "5123512"
[412351, 1235123, 5125123] ==> "412351, 1235123, 5125123"
[12312-AA] ==> "12312-AA"
[] ==> ""
I tried \[\d+\] but it is not working.
This is my sample data; it's in JSON format.
Square brackets inside the description should not be changed, only those in the attributes.
{"results":
[{"listing": 4613456431,"sku": [5123512],"category":[412351, 1235123,
5125123],"subcategory": "12312-AA", "description":"This is [123sample]"}
{"listing": 121251,"sku":[],"category": [412351],"subcategory": "12312-AA",
"description": "product sample"}]}
TIA
Your regex doesn't work for three reasons:
[ is a meta-character that opens a character class. To match a literal [, you need to escape it with a backslash. ] is also a meta-character when it follows the [ meta-character, but if you escape the [ you shouldn't need to escape the ] (not that it hurts to do so).
\d only matches decimal digits, but your sample contains the letter A. If that's a hexadecimal digit, you will probably want to use [\dA-F] instead of \d, or [\dA-Fa-f] if the digits can appear in lowercase. If it can be any letter, you could use [\dA-Z] or [\dA-Za-z], depending on whether you need to match lowercase letters.
+ means "one or more occurrences", so it wouldn't match an empty []. Use the * "zero or more occurrences" quantifier instead.
Additionally, you probably need to capture the sequence of digits in a (capturing group) in order to be able to reference it in your replacement pattern.
However, as Andrew Morton suggests, it looks like you should be able to use a plain text search/replace.
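The three reasons listed in this answer can be checked with a short sketch in Python's re module (used here only for illustration). The relaxed pattern at the end is an assumption: it adds letters and a literal - to the class so that the 12312-AA sample matches, which goes slightly beyond what the answer spells out.

```python
import re

# Unescaped, [\d+] is a character class: one digit or a literal '+'.
unescaped_hits = re.findall(r"[\d+]", "[51]")

# Escaped, but \d+ still demands at least one digit and rejects letters.
strict = re.compile(r"\[\d+\]")
samples = ("[5123512]", "[12312-AA]", "[]")
strict_results = [bool(strict.fullmatch(s)) for s in samples]

# Assumed fix: allow letters and '-', zero or more times, and capture the content.
relaxed = re.compile(r"\[([\dA-Za-z-]*)\]")
relaxed_groups = [relaxed.fullmatch(s).group(1) for s in samples]

print(unescaped_hits, strict_results, relaxed_groups)
```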
First off, regex is a horrible tool for parsing JSON formatted data. I'm sure you'll find plenty of tools to simply read your JSON in vb.net and mangle it in simpler ways than taking it in as text... For example: How to parse json and read in vb.net
Original answer (edited slightly):
You're almost there, but here's a few things you need to change:
in your regex pattern, escape the square brackets: \[ and \]
if you only want to capture all characters in the brackets, then . is a good way to go
the plus sign + means "at least one" — if you want to match empty brackets too, use *? instead
the question mark means "lazy" — it explicitly tells the regex to match the shortest sequence of characters possible (instead of going over to the next square bracket...)
wrap the .*? in parentheses so that you can refer to that part later when substituting
finally, the output value / pattern to substitute with is \1 or $1, depending on the context
or "\1" or "$1" if you really need the double quotes in the output — maybe you just need a string variable?
All in all this becomes:
Find this: \[(.*?)\]
Replace with: \1
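For comparison, here is a minimal sketch of the same find/replace in Python's re module (the question is about VB.NET, so this is only an illustration). It uses the "\1" form of the replacement so the double quotes appear in the output, and it also shows what happens if the lazy ? is left out.

```python
import re

sample = "[5123512], [412351, 1235123, 5125123], [12312-AA] and []"

# Lazy .*? stops at the first closing bracket after each opening one.
quoted = re.sub(r"\[(.*?)\]", r'"\1"', sample)

# Greedy .* runs to the last closing bracket in the string instead.
greedy = re.sub(r"\[(.*)\]", r'"\1"', sample)

print(quoted)
print(greedy)
```

Note that this pattern rewrites every bracket pair it sees, including ones inside a description string, which is one more reason to prefer a real JSON parser for the sample data.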
Maybe it's a trivial question, but I have a problem with it. I have the following string:
,a1a,1a1,11,,aaa,,,a,84.34,"",ssd
I want to achieve the following result using regex:
"","a1a","1a1",11,"","aaa","","","a",84.34,"","ssd"
So I want everything between commas to be surrounded by quotes, except integers and floating point numbers. How can I do this using regex?
(*SKIP)(*F) Magic
In the demo, have a look at the replacements at the bottom.
This is a great task for preg_replace, because PCRE (the regex engine used by PHP) has a beautiful feature to skip certain content.
You can do it in one step with this lovely regex (see demo):
((?<=^|,)\d+(?:\.\d+)?(?:(?=,)|$)(*SKIP)(*F)|(?<=^|,)[^,]*(?:(?=,)|$))
Explanation
The outside parentheses capture everything to Group 1.
There are two parts to the regex, on each side of the | OR
The left side of the alternation | uses \d+(?:\.\d+)? to match these floats and integers you don't want. We use the lookbehind (?<=^|,) to make sure there is a comma behind (or the beginning of the string), and the (?:(?=,)|$) to check that what follows is a comma or the end of the string. After matching, we deliberately fail, after which the engine skips to the next position in the string.
The right side uses [^,]* to match anything that is not a comma, including an empty string, and we know it is the right content because it was not matched by the expression on the left. Again, we use lookarounds to check our position.
The replacement string '"\1"' embeds our match into double quotes.
How to use it in code:
$regex = "~((?<=^|,)\d+(?:\.\d+)?(?:(?=,)|$)(*SKIP)(*F)|(?<=^|,)[^,]*(?:(?=,)|$))~";
$replaced = preg_replace($regex,'"\1"',$string);
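Python's built-in re module does not support (*SKIP)(*F), so the following is only a sketch of the same "match what you don't want first, then leave it alone" idea using a replacement callback. The names FIELD and quote_fields are hypothetical. Two deliberate differences from the PHP pattern: the variable-width lookbehind (?<=^|,) is replaced by (?<![^,]), which Python accepts, and an extra branch for already-quoted fields is added as an assumption so that the "" in the sample stays as it is.

```python
import re

FIELD = re.compile(
    r'(?<![^,])'           # start of string, or just after a comma
    r'(?:\d+(?:\.\d+)?'    # a number: matched but not captured
    r'|"[^"]*"'            # assumption: an already-quoted field, not captured
    r'|([^,]*))'           # anything else (possibly empty) goes to group 1
    r'(?![^,])'            # end of string, or just before a comma
)

def quote_fields(line: str) -> str:
    def repl(m: re.Match) -> str:
        # Group 1 is None when a "skip" branch matched: return it unchanged.
        if m.group(1) is None:
            return m.group(0)
        return '"' + m.group(1) + '"'
    return FIELD.sub(repl, line)

line = ',a1a,1a1,11,,aaa,,,a,84.34,"",ssd'
result = quote_fields(line)
print(result)
```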
Here's another variant:
$regex = '/(?<![^,])(?!"[^"]*")(?![-+]?[0-9]*\.?[0-9]+\b)[^,]*+(?![^,])/';
$result = preg_replace($regex, '"$0"', $subject);
In somewhat more readable form:
(?<![^,])
(?!
"[^"]*"
|
[-+]?[0-9]*\.?[0-9]+\b
)
[^,]*+
(?![^,])
The major points of interest are:
The negative lookbehind (?<![^,]) to match the leading delimiter (or absence thereof). You can read it as if there's a character before this position, it must not be non-comma. It isn't always possible to use this idiom, but I like it because it feels less clumsy than the more common (?<=^|,), and it doesn't waste a capturing group like the (^|,) idiom.
The negative lookahead (?![^,]) similarly acts as the ending anchor.
In the lookahead to prevent it matching already-quoted fields, I'm assuming I don't have to worry about escaped quotes. Those are easy enough to deal with, but first you need to know whether it uses backslashes ("a\"b\"c") or quotes ("a""b""c") to escape them.
The negative lookahead to prevent it matching a number uses a regex from RegexBuddy's library, and it's the loosest of several such regexes. If you need something more precise, it's available.
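This variant ports to Python's re module almost unchanged. A minimal sketch, with one assumption: the possessive [^,]*+ is written as plain [^,]* because possessive quantifiers are only available from Python 3.11, and the trailing (?![^,]) already forces the match to consume the whole field.

```python
import re

PATTERN = re.compile(
    r'(?<![^,])'                                  # leading delimiter or start
    r'(?!"[^"]*"|[-+]?[0-9]*\.?[0-9]+\b)'         # not quoted, not a number
    r'[^,]*'                                      # the field itself
    r'(?![^,])'                                   # trailing delimiter or end
)

subject = ',a1a,1a1,11,,aaa,,,a,84.34,"",ssd'
# A callback wraps the whole match, like '"$0"' in the PHP version.
result = PATTERN.sub(lambda m: '"' + m.group(0) + '"', subject)
print(result)
```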
The regular expression is as follows:
if ($ftxt =~ m|/([^=]+)="(.+)"|o)
{
.....
}
This regex seems different from many others. What confuses me is the "|": most regexes use "/" instead. The group ([^=]+) also confuses me. I know [^=] means "the start of the string" or "=", but what does it mean to repeat '^' one or more times? How do I explain this?
You can use different delimiters instead of /. For instance you could use:
m#/([^=]+)="(.+)"#o
Or
m~/([^=]+)="(.+)"~o
The advantage here of using something different than / is that you don't have to escape slashes, because otherwise, you'd have to use:
m/\/([^=]+)="(.+)"/o
(Here the slash is escaped as \/; it could also be written as [/].)
([^=]+) is a capture group, and inside, you have [^=]+. [^=] is a negated class and will match any character which is not a =.
^ behaves differently at the beginning of a character class and is not the same as ^ outside a character class which means 'beginning of line'.
As for the last part o, this is a flag which I haven't met so far so a little search brought me to this post, I quote:
The /o modifier is in the perlop documentation instead of the perlre documentation since it is a quote-like modifier rather than a regex modifier. That has always seemed odd to me, but that's how it is.
Before Perl 5.6, Perl would recompile the regex even if the variable had not changed. You don't need to do that anymore. You could use /o to compile the regex once despite further changes to the variable, but as the other answers noted, qr// is better for that.
Some regexp implementations allow you to use other special characters besides / as the delimiter. This is useful if you need to use that special character inside the regular expression itself, since you don't have to escape it. (In and of itself / is not a special character in regexp syntax, but it needs escaping if it's used in the regexp literal syntax of the host language.) The docs on Perl's quote operators mention this.
This is tutorial-level stuff: square brackets ([abc]) denote a character class, which means "any of the characters inside the brackets". (In my example, it means "either a or b or c".) Inside them, the ^ special character has a different meaning: it inverts the character class. So, [^=] means "any character except =", and [^=]+ means "one or more characters that aren't =".
Quoting the docs on Perl's RE syntax:
You can specify a character class, by enclosing a list of characters in [] , which will match any character from the list. If the first character after the "[" is "^", the class matches any character not in the list.
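The character-class rules quoted above behave the same way in most regex flavors. A small sketch in Python's re module (used only for illustration; the sample strings are made up):

```python
import re

# [abc] matches any single character that is a, b or c.
in_class = re.findall(r"[abc]", "cabbage")

# [^=]+ matches a run of one or more characters that are not '='.
key = re.match(r"[^=]+", 'height="30px"').group(0)

# Outside a class, ^ is an anchor: "^=" only matches '=' at the very start.
anchored = re.search(r"^=", "a=b")

print(in_class, key, anchored)
```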
It is meant to match equation like expressions, to capture the key and values separately. Imagine you have a statement like height="30px", and you want to capture the height attribute name, as well as its value 30px.
So you have m|/([^=]+)="(.+)"|.
The key is supposed to be everything before the = is encountered, so [^=]+ matches it. The ^ is a negation metacharacter when used as the first character inside [] brackets, so the class matches any character except =, which is what you want. The / in front of the group is a literal forward slash: because the pattern is delimited with | rather than /, it needs no escaping. If your input has no slash before the key, drop it and use just ([^=]+).
Next comes the = sign, which you don't care about, then the quotes that contain the value, which you capture with "(.+)". The .+ greedily matches every remaining character, including the final ". The regex then finds that it cannot match its own closing ", so it backtracks and gives up the last " that (.+) had consumed, leaving the string within the quotes captured in the group. Now you can access the key and value through $1 and $2. Cool, isn't it?
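To see the two capture groups and the greedy backtracking in action, here is a minimal sketch with the same pattern in Python's re module (Python has no pattern delimiters, so the | ... | wrapper and the o flag are dropped). The input strings are made-up examples that include the leading / the pattern expects.

```python
import re

PAIR = re.compile(r'/([^=]+)="(.+)"')

m = PAIR.search('/height="30px"')
key, value = m.group(1), m.group(2)   # Perl's $1 and $2

# Greedy .+ backtracks only as far as the LAST double quote on the line.
greedy_value = PAIR.search('/a="1" b="2"').group(2)

print(key, value, greedy_value)
```

The second call shows why (.+) can capture more than one value when a line holds several pairs; a class such as [^"]+ would stop at the first closing quote.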
I'm fairly new to regex and trying to figure out a pattern that will only match the instance of the word inside my custom tag.
In the example below both words match the condition of being after a | and before a ]
Pattern: (?=|)singleline(?=.*])
Sample: [if #sample|singleline second] <p>Another statement singleline goes here</p> [/if]
words that match the condition of being after a | and before a ]
The .*, which means "anything, zero or more times, and be greedy about it", will race to the end of the string and back up only enough to reach a ] (the last one). Also, what you intended as a lookbehind is written as a lookahead.
If you really want to match what you say you want to match (see the quote above), then this is it:
Pattern: (?<=|)(\w+)(?=])
Edit: or this one if you want to "match alphanumerics and spaces inside | and ]":
Pattern: (?<=|)([\w\s]+?)(?=])
(?=|) asserts that the next thing in the string is either nothing or nothing. That will always evaluate to true; it's always possible to match nothing. I think sweaver2112 is correct that you meant to use a lookbehind there, but you also need to escape the pipe: (?<=\|). Or just match a pipe in the normal way; I don't see any need to use lookarounds for that part.
The other part probably does need to be a lookahead, but you need to expand it a bit. You want to assert that the word is followed by a closing bracket, but not if there's an opening bracket first. Assuming the brackets are always correctly paired, that should mean the word is between a pair of them. Like this:
Pattern: \|singleline(?=[^\]\[]*\])
[^\]\[]*\] matches zero or more of any characters except ] or [, followed by a ]. The backslashes escaping the "real" brackets may or may not be necessary depending on the regex flavor, but escaping them is always safe.
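The difference between the greedy lookahead and the bracket-bounded one can be checked against the sample from the question. A minimal sketch in Python's re module (used only for illustration):

```python
import re

sample = ("[if #sample|singleline second] "
          "<p>Another statement singleline goes here</p> [/if]")

# Greedy lookahead: any later ']' satisfies it, so both words match.
greedy_hits = [m.start() for m in re.finditer(r"singleline(?=.*\])", sample)]

# Bounded lookahead: a ']' must come before any '[', so only the tag matches.
bounded_hits = [m.start() for m in re.finditer(r"singleline(?=[^\]\[]*\])", sample)]

# Full pattern from the answer, with the pipe matched literally.
tagged = re.search(r"\|singleline(?=[^\]\[]*\])", sample)

print(greedy_hits, bounded_hits, tagged.group(0))
```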
I am a regex supernoob (just reading my first articles about them), and at the same time working towards stronger use of vim. I would like to use a regex to search for all instances of a colon : that are not followed by a space and insert one space between those colons and any character after them.
If I start with:
foo:bar
I would like to end with
foo: bar
I got as far as %s/:[a-z] but now I don't know what to do for the next part of the %s statement.
Also, how do I change the :[a-z] statement to make sure it catches anything that is not a space?
:%s/:\(\S\)/: \1/g
\S matches any character that is not whitespace, but you need to remember what that non-whitespace character is. This is what the \(\) does. You can then refer to it using \1 in the replacement.
So you match a :, some non-whitespace character and then replace it with a :, a space, and the captured character.
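The same capture-and-reinsert idea can be tried outside Vim. A minimal sketch in Python's re module, where the group is written ( ) rather than \( \) and the function name add_space is hypothetical:

```python
import re

def add_space(text: str) -> str:
    # Capture the non-whitespace character after ':' and put it back after ": ".
    return re.sub(r":(\S)", r": \1", text)

fixed = add_space("foo:bar")
untouched = add_space("foo: bar")
doubled = add_space("a::b")   # the second ':' counts as the \S character

print(fixed, untouched, doubled)
```

The last case shows why runs of colons may need the stricter pattern.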
Changing this to only modify the text when there's only one : is fairly straightforward. As others have suggested, some of the zero-width assertions will be useful.
:%s/:\@<!:[^:[:space:]]\@=/: /g
:\@<! is a negative lookbehind: it succeeds at any position not preceded by a :, including the start of the line. This is an important characteristic of the negative lookahead/lookbehind assertions. It's not requiring that there actually be a character, just that there isn't a :.
: matches the required colon.
[^:[:space:]] introduces a couple more regex concepts.
The outer [] is a collection. A collection is used to match any of the characters listed inside. However, a leading ^ negates that match. So, [abc123] will match a, b, c, 1, 2, or 3, but [^abc123] matches anything but those characters.
[:space:] is a character class. Character classes can only be used inside a collection. [:space:] means, unsurprisingly, any whitespace. In most implementations, it relates directly to the result of the C library's isspace function.
Tying that all together, the collection means "match any character that is not a : or whitespace".
\@= is the positive lookahead assertion. It applies to the previous atom (in this case the collection) and means that the collection is required for the pattern to be a successful match, but will not be part of the text that is replaced.
So, whenever the pattern matches, we just replace the : with itself and a space.
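For readers more used to Perl-style syntax, here is a rough Python equivalent of this lookbehind/lookahead substitution. It is a sketch under one assumption: Python's \s stands in for Vim's [:space:]. The name LONE_COLON is hypothetical.

```python
import re

# (?<!:)     not preceded by ':'   (Vim's negative lookbehind)
# (?=[^:\s]) followed by something that is neither ':' nor whitespace
LONE_COLON = re.compile(r"(?<!:):(?=[^:\s])")

single = LONE_COLON.sub(": ", "foo:bar")
double = LONE_COLON.sub(": ", "a::b")       # runs of colons are left alone
spaced = LONE_COLON.sub(": ", "foo: bar")   # already has a space
trailing = LONE_COLON.sub(": ", "foo:")     # nothing after the colon

print(single, double, spaced, trailing)
```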
You want to use a zero-width negative lookahead assertion, which is a fancy way of saying: look for a colon whose next character is not a space, but don't include that character in the match:
:%s/: \@!/: /g
The \@! is the negative lookahead.
An interesting feature of Vim regex is the presence of \zs and \ze. Other engines might have them too, but they're not very common.
The purpose of \zs is to mark the start of the match, and \ze the end of it. For example:
ab\zsc
matches c, only if before you have ab. Similarly:
a\zebc
matches a only if you have bc after it. You can mix both:
a\zsb\zec
matches b only if in between a and c. You can also create zero-width matches, which are ideal for what you're trying to do:
:%s/:\zs\ze\S/ /g
Your search has no size, only a position. And then you substitute that position with " ". By the way, \S means any character except whitespace. The g flag makes the substitution apply to every occurrence on a line, not just the first.
:\zs\ze\S matches the position between a colon and something not a space.
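Python's re module has no \zs or \ze, but the same zero-width position can be described with a lookbehind and a lookahead. A minimal sketch for comparison (the name GAP is hypothetical; replacing empty matches this way relies on the behavior of Python 3.7 and later):

```python
import re

# A position with ':' just before it and a non-whitespace character just after.
GAP = re.compile(r"(?<=:)(?=\S)")

one = GAP.sub(" ", "foo:bar")
many = GAP.sub(" ", "a:b:c")
kept = GAP.sub(" ", "foo: bar")

print(one, many, kept)
```

Nothing is consumed by the match, so the replacement only inserts a space at each matching position.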
You probably want to use :[^ ] to match everything except spaces. As mentioned by Matt, this will cause your replace to overwrite the extra character.
There are several ways to avoid this, here are 2 that I find useful.
1) Surround the last part of the search term with parentheses \(\); this allows you to reference that part of the search in your replace term with \1.
Your final replace string should look like this:
%s/:\([^ ]\)/: \1/g
2) End the search term early with \ze. The entire search term must still be met for a match, but only the part before \ze will be highlighted or replaced.
Your final replace string should look like this:
%s/:\ze[^ ]/: /g