What's the meaning of this perl regex expression? - regex

The regex is as follows:
if ($ftxt =~ m|/([^=]+)="(.+)"|o)
{
.....
}
This regex seems different from many others. What confuses me is the "|": most regexes use "/" instead. The group ([^=]+) also confuses me. I know [^=] means "the start of the string" or "=", but what does it mean to repeat '^' one or more times? How should I read this?

You can use different delimiters instead of /. For instance you could use:
m#/([^=]+)="(.+)"#o
Or
m~/([^=]+)="(.+)"~o
The advantage here of using something different than / is that you don't have to escape slashes, because otherwise, you'd have to use:
m/\/([^=]+)="(.+)"/o
(The \/ is the escaped slash; alternatively, write it as [/].)
([^=]+) is a capture group, and inside, you have [^=]+. [^=] is a negated class and will match any character which is not a =.
^ behaves differently at the beginning of a character class and is not the same as ^ outside a character class which means 'beginning of line'.
As for the trailing o, this is a flag I had not come across before; a quick search brought me to this post, which I quote:
The /o modifier is in the perlop documentation instead of the perlre documentation since it is a quote-like modifier rather than a regex modifier. That has always seemed odd to me, but that's how it is.
Before Perl 5.6, Perl would recompile the regex even if the variable had not changed. You don't need to do that anymore. You could use /o to compile the regex once despite further changes to the variable, but as the other answers noted, qr// is better for that.

Some regexp implementations allow you to use other special characters besides / as the delimiter. This is useful if you need to use that special character inside the regular expression itself, since you don't have to escape it. (In and of itself / is not a special character in regexp syntax, but it needs escaping if it's used in the regexp literal syntax of the host language.) The docs on Perl's quote operators mention this.
This is tutorial-level stuff: square brackets ([abc]) denote a character class, meaning "any of the characters inside the brackets". (In my example, it means "either a or b or c".) Inside them, the ^ special character has a different meaning: it inverts the character class. So [^=] means "any character except =", and [^=]+ means "one or more characters that aren't =".
Quoting the docs on Perl's RE syntax:
You can specify a character class, by enclosing a list of characters in [] , which will match any character from the list. If the first character after the "[" is "^", the class matches any character not in the list.

It is meant to match equation-like expressions and capture the key and the value separately. Imagine you have a statement like height="30px", and you want to capture the attribute name height as well as its value 30px.
So you have m|/([^=]+)="(.+)"|.
The key is supposed to be everything before the = is encountered, so [^=]+ captures it. The ^ is a negation metacharacter when used as the first character inside [] brackets, so [^=] matches any character except =, which is what you want. The / in front is a literal slash: because the delimiter here is |, it needs no escaping, and it means the key must be preceded by a / in the text (so the example above would have to appear as /height="30px"). The parentheses that follow are not escaped, so ([^=]+) is a capture group rather than a literal parenthesis; to match a literal ( you would have to write \(.
Next comes the = sign, which you don't care about, then the quotes that contain the value, which you capture with "(.+)". The .+ greedily matches every remaining character, including the final ". The regex then finds that it cannot match its own closing ", so it backtracks and gives up the last " that (.+) captured, which leaves the string within the quotes in the group. Now you can access the key and value through $1 and $2. Cool, isn't it?
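For readers who want to try the two capture groups outside Perl, here is a minimal JavaScript sketch of the same pattern. The helper name parseAttr and the sample strings are made up for illustration; JavaScript has no alternate delimiters, so the leading slash is written as \/.

```javascript
// Hypothetical helper: applies the pattern from the question to one string.
function parseAttr(text) {
  // \/ is the literal slash, ([^=]+) the key, "(.+)" the quoted value.
  const m = text.match(/\/([^=]+)="(.+)"/);
  return m ? { key: m[1], value: m[2] } : null;
}

console.log(parseAttr('/height="30px"')); // key: height, value: 30px
console.log(parseAttr('/a="1" b="2"'));   // key: a, value: 1" b="2 (greedy .+)
console.log(parseAttr('height="30px"'));  // null, because there is no leading slash
```

The second call shows the greedy behaviour described above: (.+) only gives back the very last quote, so everything between the first and the last " ends up in the value.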

Related

Check array syntax with Regex

I'm trying to create a regex that checks if a string is a valid path for Firestore document.
I want a regex that tests whether a string:
starts with a lowercase letter ^([a-z]{1})
after the first character, contains only letters/digits and/or dots \w*(.?\w+){0,}
may end with an array index (\[{1}\d+\]{1})?$
The first and second parts work well, but the last group doesn't: when I test a string like data.images[11, the regex returns true.
First of all, you can shorten some quantifiers in your regex:
{1} -> can be ignored completely
{0,} -> *
Your second part could be expressed like this, which also helps readability:
[\w.]* meaning: take any character inside the brackets zero or more times. A bracket expression also supports predefined classes, which is why \w can be used here. The dot INSIDE the brackets doesn't need to be escaped; it simply means a literal dot.
So your parts would be:
^([a-z])
[\w.]*
(\[\d+\])?$
I hope this helps. According to regexpal it matches data.images[11], but not data.images[11. Also it seems to support all your demands.
EDIT:
Your second part doesn't work because (like Asocia stated in the answer) you would need to escape the dot. The dot itself is a class meaning "any character" (depending on regex engine and settings sometimes even line breaks). As you mean the dot as a character you need to escape it.
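To see the difference concretely, the sketch below runs the question's original pattern and the simplified one against a few sample paths in JavaScript. The variable names and the third sample path are illustrative assumptions, not taken from the question.

```javascript
// Original pattern from the question: the unescaped "." matches any character,
// so ".?" can swallow the "[" and "\w+" the digits after it.
const original = /^([a-z]{1})\w*(.?\w+){0,}(\[{1}\d+\]{1})?$/;
// Simplified pattern assembled from the three parts listed in the answer.
const simplified = /^([a-z])[\w.]*(\[\d+\])?$/;

for (const path of ['data.images[11]', 'data.images[11', 'data.images']) {
  console.log(path, original.test(path), simplified.test(path));
}
// data.images[11] true true
// data.images[11 true false
// data.images true true
```

Note that [\w.]* is somewhat more permissive than the description in the question: it also accepts underscores, uppercase letters, and consecutive or trailing dots, so tighten it if those should be rejected.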

Regular expression to match single quotes, double quotes and/or space

I have a regular expression looking for width=["|\']([^"]*)["|\']
It works great when looking for width="750" and width='750'; however, it does not match width=750.
So I got as far as width=["|\']?([^"]*)["|\'] for an optional first quote, but the match just continues on and does not return just 750.
If you are using a tool or language that supports backreferences, you should be able to use the following:
width=("|'|)(\S*)\1
This will try to match a single quote, double quote, or empty string with the first capture group, and then the \1 at the end will be whatever the first group captured. The value will always be the contents from the second capture group.
I also changed the [^"]* to \S* so this will match any number of non-whitespace characters. This is necessary to make sure that your match doesn't just run to the end of the string when there are no quotes around the value.
Example: http://rubular.com/r/Xg8ageZmgy
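As a quick check of the backreference approach, here is a small JavaScript sketch. The variable name widthRe and the third sample string are assumptions made for the example.

```javascript
// Group 1: the opening quote (", ' or nothing); group 2: the value;
// \1 then requires the same quote (or nothing) again after the value.
const widthRe = /width=("|'|)(\S*)\1/;

for (const s of ['width="750"', "width='750'", 'width=750 height=10']) {
  const m = s.match(widthRe);
  console.log(s, '->', m && m[2]); // each line ends with: -> 750
}
```

In the quoted cases \S* first grabs the closing quote as well, then backtracks by one character so that \1 can match it.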
Character classes ([]) do not make use of | to mean or; they automatically or everything. You also don't have to escape the single quote (unless of course you're enclosing this whole expression in single quotes). You want:
["' ]?([^"' ]*)["' ]
Try this one:
width\s*=\s*(?:["\']([^"\']*)["\']|\S+)
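I just added the \S+ as an OR alternative to handle an unquoted value after the equals sign. Note that an unquoted value ends up in the overall match rather than in group 1. Also, you do not need to place | inside the character class [].
\s* means optional whitespace (zero or more times).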
Which regular expression language are you using? Different languages have different details of syntax, so someone might give you an answer that works in their environment but not in yours.
For example, I copied your expression and tried it on some text in Emacs. It found a match in this text:
width=|750|
That's because Emacs regex doesn't use the '|' character to signify "either or" within the '[' and ']' brackets; it interprets it as just one more example of a character that the expression might match.
Also, it looks like your expression doesn't always stop after the 750 in this example:
width='750'
Instead, if there is a '"' character later in the input, it matches everything from the 750 up to that character. (It did the same thing with my earlier example in Emacs if there was a '"' later in the input.)
You will also match the 750 in this (note the mismatched quotation marks):
width='750"
Is that a problem, or is that an acceptable outcome?

Perform substitution on regex results, but only on a given condition

First of all, let me please clarify that I know absolutely nothing about regular expressions, but I need to write a "Tagger Script" for MusicBrainz Picard so that it doesn't mess with the way I format certain aspects of my tracks' titles.
Here's what I need to do:
- Find all sub-strings inside parentheses
- Then, for those matches that meet a given criterion and those matches only, change the parentheses to brackets
For example, consider this string:
DJ Fresh - Louder (Sian Evans) (Flux Pavilion & Doctor P Remix)
It needs to be changed like so:
DJ Fresh - Louder (Sian Evans) [Flux Pavilion & Doctor P Remix]
The condition is that if the string within the parentheses contains the sub-string "dj" or "mix" or "version" or "inch", etc... then the parentheses surrounding it need to be changed to brackets.
So, the question is:
Is it possible to create a single regex expression that can perform this operation?
Thank you very much in advance.
Assuming there are no nested brackets, you can use the following regex to search for the text:
(?i)\((?=[^()]*(?:dj|mix|version|inch))([^()]+)\)
Note that the regex is case-insensitive, due to (?i) in front - make it case-sensitive by removing it.
Check the syntax of your language to see if you can use a raw-string prefix, e.g. r'literal_string', so that the backslashes in the pattern are not interpreted by the language itself.
And use the following as replacement:
[$1]
You can include more keywords by adding keywords to (?:dj|mix|version|inch) part, each keyword separated by |. If the keyword contains (, ), [, ], |, ., +, ?, *, ^, $, \, {, } you need to escape them (I'm 99% sure the list is exhaustive). An easier way to think about it is: if the keyword only contains space and alphanumeric (but note that the number of spaces is strict), you can add them into the regex without causing side-effect.
Dissecting the regex:
- (?i): Case-insensitive mode.
- \(: ( is a special character in regex, so it must be escaped by prepending \.
- (?=[^()]*(?:dj|mix|version|inch)): Positive look-ahead (?=pattern), made up of:
  - [^()]*: The keyword must be inside the current pair of parentheses, not outside it or in some other pair, so a negated character class [^characters] excludes ( and ) and keeps the look-ahead from spilling past the current pair. The no-nesting assumption also comes into play here.
  - (?:dj|mix|version|inch): A list of keywords, in a non-capturing group (?:pattern). | means alternation.
- ([^()]+): The assumption about no nested brackets makes it easy to match all the characters inside the parentheses. The text is captured for later replacement, since (pattern) is a capturing group, as opposed to (?:pattern).
- \): ) is a special character in regex, so it must be escaped by prepending \.
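The search-and-replace above can be sketched in JavaScript as follows. This is plain JavaScript rather than Picard's own scripting syntax, and the function name bracketize is hypothetical. JavaScript regex literals take case-insensitivity as the i flag, which stands in for the (?i) prefix here.

```javascript
// The "i" flag replaces (?i); the "g" flag replaces every qualifying pair.
const keywordParens = /\((?=[^()]*(?:dj|mix|version|inch))([^()]+)\)/gi;

// Hypothetical helper: swaps ( ) for [ ] only around text containing a keyword.
function bracketize(title) {
  return title.replace(keywordParens, '[$1]');
}

console.log(bracketize('DJ Fresh - Louder (Sian Evans) (Flux Pavilion & Doctor P Remix)'));
// DJ Fresh - Louder (Sian Evans) [Flux Pavilion & Doctor P Remix]
```

The "DJ" in the artist name is left alone because it is not inside parentheses, and "(Sian Evans)" is left alone because the look-ahead finds none of the keywords before the closing parenthesis.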

Regex to insert space in vim

I am a regex supernoob (just reading my first articles about them), and at the same time working towards stronger use of vim. I would like to use a regex to search for all instances of a colon : that are not followed by a space and insert one space between those colons and any character after them.
If I start with:
foo:bar
I would like to end with
foo: bar
I got as far as %s/:[a-z] but now I don't know what to do for the next part of the %s statement.
Also, how do I change the :[a-z] statement to make sure it catches anything that is not a space?
:%s/:\(\S\)/: \1/g
\S matches any character that is not whitespace, but you need to remember what that non-whitespace character is. This is what the \(\) does. You can then refer to it using \1 in the replacement.
So you match a :, some non-whitespace character and then replace it with a :, a space, and the captured character.
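The same capture-and-reinsert idea can be tried outside Vim. The sketch below is a JavaScript analogue, not Vim syntax: the group is written ( ) instead of \( \) and referenced as $1 instead of \1. The function name and sample text are made up for illustration.

```javascript
// :(\S) captures the non-whitespace character after the colon;
// $1 puts it back after the inserted space.
function addSpaceAfterColon(text) {
  return text.replace(/:(\S)/g, ': $1');
}

console.log(addSpaceAfterColon('foo:bar key: value a:b'));
// foo: bar key: value a: b
```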
Changing this to only modify the text when there's only one : is fairly straightforward. As others have suggested, using some of the zero-width assertions will be useful.
:%s/:\@<!:[^:[:space:]]\@=/: /g
:\@<! is a negative lookbehind: it asserts that the position is not preceded by a :, which also holds at the start of the line. This is an important characteristic of the negative lookahead/lookbehind assertions. It's not requiring that there actually be a character, just that there isn't a :.
: matches the required colon.
[^:[:space:]] introduces a couple more regex concepts.
The outer [] is a collection. A collection is used to match any of the characters listed inside. However, a leading ^ negates that match. So, [abc123] will match a, b, c, 1, 2, or 3, but [^abc123] matches anything but those characters.
[:space:] is a character class. Character classes can only be used inside a collection. [:space:] means, unsurprisingly, any whitespace. In most implementations, it relates directly to the result of the C library's isspace function.
Tying that all together, the collection means "match any character that is not a : or whitespace".
\@= is the positive lookahead assertion. It applies to the previous atom (in this case the collection) and means that the collection is required for the pattern to be a successful match, but will not be part of the text that is replaced.
So, whenever the pattern matches, we just replace the : with itself and a space.
You want to use a zero-width negative lookahead assertion, which is a fancy way of saying look for a character that's not a space, but don't include it in the match:
:%s/: \@!/: /g
The \@! is the negative lookahead.
An interesting feature of Vim regex is the presence of \zs and \ze. Other engines might have them too, but they're not very common.
The purpose of \zs is to mark the start of the match, and \ze the end of it. For example:
ab\zsc
matches c, only if before you have ab. Similarly:
a\zebc
matches a only if you have bc after it. You can mix both:
a\zsb\zec
matches b only if in between a and c. You can also create zero-width matches, which are ideal for what you're trying to do:
:%s/:\zs\ze\S/ /
Your search has no size, only a position, and you then substitute " " at that position. By the way, \S means any character except whitespace. Add the g flag if a line can contain more than one such colon.
:\zs\ze\S matches the position between a colon and something not a space.
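Engines without \zs and \ze can often express the same zero-width position with a lookbehind and a lookahead. The sketch below is a JavaScript analogue under that assumption (lookbehind needs an ES2018-capable engine); the function name is hypothetical.

```javascript
// (?<=:) means "preceded by a colon", (?=\S) means "followed by non-whitespace".
// The match itself is empty, so the replacement only inserts a space.
function insertSpace(line) {
  return line.replace(/(?<=:)(?=\S)/g, ' ');
}

console.log(insertSpace('foo:bar')); // foo: bar
console.log(insertSpace('a:b:c'));   // a: b: c
```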
You probably want to use :[^ ] to match everything except spaces. As mentioned by Matt, this will cause your replacement to overwrite the extra character.
There are several ways to avoid this; here are two that I find useful.
1) Surround the last part of the search term with parentheses \(\); this allows you to reference that part of the search in your replacement with \1.
Your final command should look like this:
%s/:\([^ ]\)/: \1/g
2) End the search term early with \ze. The entire search term must still be met for a match, but only the part before \ze is highlighted or replaced.
Your final replace string should look like this:
%s/:\ze[^ ]/: /g

What does /([^.]*)\.(.*)/ mean?

While searching for something, I found an answered question on this site. Two of the answers contain
/([^.]*)\.(.*)/
in their text.
The question is located at Find & replace jquery. I'm a newbie in JavaScript, so I wonder: what does it mean? Thanks.
/([^.]*)\.(.*)/
Let us deconstruct it. The beginning and trailing slash are delimiters, and mark the start and end of the regular expression.
Then there is a parenthesized group: ([^.]*). The parentheses group this part of the pattern and capture whatever it matches. The square brackets denote a "character group" (usually called a character class), meaning that any character inside this group is accepted in its place. However, this group is negated by the first character being ^, which reverses its meaning. Since the only character besides the negation is a period, this matches a single character that is not a period. After the square brackets is a * (asterisk), which means that the square brackets can be matched zero or more times.
Then we get to the \.. This is an escaped period. Periods in regular expressions have special meaning (except when escaped or in a character group). This matches a literal period in the text.
(.*) is a new parenthesized sub-group. This time, the period matches any character, and the asterisk says it can be repeated as many times as it needs to.
In summary, the expression finds any sequence of characters that contains no period, followed by a single period, followed by any sequence of characters.
Edit: Removed part about shortening, as it defeats the assumed purpose of the regular expression.
It's a regular expression (it matches non-periods, followed by a period followed by anything (think "file.ext")). And you should run, not walk, to learn about them. Explaining how this particular regular expression works isn't going to help you as you need to start simpler. So start with a regex tutorial and pick up Mastering Regular Expressions.
Original: /([^.]*)\.(.*)/
Split this as:
[1] ([^.]*) : It says match all characters except . [ period ]
[2] \. : match a period
[3] (.*) : matches any character
so it becomes
[1]Match all characters which are not . [ period ] [2] till you find a .[ period ] then [3] match all characters.
Anything except a dot, followed by a dot, followed by anything.
You can test regexes on regexpal
It's a regular expression that roughly searches for a string that doesn't contain a period, followed by a period, and then a string containing any characters.
That is a regular expression. Regular expressions are powerful tools if you use them right.
That particular regex extracts filename and extension from a string that looks like "file.ext".
It's a regular expression that splits a string into two parts: everything before the first period, and then the remainder. Most regex engines (including the Javascript one) allow you to then access those parts of the string separately (using $1 to refer to the first part, and $2 for the second part).
This is a regular expression with some advanced use.
Consider a simpler version: /[^.]*\..*/ which is the same as above without parentheses. This will match just any string with at least one dot. When the parentheses are added, and a match happens, the variables \1 and \2 will contain the matched parts from the parentheses. The first one will have anything before the first dot. The second part will have everything after the first dot.
Examples:
input: foo...bar
\1: foo
\2: ..bar
input: .foobar
\1:
\2: foobar
This regular expression generates two matching expressions that can be retrieved.
The two parts are the string before the first dot (which may be empty), and the string after the first dot (which may contain other dots).
The only restriction on the input is that it contain at least one dot. It will match "." contrary to some of the other answers, but the retrieved groups will be empty.
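The behaviour described in these answers can be checked with a short JavaScript sketch. The function name splitAtFirstDot is hypothetical; the inputs are the ones used in the examples above plus two extra edge cases.

```javascript
// Returns [before, after] around the first dot, or null if there is no dot.
function splitAtFirstDot(s) {
  const m = s.match(/([^.]*)\.(.*)/);
  return m ? [m[1], m[2]] : null;
}

console.log(splitAtFirstDot('file.ext'));  // [ 'file', 'ext' ]
console.log(splitAtFirstDot('foo...bar')); // [ 'foo', '..bar' ]
console.log(splitAtFirstDot('.foobar'));   // [ '', 'foobar' ]
console.log(splitAtFirstDot('.'));         // [ '', '' ]
console.log(splitAtFirstDot('nodot'));     // null
```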
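IMO /.*\..*/g would match the same overall text (although, if capture groups were added, the greedy .* would split at the last period rather than the first):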
const senExample = 'I am test. Food is good.';
const result1 = senExample.match(/([^.]*)\.(.*)/g);
console.log(result1); // ["I am test. Food is good."]
const result2 = senExample.match(/^.*\..*/g);
console.log(result2); // ["I am test. Food is good."]
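The overall matches above are identical, but the captured groups are not. The sketch below drops the g flag so that match returns the groups, and compares [^.]* with a greedy .* in the first group; the variable names are illustrative.

```javascript
const sample = 'I am test. Food is good.';

// [^.]* cannot cross a period, so the split happens at the first one.
const firstDot = sample.match(/([^.]*)\.(.*)/);
// A greedy .* backtracks from the end, so the split happens at the last one.
const lastDot = sample.match(/(.*)\.(.*)/);

console.log([firstDot[1], firstDot[2]]); // [ 'I am test', ' Food is good.' ]
console.log([lastDot[1], lastDot[2]]);   // [ 'I am test. Food is good', '' ]
```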
The . character matches any character except line break characters such as \r or \n.
Inside square brackets, a leading ^ negates the class (in this case, the dot).
The * means "zero or more times".
The parentheses group and capture.
The \ allows you to match a special character (like the dot or the star) literally.
So ([^.]*) means any character other than a literal dot, repeated zero or more times; inside square brackets the dot loses its special meaning, so this part can match line breaks too.
The (.*) part means any string of characters, zero or more times (except line breaks),
and the \. means a real dot.
So the whole thing matches zero or more non-dot characters, followed by a dot, followed by any number of characters up to the end of the line.
For more information and a really great reference on Regular Expressions check out: http://www.regular-expressions.info/reference.html
It's a regular expression, which basically is a pattern of characters that is used to describe another pattern of characters. I once used regexps to find an email address inside a text file, and they can be used to find pretty much any pattern of text within a larger body of text provided you write the regexp properly.