Regular expression to match single quotes, double quotes and/or space

Regular expression to match single quotes, double quotes and/or space - regex

I have a regular expression looking for width=["|\']([^"]*)["|\']
works great when looking for width="750" and width='750' however it does not match width=750
so I got it as far as width=["|\']?([^"]*)["|\'] for optional first quote but the match just continues on and does not return just 750

If you are using a tool or language that supports backreferences, you should be able to use the following:
width=("|'|)(\S*)\1
This will try to match a single quote, double quote, or empty string with the first capture group, and then the \1 at the end will be whatever the first group captured. The value will always be the contents from the second capture group.
I also changed the [^"]* to \S* so this will match any number of non-whitespace characters. This is necessary to make sure that your match doesn't just go to the end of the string when there is no quotes around the value.
Example: http://rubular.com/r/Xg8ageZmgy

Character classes ([]) do not make use of | to mean or; they automatically or everything. You also don't have to escape the single quote (unless of course you're enclosing this whole expression in single quotes). You want:
["' ]?([^"' ]*)["' ]

Try this one:
width\s*=\s*(?:["\']([^"\']*)["\']|\S+)
I just added the \S+ to handle 700 after equal sign as OR condition. Also you do not need to place | inside the character class []
\s* means optional white spaces(zero or more times).

Which regular expression language are you using? Different languages have different details of syntax, so someone might give you an answer that works in their environment but not in yours.
For example, I copied your expression and tried it on some text in Emacs. It found a match in this text:
width=|750|
That's because Emacs regex doesn't use the '|' character to signify "either or" within the '[' and ']' brackets; it interprets it as just one more example of a character that the expression might match.
Also, it looks like your expression doesn't always stop after the 750 in this example:
width='750'
Instead, if there is a '"' character later in the input, it matches everything from the 750 up to that character. (It did the same thing with my earlier example in Emacs if there was a '"' later in the input.)
You will also match the 750 in this (note the mismatched quotation marks):
width='750"
Is that a problem, or is that an acceptable outcome?

Related

Regex - Find everything between OR-operators except OR between quotes

I need some help with a Regex. I have a query, that should be splitted between all OR-operators. But if the OR is inside of quotes, it should not splitted.
Example:
This is the query:
"test1" OR "test2.1 OR test2.2" OR test3 OR test4:"test4.1 OR test4.2"
Expression 1: I need everything between the OR-operators or start/end of line... (This is not working)
(^|OR).*?(OR|$)
Expression 2: ...except of the ORs between quotes:
"(.*?)"
The result should be:
"test1"
"test2.1 OR test2.2"
test3
test4:"test4.1 OR test4.2"
How can I make the first expression work and how can I combine these both expressions?
Thank you for help!

It's unclear what the grammar of your expression is, so I just make a bunch of assumptions and come up with this regex to match the tokens between OR:
\G(\w+(?::"[^"]*")?|"[^"]*")(?:(\s+OR\s+)|\s*$)
Demo at regex101
I assume that between OR, it can be an identifier \w+, an identifier with some string \w+:"[^"]*", or a string literal "[^"]*".
Feel free to substitute your own definition of string literal - I'm using the simplest (and broken) specification "[^"]*" as example.
In every match, the regex starts from where the last match left off (or the beginning of the string) and matches one token (as described above), followed by OR or the end of the input string.
The capturing groups at (\s+OR\s+) is deliberate - you will need this to check whether the last match actually terminates at the end of the string or not, or whether the input is malformed.
Caveat
Do note that while my solution produces the expected result for this case, without a full specification of the grammar of the expression, it's not possible to cater for all possible cases you may want to handle.

(?:^|OR(?=(?:[^"]*"[^"]*")*+[^"]*$))([\s\S]*?)(?=OR(?=(?:[^"]*"[^"]*")*+[^"]*$)|$)
You can use this and capture the groups.See demo.
https://regex101.com/r/xC4rJ3/12

Try to match everything in quotes or not-OR with:
(?:"[^"]+"|\b(?:(?!\bOR\b)[^"])+)+
DEMO

This regex works optimally (though it be subject to improvement with a more detailed specification):
(?<!\S)(?!OR\s)[^\s"]*(?:"[^"]*"[^\s"]*)*
DEMO
(?<!\S) ensures the match starts at the beginning of the string or after a whitespace character.
(?!OR\s) prevents it from matching OR
[^\s"]*(?:"[^"]*"[^\s"]*)* matches a contiguous series of, in any order:
sequences of non-whitespace, non-quote characters, or
a pair of quotes enclosing anything except quotes.
However, I notice that all the tokens in your example consist of:
a non-quote, non-whitespace sequence (NQ),
a quoted sequence (Q), or
an NQ followed immediately by a Q.
If you expect all tokens to match that pattern, you can change the regex to this:
(?<!\S)(?!OR\s)(?:[^\s"]*"[^"]*"|[^\s"]+)
According to Regex101, it's slightly more efficient (but probably not enough to matter).
DEMO

Regular expression: page path starts with "/posts/" and ends with ".html"

I'm stuck here:
=~^/posts/(*).html
but it doesn't work
I need something that can recognise something like this:
/posts/testing.html
/posts/another-testing-issue.html
And I'm not very good using RegEx
Can anyone help me please?
EDIT:
Floris had the right answer:
^/posts/.*html$
thank you!

Briefly, the expression you need is
^\/posts\/.*\.html$
Explanation:
^ start of string
\/posts\/ literal string '/posts/'
the backslash "protects" the forward slash -
it is called "escaping", and removes any special meaning it might have
(in some applications the / would be a delimiter)
.* any number of characters
\. literal '.'
html literal 'html'
$ end of string
Now for a bit more background on regex syntax…
A
s #Peter points out in the comment, a quantifier follows "the thing to quantify". In most (all?) regex syntaxes, writing (*) will generate the error preceding token is not quantifiable. You need something in front of the *, and a ( doesn't count (unless it was escaped).
This is where the dot comes in. The dot . means "any character at all. That is its usual meaning, which is why.*` is just about the most common thing in regular expressions, meaning "I don't care about the next bit…" (usually up to an "until" - whatever follows).
Because the dot has a special meaning, when you want the exact string .html, you need to write it as \.html (there's that escape backslash again to remove the special meaning from the dot).
As a final tweak, it is not uncommon to have an extension like .htm - so you could write your expression as
\/posts\/.*\.html?$
This would make the last character, the l, optional (the ? means "zero or one times the preceding expression, which in this case is the single character immediately before it).
You can see this at work at http://regex101.com/r/bK5yC7 - it is a wonderful tool for exploring regular expressions, and also gives a nice explanation (breakdown) of every expression you type (with highlighting of any errors)

You missed a dot as single character match and didn't escape the second one as being literal:
^/posts/(.*)\.html

In most of regular expression . mean any character and * means multiplicity, so try to fix to
^/posts/(.*)\.html
\ is escape character

What's the meaning of this perl regex expression?

the regex expression is as below:
if ($ftxt =~ m|/([^=]+)="(.+)"|o)
{
.....
}
this regex seems different from many other regex.What makes me confused is the "|" ,most regex use "/" instead of "|". And , group ([^=]+) also makes me confused.I know [^=] means "the start of the string" or "=",but what does it mean by repeat '^' one or more times? ,how to explain this?

You can use different delimiters instead of /. For instance you could use:
m#/([^=]+)="(.+)"#o
Or
m~/([^=]+)="(.+)"~o
The advantage here of using something different than / is that you don't have to escape slashes, because otherwise, you'd have to use:
m/\/([^=]+)="(.+)"/o
^
[Or [/]]
([^=]+) is a capture group, and inside, you have [^=]+. [^=] is a negated class and will match any character which is not a =.
^ behaves differently at the beginning of a character class and is not the same as ^ outside a character class which means 'beginning of line'.
As for the last part o, this is a flag which I haven't met so far so a little search brought me to this post, I quote:
The /o modifier is in the perlop documentation instead of the perlre documentation since it is a quote-like modifier rather than a regex modifier. That has always seemed odd to me, but that's how it is.
Before Perl 5.6, Perl would recompile the regex even if the variable had not changed. You don't need to do that anymore. You could use /o to compile the regex once despite further changes to the variable, but as the other answers noted, qr// is better for that.

Some regexp implementations allow you to use other special characters besides / as the delimiter. This is useful if you need to use that special character inside the regular expression itself, since you don't have to escape it. (In and of itself / is not a special character in regexp syntax, but it needs escaping if it's used in the regexp literal syntax of the host language.) The docs on Perl's quote operators mention this.
This is tutorial-level stuff: square brackets ([abc]) denote a character class - it means "any of the characters inside the brackets". (In my example, it means "either a or b or c.) Inside them, the ^ special character has a different meaning, it inverts the character class. So, [^=] means "any character except =", and [^=]+ means "one or more characters that aren't =".
Quoting the docs on Perl's RE syntax:
You can specify a character class, by enclosing a list of characters in [] , which will match any character from the list. If the first character after the "[" is "^", the class matches any character not in the list.

It is meant to match equation like expressions, to capture the key and values separately. Imagine you have a statement like height="30px", and you want to capture the height attribute name, as well as its value 30px.
So you have m|/([^=]+)="(.+)"|.
The key is supposed to be everything before the = is encountered. So [^=] captures it. The ^ is a negation metacharacter when used as the first character inside [] brackets. It means that it will match any character except =, which is what you want. The / is probably a mistake, if you need to capture the group, you should not use it, or if it is indeed intended, it means to literally match an opening parentheses. Since it is a special character, it needs to be escaped, that's why \(. if you mean to capture the group, it should be ([^=]+).
Next comes the = sign, which you don't care about. Then the quotes which contain the value. So you capture it like "(.+)". the .+ will go on matching greedily every character, including the final ". But then it will find that it can't match the final " in the regex, so it will backtrack, give up the last " the regex (.+) captured, so that leaves the string within the quotes to be captured in the group. Now you are ready to access the key and value through $1 and $2. Cool, isn't it?

RegEx with strange behaviour: matching String with back reference to allow escaping and single and double quotes

Matching a string that allows escaping is not that difficult.
Look here: http://ad.hominem.org/log/2005/05/quoted_strings.php.
For the sake of simplicity I chose the approach, where a string is divided into two "atoms": either a character that is "not a quote or backslash" or a backslash followed by any character.
"(([^"\\]|\\.)*)"
The obvious improvement now is, to allow different quotes and use a backreference.
(["'])((\\.|[^\1\\])*?)\1
Also multiple backslashes are interpreted correctly.
Now to the part, where it gets weird: I have to parse some variables like this (note the missing backslash in the first variable value):
test = 'foo'bar'
var = 'lol'
int = 7
So I wrote quite an expression. I found out that the following part of it does not work as expected (only difference to the above expression is the appended "([\r\n]+)"):
(["'])((\\.|[^\1\\])*?)\1([\r\n]+)
Despite the missing backslash, 'foo'bar' is matched. I used RegExr by gskinner for this (online tool) but PHP (PCRE) has the same behaviour.
To fix this, you can hardcode the quote by replacing the backreferences with '. Then it works as expected.
Does this mean the backreference does actually not work in this case? And what does this have to do with the linebreak characters, it worked without it?

You can't use a backreference inside a character class; \1 will be interpreted as octal 1 in this case (at least in some regex engines, I don't know if this is universally true).
So instead try the following:
(["'])(?:\\.|(?!\1).)*\1(?:[\r\n]+)
or, as a verbose regex:
(["']) # match a quote
(?: # either match...
\\. # an escaped character
| # or
(?!\1). # any character except the previously matched quote
)* # any number of times
\1 # then match the previously matched quote again
(?:[\r\n]+) # plus one or more linebreak characters.
Edit: Removed some unnecessary parentheses and changed some into non-capturing parentheses.
Your regex insists on finding at least one carriage return after the matched string - why? What if it's the last line of your file? Or if there is a comment or whitespace after the string? You probably should drop that part completely.
Also note that you don't have to make the * lazy for this to work - the regex can't cross an unescaped quote character - and that you don't have to check for backslashes in the second part of the alternation since all backslashes have already been scooped up by the first part of the alternation (?:\\.|(?!\1).). That's why this part has to be first.

What does /([^.])\.(.)/ mean?

When I searched about something, I found an answered question in this site. 2 of the answers contain
/([^.]*)\.(.*)/
on their answer.
The question is located at Find & replace jquery. I'm newbie in javascript, so I wonder, what does it mean? Thanks.

/([^.]*)\.(.*)/
Let us deconstruct it. The beginning and trailing slash are delimiters, and mark the start and end of the regular expression.
Then there is a parenthesized group: ([^.]*) The parentheseis are there just to group a string together. The square brackets denote a "character group", meaning that any character inside this group is accepted in its place. However, this group is negated by the first character being ^, which reverse its meaning. Since the only character beside the negation is a period, this matches a single character that is not a period. After the square brackets is a * (asterisk), which means that the square brackets can be matched zero or more times.
Then we get to the \.. This is an escaped period. Periods in regular expressions have special meaning (except when escaped or in a character group). This matches a literal period in the text.
(.*) is a new paranthesized sub-group. This time, the period matches any character, and the asterisk says it can be repeated as many times as needs to.
In summary, the expression finds any sequence of characters (that isn't a period), followed by a single period, again followed by any character.
Edit: Removed part about shortening, as it defeats the assumed purpose of the regular expression.

It's a regular expression (it matches non-periods, followed by a period followed by anything (think "file.ext")). And you should run, not walk, to learn about them. Explaining how this particular regular expression works isn't going to help you as you need to start simpler. So start with a regex tutorial and pick up Mastering Regular Expressions.

Original: /([^.]*)\.(.*)/
Split this as:
[1] ([^.]*) : It says match all characters except . [ period ]
[2] \. : match a period
[3] (.*) : matches any character
so it becomes
[1]Match all characters which are not . [ period ] [2] till you find a .[ period ] then [3] match all characters.

Anything except a dot, followed by a dot, followed by anything.
You can test regex'es on regexpal

It's a regular expression that roughly searches for a string that doesn't contain a period, followed by a period, and then a string containing any characters.

That is a regular expression. Regular expressions are powerful tools if you use them right.
That particular regex extracts filename and extension from a string that looks like "file.ext".

It's a regular expression that splits a string into two parts: everything before the first period, and then the remainder. Most regex engines (including the Javascript one) allow you to then access those parts of the string separately (using $1 to refer to the first part, and $2 for the second part).

This is a regular expression with some advanced use.
Consider a simpler version: /[^.]*\..*/ which is the same as above without parentheses. This will match just any string with at least one dot. When the parentheses are added, and a match happens, the variables \1 and \2 will contain the matched parts from the parentheses. The first one will have anything before the first dot. The second part will have everything after the first dot.
Examples:
input: foo...bar
\1: foo
\2: ..bar
input: .foobar
\1:
\2: foobar

This regular expression generates two matching expressions that can be retrieved.
The two parts are the string before the first dot (which may be empty), and the string after the first dot (which may contain other dots).
The only restriction on the input is that it contain at least one dot. It will match "." contrary to some of the other answers, but the retrived groups will be empty.

IMO /.*\..*/g Would do the same thing.
const senExample = 'I am test. Food is good.';
const result1 = senExample.match(/([^.]*)\.(.*)/g);
console.log(result1); // ["I am test. Food is good."]
const result2 = senExample.match(/^.*\..*/g);
console.log(result2); // ["I am test. Food is good."]

the . character matches any character except line break characters the \r or \n.
the ^ negates what follows it (in this case the dot)
the * means "zero or more times"
the parentheses group and capture,
the \ allows you to match a special character (like the dot or the star)
so this ([^.]*) means any line break repeated zero or more times (it just eats up carriage returns).
this (.*) part means any string of characters zero or more times (except the line breaks)
and the \. means a real dot
so the whole thing would match zero or more line breaks followed by a dot followed by any number of characters.
For more information and a really great reference on Regular Expressions check out: http://www.regular-expressions.info/reference.html

It's a regular expression, which basically is a pattern of characters that is used to describe another pattern of characters. I once used regexps to find an email address inside a text file, and they can be used to find pretty much any pattern of text within a larger body of text provided you write the regexp properly.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js