Check array syntax with Regex

I'm trying to create a regex that checks whether a string is a valid path for a Firestore document.
I want a regex that tests whether a string:
starts with a letter: ^([a-z]{1})
after the first character, contains only letters/digits and/or dots: \w*(.?\w+){0,}
optionally ends with an array index: (\[{1}\d+\]{1})?$
The first and second points work well, but the last group doesn't. I tested a string like data.images[11 and the regex returned true.

First of all, you can shorten some quantifiers in your regex:
{1} -> can be omitted completely
{0,} -> *
Your second part could be expressed like this, which also improves readability:
[\w.]* meaning: take any character inside the brackets 0 to n times. The bracket expression also supports predefined classes, so we are using \w here. The dot INSIDE the brackets doesn't need to be escaped; it simply means the literal dot character.
So your parts would be:
^([a-z])
[\w.]*
(\[\d+\])?$
I hope this helps. According to regexpal it matches data.images[11], but not data.images[11. It also seems to meet all your requirements.
EDIT:
Your second part doesn't work because (as Asocia stated in their answer) you need to escape the dot. An unescaped dot is a class meaning "any character" (depending on the regex engine and settings, sometimes even line breaks), so in (.?\w+) it can also match the opening bracket [. Since you mean a literal dot, you need to escape it.
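To check the three parts together, here is a minimal sketch in Python. It is only an illustration: the helper name is_valid_path is made up, and Python's re module stands in for whatever engine you actually use.

```python
import re

# Pattern from the question: the unescaped dot in (.?\w+) can match "[".
ORIGINAL = re.compile(r'^([a-z]{1})\w*(.?\w+){0,}(\[{1}\d+\]{1})?$')

# Pattern from this answer: a first letter, then word characters or dots,
# then an optional array index such as [11].
FIXED = re.compile(r'^([a-z])[\w.]*(\[\d+\])?$')


def is_valid_path(path, pattern=FIXED):
    """Hypothetical helper: True if the pattern matches the whole path."""
    return pattern.match(path) is not None


print(is_valid_path('data.images[11]'))           # True
print(is_valid_path('data.images[11'))            # False
print(is_valid_path('data.images[11', ORIGINAL))  # True: the bug from the question
```

Note that \w also accepts the underscore, so tighten the class if your paths must not contain it.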

Regex - Find everything between OR-operators except OR between quotes

I need some help with a regex. I have a query that should be split at every OR operator. But if the OR is inside quotes, the query should not be split there.
Example:
This is the query:
"test1" OR "test2.1 OR test2.2" OR test3 OR test4:"test4.1 OR test4.2"
Expression 1: I need everything between the OR-operators or start/end of line... (This is not working)
(^|OR).*?(OR|$)
Expression 2: ...except the ORs between quotes:
"(.*?)"
The result should be:
"test1"
"test2.1 OR test2.2"
test3
test4:"test4.1 OR test4.2"
How can I make the first expression work, and how can I combine the two expressions?
Thank you for your help!
It's unclear what the grammar of your expression is, so I made a number of assumptions and came up with this regex to match the tokens between ORs:
\G(\w+(?::"[^"]*")?|"[^"]*")(?:(\s+OR\s+)|\s*$)
Demo at regex101
I assume that between ORs there can be an identifier \w+, an identifier followed by a string \w+:"[^"]*", or a string literal "[^"]*".
Feel free to substitute your own definition of a string literal - I'm using the simplest (and broken) specification "[^"]*" as an example.
In every match, the regex starts from where the last match left off (or the beginning of the string) and matches one token (as described above), followed by OR or the end of the input string.
The capturing group (\s+OR\s+) is deliberate - you will need it to check whether the last match actually ends at the end of the string, or whether the input is malformed.
Caveat
Do note that while my solution produces the expected result for this case, without a full specification of the grammar of the expression, it's not possible to cater for all possible cases you may want to handle.
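Not every engine supports \G (Python's re module does not, for example). As a rough sketch under that assumption, the same idea can be emulated by calling match() at an explicit offset in a loop. The function name split_on_or and the error messages are made up for illustration.

```python
import re

# Same token pattern as above, minus \G (not supported by Python's re module).
TOKEN = re.compile(r'(\w+(?::"[^"]*")?|"[^"]*")(?:(\s+OR\s+)|\s*$)')


def split_on_or(query):
    """Split query at top-level ORs; raise ValueError if it is malformed."""
    tokens = []
    pos = 0
    trailing_or = False
    while pos < len(query):
        # match() is anchored at pos, which plays the role of \G here.
        m = TOKEN.match(query, pos)
        if m is None:
            raise ValueError(f'unexpected input at offset {pos}')
        tokens.append(m.group(1))
        # Group 2 is set only when the token was followed by OR.
        trailing_or = m.group(2) is not None
        pos = m.end()
    if trailing_or:
        raise ValueError('query ends with a dangling OR')
    return tokens


query = '"test1" OR "test2.1 OR test2.2" OR test3 OR test4:"test4.1 OR test4.2"'
for token in split_on_or(query):
    print(token)
```

This is where the second capturing group pays off: it is the only way the loop can tell a clean end of input from a dangling OR.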
(?:^|OR(?=(?:[^"]*"[^"]*")*+[^"]*$))([\s\S]*?)(?=OR(?=(?:[^"]*"[^"]*")*+[^"]*$)|$)
You can use this and capture the groups. See demo.
https://regex101.com/r/xC4rJ3/12
Try to match everything in quotes or not-OR with:
(?:"[^"]+"|\b(?:(?!\bOR\b)[^"])+)+
DEMO
This regex works for your example (though it could be improved given a more detailed specification):
(?<!\S)(?!OR\s)[^\s"]*(?:"[^"]*"[^\s"]*)*
DEMO
(?<!\S) ensures the match starts at the beginning of the string or after a whitespace character.
(?!OR\s) prevents it from matching OR
[^\s"]*(?:"[^"]*"[^\s"]*)* matches a contiguous series of, in any order:
sequences of non-whitespace, non-quote characters, or
a pair of quotes enclosing anything except quotes.
However, I notice that all the tokens in your example consist of:
a non-quote, non-whitespace sequence (NQ),
a quoted sequence (Q), or
an NQ followed immediately by a Q.
If you expect all tokens to match that pattern, you can change the regex to this:
(?<!\S)(?!OR\s)(?:[^\s"]*"[^"]*"|[^\s"]+)
According to Regex101, it's slightly more efficient (but probably not enough to matter).
DEMO
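For anyone who wants to try both variants outside Regex101, here is a small Python sketch. The names GENERAL, STRICT and find_tokens are made up. The first pattern can match the empty string (for instance between two consecutive spaces), so the sketch drops empty hits.

```python
import re

# First variant: any mix of unquoted runs and quoted parts.
GENERAL = re.compile(r'(?<!\S)(?!OR\s)[^\s"]*(?:"[^"]*"[^\s"]*)*')

# Second variant: NQ, Q, or NQ immediately followed by Q.
STRICT = re.compile(r'(?<!\S)(?!OR\s)(?:[^\s"]*"[^"]*"|[^\s"]+)')


def find_tokens(pattern, text):
    """Return all non-empty matches of pattern in text."""
    return [hit for hit in pattern.findall(text) if hit]


query = '"test1" OR "test2.1 OR test2.2" OR test3 OR test4:"test4.1 OR test4.2"'
for token in find_tokens(STRICT, query):
    print(token)
```

Both patterns give the same four tokens for the query from the question.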

Perl code understanding

I am new to the Perl language and have been trying to understand the code below:
if ( $nextvalue !~ /^.+"[^ ]+ \/cs\/.+\sHTTP\/[1-9]\.[0-9]"|\/\/|\/Images\/fold\/1.jpg|\/busines|\/Type= OPTIONS|\/203.176.111.126/)
Can you please help me understand what it is meant to do?
The condition is true when $nextvalue does NOT match the following regular expression.
The regular expression matches if the string
either
starts with at least one character,
followed by a double quote ("),
followed by at least one character that is not a space,
followed by a space ( ),
followed by the string "/cs/",
followed by at least one character,
followed by a whitespace character and the string "HTTP/",
followed by one digit from 1 to 9 inclusive,
followed by a dot,
followed by one digit from 0 to 9,
followed by a double quote (")
or contains two forward slashes (//)
or contains the substring "/Images/fold/1.jpg" (the unescaped dot matches any character, so "/Images/fold/1xjpg" also matches)
or contains the substring "/busines"
or contains the substring "/Type= OPTIONS"
or contains the substring "/203.176.111.126" (again, each unescaped dot matches any character)
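If it helps to experiment, the condition can be mirrored in a few lines of Python. This is a sketch, not the original Perl: the function name condition_is_true and all sample lines are invented, and Perl's \/ is written as a plain / because Python needs no delimiter escaping.

```python
import re

# The regex from the question. "^" belongs only to the first alternative;
# the remaining alternatives may match anywhere in the string.
PATTERN = re.compile(
    r'^.+"[^ ]+ /cs/.+\sHTTP/[1-9]\.[0-9]"'
    r'|//|/Images/fold/1.jpg|/busines|/Type= OPTIONS|/203.176.111.126'
)


def condition_is_true(nextvalue):
    """Mirror of `$nextvalue !~ /.../`: True when the regex does NOT match."""
    return PATTERN.search(nextvalue) is None


# Hypothetical log lines, made up for this example.
print(condition_is_true('10.0.0.1 - - "GET /cs/page HTTP/1.1" 200'))     # False
print(condition_is_true('10.0.0.1 - - "GET /index.html HTTP/1.1" 200'))  # True
print(condition_is_true('GET http://example.com/'))                      # False
print(condition_is_true('GET /Images/fold/1xjpg'))                       # False
```

The last line shows the effect of the unescaped dot in 1.jpg.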
Whenever I am unsure what some cryptic regular expression does, I turn to Debuggex:
^.+"[^ ]+ \/cs\/.+\sHTTP\/[1-9]\.[0-9]"|\/\/|\/Images\/fold\/1.jpg|\/busines|\/Type= OPTIONS|\/203.176.111.126
Debuggex Demo
This is a railroad diagram: every string that has a substring fitting the description along any of the grey tracks will match your regex. As your condition uses !~, meaning "does not match", those strings will then fail the check.
Debuggex certainly has issues (for example, it displays ^ as is, so you have to know that it means the beginning of the string; the same goes for dots and other metacharacters, and whitespace shows up as underscores), but it certainly helps in understanding the structure of the expression and possibly gives you an idea of what the author had in mind.

Regular expression: page path starts with "/posts/" and ends with ".html"

I'm stuck here:
=~^/posts/(*).html
but it doesn't work
I need something that can recognise something like this:
/posts/testing.html
/posts/another-testing-issue.html
I'm not very good with RegEx.
Can anyone help me, please?
EDIT:
Floris had the right answer:
^/posts/.*html$
thank you!
Briefly, the expression you need is
^\/posts\/.*\.html$
Explanation:
^ start of string
\/posts\/ literal string '/posts/'
the backslash "protects" the forward slash -
it is called "escaping", and removes any special meaning it might have
(in some applications the / would be a delimiter)
.* any number of characters
\. literal '.'
html literal 'html'
$ end of string
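As a quick sanity check, here is the expression in a short Python sketch. The helper name is_post_page and the two non-matching paths are made up. In Python the forward slash is not a delimiter, so it is left unescaped.

```python
import re

# Same expression as above, without the backslashes in front of "/".
POSTS = re.compile(r'^/posts/.*\.html$')


def is_post_page(path):
    """Hypothetical helper: True if path starts with /posts/ and ends with .html."""
    return POSTS.match(path) is not None


for path in ['/posts/testing.html',
             '/posts/another-testing-issue.html',
             '/pages/testing.html',
             '/posts/testing.html.bak']:
    print(path, is_post_page(path))
```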
Now for a bit more background on regex syntax…
As @Peter points out in the comment, a quantifier follows "the thing to quantify". In most (all?) regex syntaxes, writing (*) will generate the error preceding token is not quantifiable. You need something in front of the *, and a ( doesn't count (unless it was escaped).
This is where the dot comes in. The dot . means "any character at all". That is its usual meaning, which is why .* is just about the most common thing in regular expressions, meaning "I don't care about the next bit…" (usually up to an "until" - whatever follows).
Because the dot has a special meaning, when you want the exact string .html, you need to write it as \.html (there's that escape backslash again to remove the special meaning from the dot).
As a final tweak, it is not uncommon to have an extension like .htm - so you could write your expression as
\/posts\/.*\.html?$
This would make the last character, the l, optional (the ? means "zero or one times the preceding expression", which in this case is the single character immediately before it).
You can see this at work at http://regex101.com/r/bK5yC7 - it is a wonderful tool for exploring regular expressions, and also gives a nice explanation (breakdown) of every expression you type (with highlighting of any errors)
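The points above can also be tried in a few lines of Python. This is a sketch with made-up sample paths. Python words the quantifier error differently from regex101, but the cause is the same.

```python
import re

# "(*)" has nothing in front of the star, so the pattern is rejected.
try:
    re.compile(r'^/posts/(*).html')
    compiled = True
except re.error:
    compiled = False
print(compiled)  # False

# Unescaped dot versus escaped dot, plus the optional "l".
unescaped = re.compile(r'^/posts/.*html$')
escaped = re.compile(r'/posts/.*\.html?$')

print(bool(unescaped.search('/posts/xhtml')))  # True: no literal dot is required
print(bool(escaped.search('/posts/xhtml')))    # False
print(bool(escaped.search('/posts/a.html')))   # True
print(bool(escaped.search('/posts/a.htm')))    # True: the "l" is optional
```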
You left out the dot that matches a single character in front of the *, and you didn't escape the second dot to make it literal:
^/posts/(.*)\.html
In most regular expression flavors, . means any character and * means repetition of the preceding item, so try changing it to
^/posts/(.*)\.html
\ is the escape character

Limiting RegEx to match only a string of 1-254 characters length

This is my RegEx:
"^[^\.]([\w-\!\#\$\%\&\'\*\+\-\/\=\`\{\|\}\~\?\^]+)([\.]{0,1})([\w-\!\#\$\%\&\'\*\+\-\/\=\`\{\|\}\~\?\^]+)[^\.]#((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([\w-]+\.)+))([a-zA-Z]{2,6}|[0-9]{1,3})(\]?)$"
I need to match only strings less than 255 characters.
I've tried adding the word boundaries at the start of the RegEx but it fails:
"^(?=.{1,254})[^\.]([\w-\!\#\$\%\&\'\*\+\-\/\=\`\{\|\}\~\?\^]+)([\.]{0,1})([\w-\!\#\$\%\&\'\*\+\-\/\=\`\{\|\}\~\?\^]+)[^\.]#((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([\w-]+\.)+))([a-zA-Z]{2,6}|[0-9]{1,3})(\]?)$"
You need the $ in the lookahead to make sure it's only up to 254. Otherwise, the lookahead will match even when there are more than 254.
(?=.{1,254}$)
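To see the difference the $ makes, here is a minimal Python sketch. The body \S+ is only a placeholder for the real address pattern, and the names NO_END, WITH_END and within_limit are made up.

```python
import re

# Lookahead without "$": only checks that at least 1 character exists
# (and up to 254 can be consumed), so longer inputs still pass.
NO_END = re.compile(r'^(?=.{1,254})\S+$')

# Lookahead with "$": the 1 to 254 characters must reach the end of the input.
WITH_END = re.compile(r'^(?=.{1,254}$)\S+$')


def within_limit(text, pattern=WITH_END):
    """Hypothetical helper: True if the pattern accepts the text."""
    return pattern.match(text) is not None


print(within_limit('a' * 300, NO_END))  # True: the limit is not enforced
print(within_limit('a' * 300))          # False
print(within_limit('a' * 254))          # True
print(within_limit('a' * 255))          # False
```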
Also, keep in mind that you can greatly simplify your regex because many characters that would usually need to be escaped do not need to when in a character class (square brackets).
"[\w-\!\#\$\%\&\'\*\+\-\/\=\`\{\|\}\~\?\^]"
is the same as this:
"[-\w!#$%&'*+/=`{|}~?^]"
Note that the dash must be first (or last) in the character class to be a literal dash without escaping, and the caret must not be first.
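A tiny Python sketch of those two placement rules, using the simplified class from above. The names ALLOWED and NEGATED are made up, and other engines may differ in small details.

```python
import re

# Dash first: literal dash. Caret not first: literal caret.
ALLOWED = re.compile(r"[-\w!#$%&'*+/=`{|}~?^]")

# Caret first: the whole class is negated.
NEGATED = re.compile(r"[^-\w]")

print(bool(ALLOWED.fullmatch('-')))  # True
print(bool(ALLOWED.fullmatch('^')))  # True
print(bool(ALLOWED.fullmatch('.')))  # False: the period is not in the class
print(bool(NEGATED.fullmatch('-')))  # False
print(bool(NEGATED.fullmatch('.')))  # True
```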
With some other simplifications, here is the complete string:
"^(?=.{1,254}$)[-\w!#$%&'*+/=`{|}~?^]+(\.[-\w!#$%&'*+/=`{|}~?^]+)*#((\d{1,3}\.){3}\d{1,3}|([-\w]+\.)+[a-zA-Z]{2,6})$"
Notes:
I removed the stipulation that the first char shouldn't be a period ([^.]) because the next character class doesn't match a period anyway, so it's redundant.
I removed many extraneous parens
I replaced [0-9] with \d
I replaced {0,1} with the shorthand "?"
After the # sign, it seemed that you were trying to match an IP address or text domain name, so I separated them more so it couldn't be a combination
I'm not sure what the optional square bracket at the end was for, so I removed it: "(]?)"
I tried it in Regex Hero, and it works. See if it works for you.
This depends on what language you are working in. In Python, for example, you can use a regex to split a text into separate strings, and then use len() to discard strings longer than the 254 characters you allow.
I think this post will help. It shows how to limit certain patterns but I am not sure how you would add it to the entire regex.

What does /([^.]*)\.(.*)/ mean?

While searching for something, I found an answered question on this site. Two of the answers contain
/([^.]*)\.(.*)/
in their text.
The question is located at Find & replace jquery. I'm a newbie in JavaScript, so I wonder: what does it mean? Thanks.
/([^.]*)\.(.*)/
Let us deconstruct it. The beginning and trailing slash are delimiters, and mark the start and end of the regular expression.
Then there is a parenthesized group: ([^.]*) The parentheses group part of the pattern together (and capture what it matches). The square brackets denote a "character group", meaning that any character inside this group is accepted in its place. However, this group is negated by the first character being ^, which reverses its meaning. Since the only character besides the negation is a period, this matches a single character that is not a period. After the square brackets is a * (asterisk), which means that the square brackets can be matched zero or more times.
Then we get to the \.. This is an escaped period. Periods in regular expressions have special meaning (except when escaped or in a character group). This matches a literal period in the text.
(.*) is a new parenthesized sub-group. This time, the period matches any character, and the asterisk says it can be repeated as many times as needed.
In summary, the expression finds any sequence of characters that aren't periods, followed by a single period, followed by any sequence of characters.
Edit: Removed part about shortening, as it defeats the assumed purpose of the regular expression.
It's a regular expression (it matches non-periods, followed by a period followed by anything (think "file.ext")). And you should run, not walk, to learn about them. Explaining how this particular regular expression works isn't going to help you as you need to start simpler. So start with a regex tutorial and pick up Mastering Regular Expressions.
Original: /([^.]*)\.(.*)/
Split this as:
[1] ([^.]*) : It says match all characters except . [ period ]
[2] \. : match a period
[3] (.*) : matches any character
so it becomes
[1] Match all characters which are not a . [ period ] [2] until you find a . [ period ], then [3] match all characters.
Anything except a dot, followed by a dot, followed by anything.
You can test regexes on regexpal.
It's a regular expression that roughly searches for a string that doesn't contain a period, followed by a period, and then a string containing any characters.
That is a regular expression. Regular expressions are powerful tools if you use them right.
That particular regex extracts filename and extension from a string that looks like "file.ext".
It's a regular expression that splits a string into two parts: everything before the first period, and then the remainder. Most regex engines (including the Javascript one) allow you to then access those parts of the string separately (using $1 to refer to the first part, and $2 for the second part).
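As an illustration of using the two parts separately, here is a sketch in Python rather than JavaScript. Python writes the references as \1 and \2 in the replacement string where JavaScript's replace() uses $1 and $2. The function name swap_parts and the sample input are made up.

```python
import re

PATTERN = re.compile(r'([^.]*)\.(.*)')


def swap_parts(text):
    """Hypothetical helper: put the part after the first period in front."""
    # \1 is everything before the first period, \2 is the remainder.
    return PATTERN.sub(r'\2.\1', text)


print(swap_parts('file.ext'))  # ext.file
```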
This is a regular expression with some advanced use.
Consider a simpler version: /[^.]*\..*/ which is the same as above without parentheses. This will match just any string with at least one dot. When the parentheses are added, and a match happens, the variables \1 and \2 will contain the matched parts from the parentheses. The first one will have anything before the first dot. The second part will have everything after the first dot.
Examples:
input: foo...bar
\1: foo
\2: ..bar
input: .foobar
\1:
\2: foobar
This regular expression generates two matching expressions that can be retrieved.
The two parts are the string before the first dot (which may be empty), and the string after the first dot (which may contain other dots).
The only restriction on the input is that it contains at least one dot. It will match "." contrary to some of the other answers, but the retrieved groups will be empty.
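The examples above can be reproduced with a short Python sketch. The helper name split_at_first_dot is made up, and Python's re module is used here in place of the JavaScript engine.

```python
import re

PATTERN = re.compile(r'([^.]*)\.(.*)')


def split_at_first_dot(text):
    """Hypothetical helper: return (before, after) or None if there is no dot."""
    m = PATTERN.search(text)
    return m.groups() if m else None


print(split_at_first_dot('foo...bar'))  # ('foo', '..bar')
print(split_at_first_dot('.foobar'))    # ('', 'foobar')
print(split_at_first_dot('.'))          # ('', '')
print(split_at_first_dot('nodots'))     # None
```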
IMO /.*\..*/g Would do the same thing.
const senExample = 'I am test. Food is good.';
const result1 = senExample.match(/([^.]*)\.(.*)/g);
console.log(result1); // ["I am test. Food is good."]
const result2 = senExample.match(/^.*\..*/g);
console.log(result2); // ["I am test. Food is good."]
the . character matches any character except the line-break characters \r and \n.
the ^ at the start of a bracket expression negates what follows it (in this case the dot)
the * means "zero or more times"
the parentheses group and capture,
the \ allows you to match a special character (like the dot or the star)
so this ([^.]*) means any character other than a literal dot, repeated zero or more times (inside brackets the dot loses its special meaning).
this (.*) part means any string of characters zero or more times (except the line breaks)
and the \. means a real dot
so the whole thing would match zero or more non-dot characters followed by a dot followed by any number of characters.
For more information and a really great reference on Regular Expressions check out: http://www.regular-expressions.info/reference.html
It's a regular expression, which basically is a pattern of characters that is used to describe another pattern of characters. I once used regexps to find an email address inside a text file, and they can be used to find pretty much any pattern of text within a larger body of text provided you write the regexp properly.