Perl code understanding - regex

I am new to Perl and have been trying to understand the code below:
if ( $nextvalue !~ /^.+"[^ ]+ \/cs\/.+\sHTTP\/[1-9]\.[0-9]"|\/\/|\/Images\/fold\/1.jpg|\/busines|\/Type= OPTIONS|\/203.176.111.126/)
Can you please help me understand what the above is meant to do?

The condition is true when $nextvalue does NOT match the following regular expression.
The regular expression matches if the string
either
starts with at least one character,
followed by double quote sign ("),
followed by at least one non-space character,
followed by a space ( ),
followed by string "/cs/",
followed by at least one character,
followed by whitespace and string HTTP/,
followed by one of digits from 1 to 9 inclusive,
followed by dot
followed by one of digits from 0 to 9,
followed by double quote mark (")
or contains two forward slashes (//)
or contains substring "/Images/fold/1.jpg" (the dot is unescaped, so it stands for any single character)
or contains substring "/busines"
or contains substring "/Type= OPTIONS"
or contains substring "/203.176.111.126" (the dots are unescaped, so each stands for any single character)
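To see the alternation in action, here is a minimal sketch that ports the pattern to JavaScript (the question is about Perl, but the syntax used in this pattern behaves the same way in both). The sample lines are made up for illustration, and isKept is a hypothetical helper name that mirrors the !~ test.

```javascript
// Same pattern as in the question, written as a JavaScript regex literal.
const pattern = /^.+"[^ ]+ \/cs\/.+\sHTTP\/[1-9]\.[0-9]"|\/\/|\/Images\/fold\/1.jpg|\/busines|\/Type= OPTIONS|\/203.176.111.126/;

// Hypothetical helper mirroring Perl's `$nextvalue !~ /.../`:
// returns true when the line does NOT match the pattern.
function isKept(line) {
  return !pattern.test(line);
}

// Made-up sample lines, not taken from any real log.
const samples = [
  '10.0.0.1 - - "GET /cs/page.html HTTP/1.1" 200', // first (anchored) alternative
  'GET http://example.com/index.html',             // contains "//"
  'GET /business/plan.html',                       // contains "/busines"
  'GET /other/page.html',                          // none of the alternatives
];

for (const line of samples) {
  console.log(isKept(line), line);
}
```

Only the last sample passes the condition. Note that ^ belongs to the first alternative only; the other alternatives may match anywhere in the string.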

Whenever I am unsure what some cryptic regular expression does, I turn to Debuggex:
^.+"[^ ]+ \/cs\/.+\sHTTP\/[1-9]\.[0-9]"|\/\/|\/Images\/fold\/1.jpg|\/busines|\/Type= OPTIONS|\/203.176.111.126
Debuggex Demo
This is a railroad diagram. Every string that has a substring fitting the description along any of the grey tracks will match your regex. As your condition uses !~, meaning "does not match", those strings will then fail the check.
Debuggex certainly has issues (for example, it displays ^ as is, so you have to know that it means the beginning of the string; the same goes for dots and other metacharacters; whitespace shows up as underscores, etc.), but it helps in understanding the structure of the expression and possibly gives you an idea of what the author had in mind.

Related

Regular expression to match a word that contains ONLY one colon

I am new to regex. Basically, I'd like to check whether a word has ONLY one colon or not.
If it has two or more colons, it should return nothing.
If it has one colon, it should be returned as is. (The colon must be in the middle of the string, not at the end or the beginning.)
(1)
a:bc:de #return nothing or error.
a:bc #return a:bc
a.b_c-12/:a.b_c-12/ #return a.b_c-12/:a.b_c-12/
(2)
My attempts so far are below, but they seem too complicated.
^[^:]*(\:[^:]*){1}$
^[-\w.\/]*:[-\w\/.]* #this will not throw error when there are 2 colons.
Any directions would be helpful, thank you!
This will find such "words" within a larger sentence:
(?<= |^)[^ :]+:[^ :]+(?= |$)
See live demo.
If you just want to test the whole input:
^[^ :]+:[^ :]+$
To restrict to only alphanumeric, underscore, dashes, dots, and slashes:
^[\w./-]+:[\w./-]+$
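As a rough check, the patterns from this answer can be tried in JavaScript against the examples from the question. This is only a sketch: the variable names and the sample sentence are made up, the slash inside the bracket expression is escaped to keep the regex literal unambiguous, and the sentence variant relies on lookbehind, which older JavaScript engines do not support.

```javascript
// Whole-input test: exactly one colon, with at least one allowed
// character (word character, dot, slash or dash) on each side.
const wholeInput = /^[\w.\/-]+:[\w.\/-]+$/;

// Variant for finding such "words" inside a longer sentence.
const inSentence = /(?<= |^)[^ :]+:[^ :]+(?= |$)/g;

console.log(wholeInput.test('a:bc'));                  // true
console.log(wholeInput.test('a.b_c-12/:a.b_c-12/'));   // true
console.log(wholeInput.test('a:bc:de'));               // false (two colons)
console.log(wholeInput.test(':abc'));                  // false (colon at the start)
console.log(wholeInput.test('abc:'));                  // false (colon at the end)

// Made-up sentence: only the word with a single inner colon is found.
const found = 'use a:bc but not a:bc:de or :x here'.match(inSentence);
console.log(found);                                    // [ 'a:bc' ]
```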
I saw this as a good opportunity to brush up on my regex skills - so might not be optimal but it is shorter than your last solution.
This is the regex pattern: /^[^:]*:[^:]*$/gm and these are the strings I am testing against: 'oneco:on' (match) and 'one:co:on', 'oneco:on:', ':oneco:on' (these should all not match)
To explain what is going on, the ^ matches the beginning of the string, the $ matches the end of the string.
The [^:] bit says that any character that is not a colon will be matched.
In summary, ^[^:]* means that the string may start with any number of characters that are not a colon, and the : that follows requires exactly one colon at that point. Lastly, [^:]*$ means that any number (*) of characters can follow the colon, up to the end of the string, as long as none of them is a colon.
To elaborate: because the pattern is anchored at both the beginning and the end of the string, with only non-colon characters allowed around the single colon, only the first string 'oneco:on' is a match.
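A small JavaScript sketch of this pattern, tried without the g and m flags for simplicity. The plusVersion variant is an assumption added here (it is not part of the answer) to show one way of also rejecting a colon at the very start or end, which the question asked for.

```javascript
// Pattern from the answer: zero or more non-colons on each side.
const starVersion = /^[^:]*:[^:]*$/;

// Assumed variant: + requires at least one non-colon on each side.
const plusVersion = /^[^:]+:[^:]+$/;

console.log(starVersion.test('oneco:on'));   // true
console.log(starVersion.test('one:co:on'));  // false (two colons)
console.log(starVersion.test(':oneco'));     // true  (* allows an empty left side)
console.log(starVersion.test('oneco:'));     // true  (* allows an empty right side)

console.log(plusVersion.test('oneco:on'));   // true
console.log(plusVersion.test(':oneco'));     // false
console.log(plusVersion.test('oneco:'));     // false
```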

Check array syntax with Regex

I'm trying to create a regex that checks whether a string is a valid path for a Firestore document.
I want a regex that tests whether a string:
starts with a letter ^([a-z]{1})
after the first character, contains only letters/digits and/or dots \w*(.?\w+){0,}
may end with an array index (\[{1}\d+\]{1})?$
The first and second points work well, but the last group doesn't. When I test a string like data.images[11, the regex returns true.
First of all, you can shorten some quantifiers in your regex:
{1} -> can be ignored completely
{0,} -> *
Your second part could be expressed like this, which also helps readability:
[\w.]* meaning: take any character inside the brackets 0 to n-times. The bracket expression also supports predefined classes, so we are using \w here. The dot INSIDE the brackets doesn't need to be escaped, it simply means the one character dot.
So your parts would be:
^([a-z])
[\w.]*
(\[\d+\])?$
I hope this helps. According to regexpal it matches data.images[11], but not data.images[11. Also it seems to support all your demands.
EDIT:
Your second part doesn't work because (like Asocia stated in the answer) you would need to escape the dot. The dot itself is a class meaning "any character" (depending on regex engine and settings sometimes even line breaks). As you mean the dot as a character you need to escape it.
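The behaviour described above can be reproduced with a short JavaScript sketch. The variable names are made up, and escapedDot is one possible reading of the EDIT (the original second part with the dot escaped), not a pattern quoted from the answers.

```javascript
// Pattern from the question: the unescaped dot in (.?\w+) can stand for "[".
const original = /^([a-z]{1})\w*(.?\w+){0,}(\[{1}\d+\]{1})?$/;

// Pattern assembled from the three parts listed in the answer.
const revised = /^([a-z])[\w.]*(\[\d+\])?$/;

// Assumed variant: the question's second part with the dot escaped.
const escapedDot = /^[a-z]\w*(\.\w+)*(\[\d+\])?$/;

console.log(original.test('data.images[11'));    // true  (the reported problem)
console.log(revised.test('data.images[11'));     // false
console.log(revised.test('data.images[11]'));    // true
console.log(escapedDot.test('data.images[11'));  // false
console.log(escapedDot.test('data.images[11]')); // true

// The two fixes differ on consecutive dots: [\w.]* does not constrain
// where the dots appear, while (\.\w+)* wants word characters after each dot.
console.log(revised.test('a..b'));               // true
console.log(escapedDot.test('a..b'));            // false
```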

What is the difference between these three regular expressions

What is the main difference between the following 3 regular expressions?
1) /^[^0-9]+$/
2) /[^0-9]+/
3) m/[^0-9]+/
I am really trying to understand this, since researching online has not helped me much I was hoping I could find some help here.
All of them have [^0-9]+, which is one or more characters that are not one of the digits 0 through 9.
The first one /^[^0-9]+$/ is anchored at the start and end of the string, so it will match any non-empty string that contains only non-digits.
The second one /[^0-9]+/ is not anchored, so it matches any string that contains at least one non-digit.
The third one m/[^0-9]+/ is the same as the second, but uses the m// match operator explicitly.
For a good explanation, check out regex101.com for the first and second regex.
There's a difference between a regular expression and the match operator which takes a regular expression as its operand.
You only have two regular expressions there - ^[^0-9]+$ and [^0-9]+. Option 3 uses the same regex as option 2, but it uses a different version of the match operator.
The difference between 1 and 2 is that 1 is anchored at the start and the end of the string, whereas 2 isn't anchored at all.
So 1 says "match the start of the string, followed by one or more non-digits, followed by the end of the string". 2 says "match one or more non-digits anywhere in the string".
Does that help at all?
The pattern [^0-9] is common to these three regexes, and will match any single character that is not a decimal digit
/^[^0-9]+$/
This anchors the pattern to the beginning and end of the string, and insists that it contains one or more non-digit characters
The circumflex ^ is a zero-width anchor that matches the beginning of the string
The dollar sign $ is also a zero-width anchor that will match either at the end of the string, or before a newline character if that newline is the last in the string. So this will match "aaa" and "aaa\n" but not "aa7bb\n"
/[^0-9]+/
This has no anchors, and so will return true if the string contains at least one non-digit character anywhere
It will match "12x345" and fail to match "12345". Note that a trailing newline counts as a non-digit character, so this pattern will match "123\n"
m/[^0-9]+/
This is identical to #2, but with the m placed explicitly. This is unnecessary if you are using the default slashes for delimiters, but it can be convenient to use something different if you are matching a pattern for, say, a file path, which itself contains slashes
Using m lets you choose your own delimiter, for example m{/my/path} instead of /\/my\/path/
In essence, #1 is asking whether the string is wholly composed of non-digit characters, while #2 and #3 are identical, and test whether the string contains at least one non-digit character
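The examples from these answers can be checked with a short JavaScript sketch. JavaScript has no separate m// operator, so only the two distinct patterns are shown; the variable names are made up. In JavaScript the "aaa\n" case succeeds because the newline is itself a non-digit consumed by the class, so the result agrees with the Perl behaviour described above.

```javascript
const anchored = /^[^0-9]+$/;   // whole string must consist of non-digits
const unanchored = /[^0-9]+/;   // at least one non-digit somewhere

console.log(anchored.test('aaa'));       // true
console.log(anchored.test('aaa\n'));     // true
console.log(anchored.test('aa7bb\n'));   // false
console.log(anchored.test(''));          // false (+ needs at least one character)

console.log(unanchored.test('12x345'));  // true
console.log(unanchored.test('12345'));   // false
console.log(unanchored.test('123\n'));   // true (the newline is a non-digit)
```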

Understanding regex in shell

I came across single grouping concept in shell script.
cat employee.txt
101,John Doe,CEO
I was practising the sed substitute command and came across the example below.
sed 's/\([^,]*\).*/\1/g' employee.txt
It was stated that the above expression matches the string up to the 1st comma.
I am unable to understand how this matches up to the 1st comma.
Below is my understanding
s - substitute command
/ delimiter
\ escape character for (
( opening braces for grouping
^ beginning of the line - anchor
[^,] - I am confused by this; does it negate the comma or mean something else?
Why are * and then .* used to match the string up to the 1st comma?
^ matches beginning of line outside of a character class []. At the beginning of a character class, it means negation.
So, it says: non-comma ([^,]) repeated zero or more times (*) followed by anything (.*). The matching part of the string is replaced by the part before the comma, so it removes everything from the first comma onward.
I know 'link only' answers are to be avoided. Choroba has correctly pointed out that this is:
non-comma ([^,]) repeated zero or more times (*) followed by anything (.*). The matching part of the string is replaced by the part before the comma, so it removes everything from the first comma onward.
However I'd like to add that for this sort of thing, I find regulex quite a useful tool for visualising what's going on with a regular expression.
Given the string "foo, bar" and s/\([^,]*\).*/\1/g, the group \([^,]*\) means "match any character that is not a comma" (zero or more times). Since "f" is not a comma, it matches "f" and "remembers" it. Because it is "zero or more times", it tries again. The next character is not a comma either (it is o), so the regex engine adds that o to the group as well. The same thing happens for the 2nd o.
The next character is indeed a comma, but [^,] forbids it, as @choroba affirmed. What is in the group now is "foo". Then, the regex uses .* outside the group, which causes zero or more characters to be matched but not remembered.
In the replacement part of the substitution, \1 is used to place the contents of the remembered text ("foo"). The rest of the matched text is lost, and that is how you are left with only the text up to the first comma.
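The same substitution can be sketched in JavaScript to watch the group at work. This is a port, not the sed command itself: JavaScript writes the group as ( ) instead of \( \) and refers to it as $1 instead of \1. The second group in the last pattern is an addition made here only to display the part that gets discarded.

```javascript
// Port of: sed 's/\([^,]*\).*/\1/'
const line = '101,John Doe,CEO';
const firstField = line.replace(/([^,]*).*/, '$1');
console.log(firstField);   // 101

// Same idea on the "foo, bar" example, with an extra group around .*
// so that the discarded remainder can be printed as well.
const m = 'foo, bar'.match(/([^,]*)(.*)/);
console.log(m[1]);         // foo     (kept by $1)
console.log(m[2]);         // , bar   (matched, but dropped by the replacement)
```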

What does /([^.]*)\.(.*)/ mean?

While searching for something, I found an answered question on this site. Two of the answers contain
/([^.]*)\.(.*)/
in their answers.
The question is located at Find & replace jquery. I'm a newbie in JavaScript, so I wonder: what does it mean? Thanks.
/([^.]*)\.(.*)/
Let us deconstruct it. The beginning and trailing slash are delimiters, and mark the start and end of the regular expression.
Then there is a parenthesized group: ([^.]*). The parentheses group part of the pattern together and also capture whatever it matches. The square brackets denote a "character group", meaning that any character inside this group is accepted in its place. However, this group is negated by the first character being ^, which reverses its meaning. Since the only character besides the negation is a period, this matches a single character that is not a period. After the square brackets is a * (asterisk), which means that the square brackets can be matched zero or more times.
Then we get to the \.. This is an escaped period. Periods in regular expressions have special meaning (except when escaped or in a character group). This matches a literal period in the text.
(.*) is a new parenthesized sub-group. This time, the period matches any character, and the asterisk says it can be repeated as many times as needed.
In summary, the expression finds any sequence of characters that are not periods, followed by a single period, followed by any sequence of characters.
Edit: Removed part about shortening, as it defeats the assumed purpose of the regular expression.
It's a regular expression (it matches non-periods, followed by a period followed by anything (think "file.ext")). And you should run, not walk, to learn about them. Explaining how this particular regular expression works isn't going to help you as you need to start simpler. So start with a regex tutorial and pick up Mastering Regular Expressions.
Original: /([^.]*)\.(.*)/
Split this as:
[1] ([^.]*) : It says match all characters except . [ period ]
[2] \. : match a period
[3] (.*) : matches any character
so it becomes
[1]Match all characters which are not . [ period ] [2] till you find a .[ period ] then [3] match all characters.
Anything except a dot, followed by a dot, followed by anything.
You can test regexes on regexpal
It's a regular expression that roughly searches for a string that doesn't contain a period, followed by a period, and then a string containing any characters.
That is a regular expression. Regular expressions are powerful tools if you use them right.
That particular regex extracts filename and extension from a string that looks like "file.ext".
It's a regular expression that splits a string into two parts: everything before the first period, and then the remainder. Most regex engines (including the Javascript one) allow you to then access those parts of the string separately (using $1 to refer to the first part, and $2 for the second part).
This is a regular expression with some advanced use.
Consider a simpler version: /[^.]*\..*/ which is the same as above without parentheses. This will match just any string with at least one dot. When the parentheses are added, and a match happens, the variables \1 and \2 will contain the matched parts from the parentheses. The first one will have anything before the first dot. The second part will have everything after the first dot.
Examples:
input: foo...bar
\1: foo
\2: ..bar
input: .foobar
\1:
\2: foobar
This regular expression produces two captured groups that can be retrieved.
The two parts are the string before the first dot (which may be empty), and the string after the first dot (which may contain other dots).
The only restriction on the input is that it contain at least one dot. It will match "." contrary to some of the other answers, but the retrieved groups will be empty.
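The two groups can be inspected directly in JavaScript. In this sketch splitAtFirstDot is a hypothetical helper name; it returns the two captured parts, or null when the input has no dot.

```javascript
const re = /([^.]*)\.(.*)/;

// Hypothetical helper: [before first dot, after first dot], or null.
function splitAtFirstDot(text) {
  const m = text.match(re);
  return m ? [m[1], m[2]] : null;
}

console.log(splitAtFirstDot('file.ext'));   // [ 'file', 'ext' ]
console.log(splitAtFirstDot('foo...bar'));  // [ 'foo', '..bar' ]
console.log(splitAtFirstDot('.foobar'));    // [ '', 'foobar' ]
console.log(splitAtFirstDot('.'));          // [ '', '' ]
console.log(splitAtFirstDot('nodots'));     // null
```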
IMO /.*\..*/g would produce the same overall match (though, having no parentheses, it does not capture the two parts separately).
const senExample = 'I am test. Food is good.';
const result1 = senExample.match(/([^.]*)\.(.*)/g);
console.log(result1); // ["I am test. Food is good."]
const result2 = senExample.match(/^.*\..*/g);
console.log(result2); // ["I am test. Food is good."]
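The overall matches agree, but the captured parts would not if parentheses were added to the simpler pattern. The sketch below assumes a grouped version, (.*)\.(.*), which is not part of the answer, and uses made-up variable names to compare where each pattern splits the sentence.

```javascript
const text = 'I am test. Food is good.';

// [^.]* cannot cross a dot, so the split happens at the FIRST dot.
const atFirstDot = text.match(/([^.]*)\.(.*)/);
console.log(atFirstDot[1]);  // I am test
console.log(atFirstDot[2]);  //  Food is good.   (with a leading space)

// A greedy .* runs to the end and backtracks, so the split happens at the LAST dot.
const atLastDot = text.match(/(.*)\.(.*)/);
console.log(atLastDot[1]);   // I am test. Food is good
console.log(atLastDot[2]);   // (empty string)
```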
the . character matches any character except line break characters (\r or \n) when it appears outside square brackets.
the ^ at the start of a bracket expression negates the set that follows it (in this case the dot)
the * means "zero or more times"
the parentheses group and capture,
the \ allows you to match a special character (like the dot or the star) literally
so this ([^.]*) means any character other than a literal dot, repeated zero or more times (inside brackets the dot is just a dot, so line breaks are included too).
this (.*) part means any string of characters zero or more times (except the line breaks)
and the \. means a real dot
so the whole thing would match zero or more non-dot characters followed by a dot followed by any number of characters.
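A quick way to check how the dot behaves inside and outside square brackets is to test single characters in JavaScript. This is only a sketch; the checks object and its labels are made up for illustration.

```javascript
// Each entry pairs a description with the result of a one-character test.
const checks = {
  'dot outside brackets matches "a"':     /./.test('a'),      // true
  'dot outside brackets matches newline': /./.test('\n'),     // false
  'escaped dot matches "."':              /\./.test('.'),     // true
  'escaped dot matches "a"':              /\./.test('a'),     // false
  '[^.] matches "a"':                     /[^.]/.test('a'),   // true
  '[^.] matches "."':                     /[^.]/.test('.'),   // false
  '[^.] matches newline':                 /[^.]/.test('\n'),  // true
};

for (const [label, result] of Object.entries(checks)) {
  console.log(label + ': ' + result);
}
```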
For more information and a really great reference on Regular Expressions check out: http://www.regular-expressions.info/reference.html
It's a regular expression, which basically is a pattern of characters that is used to describe another pattern of characters. I once used regexps to find an email address inside a text file, and they can be used to find pretty much any pattern of text within a larger body of text provided you write the regexp properly.