KVP extraction but allow the separator in the value - regex

I have a string of key-value pairs separated by a space character (I believe there will only ever be one), where the value may itself contain spaces and other whitespace (e.g. newlines, tabs), e.g.
a=1 b=cat c=1 and 2 d=3
becomes:
a=1
b=cat
c=1 and 2
d=3
i.e. I want to extract all the pairs as groups.
I cannot figure out the regex. My sample doesn't include a newline, but that could also happen.
I've tried the basics like:
(.+?=.+?)
\s?([^\s]+)
but these fail with spaces and newlines. I'm also writing code around it, so I can tidy up any leading/trailing characters where needed; I'd just rather do it in regex than scan one character at a time.

You can use
([^\s=]+)=([\w\W]*?)(?=\s+[^\s=]+=|$)
See the regex demo. Details:
([^\s=]+) - Group 1: one or more chars other than whitespace and = char
= - a = char
([\w\W]*?) - Group 2: any zero or more chars, as few as possible
(?=\s+[^\s=]+=|$) - a positive lookahead that requires one or more whitespaces followed with one or more chars other than whitespace and = followed with = or end of string immediately to the right of the current location.
A better way to match any character than [\w\W] is to use . together with the singleline/dotall modifier (if supported, see How do I match any character across multiple lines in a regular expression?). Here is an example:
(?s)([^\s=]+)=(.*?)(?=\s+[^\s=]+=|$)
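As a minimal sketch, assuming Python's re module (the question does not name a language), the dotall variant can be applied with re.findall. The helper name extract_pairs is made up for illustration:

```python
import re

# Pattern from the answer above; (?s) lets . match newlines as well.
KVP_RE = re.compile(r"(?s)([^\s=]+)=(.*?)(?=\s+[^\s=]+=|$)")

def extract_pairs(text):
    """Return a list of (key, value) tuples; values may contain whitespace."""
    return KVP_RE.findall(text)

print(extract_pairs("a=1 b=cat c=1 and 2 d=3"))
# [('a', '1'), ('b', 'cat'), ('c', '1 and 2'), ('d', '3')]
```

The lookahead ends each value just before the next "whitespace, key, =" sequence, so the separating whitespace is never part of a value and no trimming is needed afterwards.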

Related

Allow max one of each special char with one exception for single quotes

I was wondering if it is possible to simplify the following regular expression
^(?!.*([,\._\- ]).*\1)(?!.*[',\._\- ]{2})(?!.*(['])(?:.*\2){2})[',\._\- \p{L}]+$
Regex Demo
Constraints
All of these characters can appear a maximum of one time each: _-.
The last character is a white space.
This character can appear a maximum of two times each: '
All special characters here can not appear consecutively.
Details
(?!.*([,\._\- ]).*\1) - There cannot be 2 or more occurrences of any one character in _-.
(?!.*[',\._\- ]{2}) - No char in the special char group can appear consecutively.
(?!.*(['])(?:.*\2){2}) - Single quote special char cannot appear 3 or more times.
You might write the pattern using a single capture group in combination with negated character classes making use of contrast
You don't have to create a capture group ([']) with a backreference \2 to repeat a single quote 3 times; you can just repeat that char 3 times.
As there do not seem to be ' at the start or at the end, you can use a repeating pattern \p{L}+(?:[',._ -]\p{L}+)* to not have consecutive chars that you don't want.
Note that you don't have to escape the ' and _ in the character class, and you can move the - to the end so that you don't have to escape it.
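^(?!.*([,._ -]).*\1)(?!(?:[^'\n]*'){3})\p{L}+(?:[',._ -]\p{L}+)*$
Explanation
^ Start of string
(?!.*([,._ -]).*\1) Assert not 2 of the same chars [,._ -]. The leading .* is needed here: a negated class such as [^,._ \n-]* would stop at the first special character, so only that one would be checked for a repeat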
(?!(?:[^'\n]*'){3}) Assert not 3 times '
\p{L}+ Match 1+ times any kind of letter
(?:[',._ -]\p{L}+)* Optionally repeat one of [',._ -] and again 1+ times any kind of letter
$ End of string
Regex demo
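A rough way to try these constraints out, assuming Python's re module. Python's re does not support \p{L}, so [^\W\d_] is swapped in as an approximation of "any letter". The duplicate check is written with a leading .* so that every character of [,._ -] is checked for a repeat, not only the first one encountered. The names LETTER, NAME_RE and is_valid are made up for this sketch:

```python
import re

# [^\W\d_] approximates \p{L}, which Python's re module does not support.
LETTER = r"[^\W\d_]"

NAME_RE = re.compile(
    r"^(?!.*([,._ -]).*\1)"               # no character of [,._ -] occurs twice
    r"(?!(?:[^'\n]*'){3})"                # fewer than three single quotes
    rf"{LETTER}+(?:[',._ -]{LETTER}+)*$"  # specials only between runs of letters
)

def is_valid(text):
    return NAME_RE.match(text) is not None

for sample in ["Jean-Luc O'Neil", "O'Neil's", "a-b-c", "a.b-c-d", "a'b'c'd", "a-.b"]:
    print(repr(sample), is_valid(sample))
```

The first two samples are accepted; the others are rejected for a repeated -, three quotes, or consecutive special characters.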

Regex for 5-7 characters, or 6-8 if including a space (no special characters allowed)

I am trying to create a regex for some basic postcode validation. It doesn't need to provide full validation (in my usage it's fine to miss out the space, for example), but it does need to check for the number of characters being used, and also make sure there are no special characters other than spaces.
This is what I have so far:
^[\s.]*([^\s.][\s.]*){5,7}$
This mostly works, but it has two flaws:
It allows for ANY character, rather than just alphanumeric characters + spaces
It allows for multiple spaces to be inserted:
I have tried updating it as follows:
^[\s.]*([a-zA-Z0-9\s.][\s.]*){5,7}$
This seems to have fixed the character issue, but still allows multiple spaces to be inserted. For example, this should be allowed:
AB14 4BA
But this shouldn't:
AB1 4 4BA
How can I modify the code to limit the number of spaces to a maximum of one (it's fine to have none at all)?
With your current set of rules you could say:
^(?:[A-Za-z0-9]{5,7}|(?=.{6,8}$)[A-Za-z0-9]+\s[A-Za-z0-9]+)$
See an online demo
^ - Start-line anchor;
(?: - Open non-capture group for alternations;
[A-Za-z0-9]{5,7} - Just match 5-7 alphanumeric chars;
| - Or;
(?=.{6,8}$) - Positive lookahead to assert that the position is followed by exactly 6-8 characters up to the end-line anchor;
[A-Za-z0-9]+\s[A-Za-z0-9]+ - Match 1+ alphanumeric chars on either side of the whitespace character;
)$ - Close non-capture group and match the end-line anchor.
Alternatively, use a negative lookahead to reject more than one whitespace character, or a whitespace character at the start:
^(?!\S*\s\S*\s|\s)(?:\s?[A-Za-z0-9]){5,7}$
See an online demo where I replaced \s with [^\S\n] for demonstration purposes. Also, although it is the shorter expression, the latter will take more steps to evaluate the input.
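Both expressions can be compared side by side. This is only a sketch, assuming Python's re module; the names ALTERNATION_RE, LOOKAHEAD_RE and check are made up for illustration:

```python
import re

# The two patterns from the answer above, copied unchanged.
ALTERNATION_RE = re.compile(
    r"^(?:[A-Za-z0-9]{5,7}|(?=.{6,8}$)[A-Za-z0-9]+\s[A-Za-z0-9]+)$"
)
LOOKAHEAD_RE = re.compile(r"^(?!\S*\s\S*\s|\s)(?:\s?[A-Za-z0-9]){5,7}$")

def check(postcode):
    """Return (alternation result, lookahead result) for one input."""
    return (
        ALTERNATION_RE.match(postcode) is not None,
        LOOKAHEAD_RE.match(postcode) is not None,
    )

for sample in ["AB14 4BA", "AB144BA", "AB1 4 4BA", "AB1", "AB1$4BA"]:
    print(repr(sample), check(sample))
```

On these samples the two patterns agree: the first two inputs are accepted and the remaining three are rejected.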

Remove characters from regex query

I have trouble understanding why my regex matches one extra character besides the symbols I have told it to include. This is my regex:
([\-:, ]{1,})[^0-9]
This is my test text:
Test- Product-: 1 --- 3 hour ,--kayak:--rental
It always includes the first character of each starting word, like P on Product or h on hour, how can I prevent regex from including those first characters?
I am trying to get all dashes, colons, commas and spaces, excluding numbers or any other characters.
The [^0-9] part of your regex matches any char but a digit, so you should remove it from your pattern.
There is no need to wrap the character class with a capturing group, and {1,} is equal to +, so the whole regex can be shortened to
[-:, ]+
Note that - in the initial and end positions inside a character class does not have to be escaped.
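For illustration, here is the shortened pattern run against the test text from the question. Python's re module is an assumption (the question does not name a language), and the helper name separator_runs is made up:

```python
import re

def separator_runs(text):
    """Return every run of dashes, colons, commas and spaces."""
    return re.findall(r"[-:, ]+", text)

text = "Test- Product-: 1 --- 3 hour ,--kayak:--rental"
print(separator_runs(text))
# ['- ', '-: ', ' --- ', ' ', ' ,--', ':--']
```

None of the runs contains a letter or a digit, because nothing in the pattern can consume one any more.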

How do you specify multiples in negative character classes in regular expressions?

I am trying to write a regular expression to search for anything but digits or the * or - characters, with one caveat. Where I'm hitting a wall is that I need to allow runs of three or fewer digits to be found but not four or more, though even a single * or - shouldn't be found.
This is what I have so far (for three matches):
.*?([^0-9\*-]+).*?([^0-9\*-]+).*?([^0-9\*-]+).*?
I have no idea where to insert {4,} for the digits (I've tried and it doesn't seem to work anywhere) or how to change it to do as I want.
For instance, in "Jack has* 777 1883874 -sheep-" I'd like it to return "Jack has 777 sheep". Or in "2343klj-3***.net" I'd like it to return "klj 3 .net"
You may use the following regex (replacing with a literal space, " "):
(?:[-*\s]|\d{4,})+
See the regex demo. Each match, i.e. each run of separators and long numbers, is replaced with a single space.
Details
(?:[-*\s]|\d{4,})+ - a non-capturing group matching one or more consecutive repetitions of
[-*\s] - a single whitespace, - or * char
| - or
\d{4,} - 4+ digits.
Next, to remove all leading and trailing whitespace you may use
^\s+|\s+$
and replace with an empty string. ^\s+ matches 1+ whitespaces at the start of the string and \s+$ matches 1+ whitespaces at the end of the string.
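The two replacements can be chained. A minimal sketch, assuming Python's re module (the question does not say which tool is used); the function name clean is made up:

```python
import re

def clean(text):
    # Step 1: collapse each run of -, *, whitespace and 4+ digit numbers
    # into a single space.
    collapsed = re.sub(r"(?:[-*\s]|\d{4,})+", " ", text)
    # Step 2: remove leading and trailing whitespace.
    return re.sub(r"^\s+|\s+$", "", collapsed)

print(clean("Jack has* 777 1883874 -sheep-"))  # Jack has 777 sheep
print(clean("2343klj-3***.net"))               # klj 3 .net
```

In Python the second step could also be written as collapsed.strip(); the regex form is kept here to mirror the answer.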
With the help here, this is what works. It may be impossible to do it all in one regex, because no spaces are wanted at the beginning and end, but a space is needed between each remaining group.
First, a find and replace using ([-*\h]|\d{4,})+, replacing with a space.
Second, a find and replace using ^\s*(.*?)\s*$, replacing with the first capture group. The group has to be lazy (.*?); a greedy (.*) would swallow the trailing whitespace itself.

Match 3 and 4 delimiters and between them; not less not more

I have a command-line program whose first argument ( = argv[ 1 ] ) is a regex pattern.
./program 's/one-or-more/anything/gi/digit-digit'
So I need a regex to check whether the input entered by the user is correct. This could be solved easily, but I use the C++ standard library and std::regex_match, and this function implicitly anchors the pattern at the beginning and end of the given string (as if with ^ and $), so the non-greedy quantifier is effectively ignored.
Let me clarify. If I want to match /anything/ then I can use /.*?/, but std::regex_match treats this pattern as ^/.*?/$, and therefore if the user enters /anything/anything/anything/, std::regex_match still returns true even though the input is not correct. std::regex_match only returns true or false, and the input expected from the user can only be text that follows the pattern. Since the input varies, I cannot list all possibilities here, but I will give some examples.
Should match
/.//
s/.//
/.//g
/.//i
/././gi
/one-or-more/anything/
/one-or-more/anything/g/3
/one-or-more/anything/i
/one-or-more/anything/gi/99
s/one-or-more/anything/g/4
s/one-or-more/anything/i
s/one-or-more/anything/gi/54
and anything look like this pattern
Rules:
delimiters are /|##
s letter at the beginning and g, i and 2 digits at the end are optional
std::regex_match returns true if the entire target character sequence can be matched, otherwise it returns false
between first and second delimiter can be one-or-more +
between second and third delimiter can be zero-or-more *
between third and fourth can be g or i
At least 3 delimiters must be matched, as in /.//, not fewer, so /./ should not match
ECMAScript 262 is allowed for the pattern
NOTE
You may need to see my question about std::regex_match:
std::regex_match and lazy quantifier with strange behavior
I don't need any C++ code, I just need a pattern.
Do not try d?([/|##]).+?\1.*?\1[gi]?[gi]?\1?d?\d?\d?. It fails.
My attempt so far: ^(?!s?([/|##]).+?\1.*?\1.*?\1)s?([/|##]).+?\2.*?\2[gi]?[gi]?\d?\d?$
If you are willing to try, you should put ^ and $ around your pattern
If you need more details, please leave a comment and I will update the question.
Thanks.
You could use this regular expression:
^s?([/|##])((?!\1).)+\1((?!\1).)*\1((gi?|ig)(\1\d\d?)?|i)?$
See regex101.com
Note how this also rejects these cases:
///anything/
/./anything/gg
/./anything/ii
/./anything/i/12
How it works:
Some explanation of the parts that are different:
((?!\1).): this will match any character that is not the delimiter. This way you can keep track of the exact number of delimiters used. It also prevents the first character after the first delimiter from being that delimiter again, which should not be allowed.
(gi?|ig): matches any of the valid modifier combinations, except a sole i, which is treated separately. So this also excludes gg and ii as valid character sequences.
(\1\d\d?)?: optionally allows for an extra delimiter (after a g modifier -- see previous) to be added with one or two digits following it.
(...|i)?: the final alternative covers the case where no g modifier is present, just the i or nothing at all; then no digits are allowed to follow.
This is a tricky one, but I took the challenge - here is what I have ended up with:
^s?([\/|##])(?:(?!\1).)+\1(?:(?!\1).)*\1(?:i|(?:gi?|ig)(\1\d{1,2})?)?$
Pattern breakdown:
^ matches start of string
s? matches an optional 's' character
([\/|##]) matches the delimiter characters and captures as group 1
(?:(?!\1).)+ matches anything other than the delimiter character one or more times (uses negative lookahead to make sure that the character isn't the delimiter matched in group 1)
\1 matches the delimiter character captured in group 1
(?:(?!\1).)* matches anything other than the delimiter character zero or more times
\1 matches the delimiter character captured in group 1
(?: starts a new group
i matches the i character
| or
(?:gi?|ig) matches either g, gi, or ig
(\1\d{1,2})? followed by an optional extra delimiter and 0-9 once or twice
)? closes group and makes it optional
$ matches end of string
I have used non-capturing groups throughout - these are the groups that start with (?:
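The question asks for a pattern only, but both answers can be checked quickly outside C++. The sketch below assumes Python's re module, whose fullmatch requires the whole string to match in the same way std::regex_match does. Both patterns use only syntax that Python also accepts. The names PATTERN_A, PATTERN_B and check are made up for illustration:

```python
import re

# The two patterns from the answers above, copied unchanged.
PATTERN_A = re.compile(r"^s?([/|##])((?!\1).)+\1((?!\1).)*\1((gi?|ig)(\1\d\d?)?|i)?$")
PATTERN_B = re.compile(r"^s?([\/|##])(?:(?!\1).)+\1(?:(?!\1).)*\1(?:i|(?:gi?|ig)(\1\d{1,2})?)?$")

def check(expr):
    """Return (pattern A result, pattern B result) for one input."""
    return (
        PATTERN_A.fullmatch(expr) is not None,
        PATTERN_B.fullmatch(expr) is not None,
    )

SHOULD_MATCH = [
    "/.//", "s/.//", "/.//g", "/.//i", "/././gi",
    "/one-or-more/anything/", "/one-or-more/anything/g/3",
    "/one-or-more/anything/i", "/one-or-more/anything/gi/99",
    "s/one-or-more/anything/g/4", "s/one-or-more/anything/i",
    "s/one-or-more/anything/gi/54",
]
SHOULD_FAIL = [
    "/./", "///anything/", "/./anything/gg", "/./anything/ii",
    "/./anything/i/12", "/anything/anything/anything/",
]

for expr in SHOULD_MATCH + SHOULD_FAIL:
    print(expr, check(expr))
```

On these samples the two patterns give the same verdicts; the only practical difference is that the second one creates fewer capture groups.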