Shortened URL with my current regex in RegexPal:
http://bit.ly/1jbOFGd
I have a line of key=value pairs, space delimited. Some values contain spaces and punctuation so I do a positive lookahead to check for the existence of another key.
I want to tokenize the key and value, which I later convert to a dict in Python.
My guess is that I can speed this up by getting rid of .*? but how? In Python I convert 10,000 of these lines in 4.3 seconds. I'd like to double or triple that speed by making this regex match more efficient.
Update:
(?<=\s|\A)([^\s=]+)=(.*?)(?=(?:\s[^\s=]+=|$))
I would think this one is more efficient than yours (even though it still uses .*? for the value, its lookahead is nowhere near as complex and doesn't use a lazy modifier), but I'll need you to test. This does the same as my original expression, but handles values differently. It uses a lazy .*? match followed by a lookahead that is either a space, followed by a key, followed by an =, OR the end of the string. Notice I always define a key as [^\s=]+, so keys cannot contain an equals sign or whitespace (being this specific helps us avoid lazy matches).
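As a rough sketch of how the updated expression could be used from Python: the built-in re module only accepts fixed-width lookbehinds, so here (?<!\S) ("not preceded by a non-space character") stands in for (?<=\s|\A). That substitution, the parse_line name, and the sample line are my own assumptions, not part of the answer above.

```python
import re

# Assumption: (?<!\S) replaces (?<=\s|\A), which Python's re rejects
# because the two alternatives have different widths.
PAIR = re.compile(r"(?<!\S)([^\s=]+)=(.*?)(?=\s[^\s=]+=|$)")

def parse_line(line):
    """Return a dict built from the key=value pairs found in one line."""
    # findall yields (key, value) tuples because the pattern has two groups.
    return dict(PAIR.findall(line))

sample = "name=John Smith city=New York, NY zip=10001"
print(parse_line(sample))
```

Compiling the pattern once, outside the loop over lines, is usually worth doing before measuring anything else.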
Source
Original:
Are there some rules I am missing that would rule out something as simple as this?
(?<=\s|\A)([^=]+)=([\S]+)
This starts with a lookbehind of either a space character (\s) or the beginning of the string (\A). Then we match everything except =, followed by a =, and match everything except whitespace (\s).
"Lookbehind" (related to 'lookahead' and 'lookaround') is the key 'regular expression' concept to read up on here - it lets you match and skip individual components of the string.
Good examples here: http://www.rexegg.com/regex-lookarounds.html.
Related
I'm trying to find a regular expression for a Tokenizer operator in Rapidminer.
Now, what I'm trying to do is to split text in parts of, let's say, two words.
For example, That was a good movie. should result in That was, was a, a good, good movie.
What's special about a regex in a tokenizer is that it plays the role of a delimiter, so you match the splitting point and not what you're trying to keep.
Thus the first thought is to use \s in order to split on white spaces, but that would result in getting each word separately.
So, my question is how could I force the expression to somehow skip one in two whitespaces?
First of all, we can use the \W for identifying the characters that separate the words. And for removing multiple consecutive instances of them, we will use:
\W+
Having that in mind, you want to split every 2 instances of characters that are included in the "\W+" expression. Thus, the result must be strings that have the following form:
<a "word"> <separators that are matched by the pattern "\W+"> <another "word">
This means that each token you get from the split you are asking for will have to be further split using the pattern "\W+", in order to obtain the 2 "words" that form it.
For doing the first split you can try this formula:
\w+\W+\w+\K\W+
Then, for each token you have to tokenize it again using:
\W+
For getting tokens of 3 "words", you can use the following pattern for the initial split:
\w+\W+\w+\W+\w+\K\W+
This approach makes use of the \K feature, which drops everything matched up to that point from the reported match, so that only what follows it is returned. So essentially, we do: match a word, match separators, match another word, forget everything, then match separators and return only those.
In RapidMiner, this can be implemented with 2 consecutive regex tokenizers, the first with the above formula and the second with only the separators to be used within each token (\W+).
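For anyone who wants to try the two-step idea outside RapidMiner: Python's built-in re module has no \K, so the sketch below swaps in a different technique. Instead of splitting on the separator after each pair, it matches each "word, separators, word" chunk directly and then splits every chunk on \W+. The function name is hypothetical, and unlike the split-based approach this version drops a trailing unpaired word.

```python
import re

def two_word_tokens(text):
    """Return non-overlapping word pairs as [first, second] lists."""
    # Step 1: each match covers word + separators + word; matching resumes
    # after the match, so the pairs do not overlap.
    chunks = re.findall(r"\w+\W+\w+", text)
    # Step 2: split every chunk on the separators to get its two words.
    return [re.split(r"\W+", chunk) for chunk in chunks]

print(two_word_tokens("That was a good movie."))
```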
Also note that the pattern \w selects only Latin characters, so if your documents contain text in a different character set, these characters will be consumed by the \W, which is supposed to match the separators. If you want to capture text with non-Latin character sets, like Greek for example, you need to change the formula like this:
\p{L}+\P{L}+\p{L}+\K\P{L}+
Furthermore, if you want the formula to capture text in one language and not in another, you can modify it accordingly by specifying {Language_Identifier} in place of {L}. For example, if you only want to capture text in Greek, you will use "{Greek}", or "{InGreek}", which is what RapidMiner supports.
What you can do is use a zero width group (like a positive look-ahead, as shown in example). Regex usually "consumes" characters it checks, but with a positive lookahead/lookbehind, you assert that characters exist without preventing further checks from checking those letters too.
This should work for your purposes:
(\w+)(?=(\W+\w+))
The pattern above matches once for each pair of adjacent words (note that the last word gets no match of its own, since it has no following word to pair with). The first word is in the first capture group, (\w+). Then a positive lookahead matches a sequence of non-word characters, \W+, followed by another run of word characters, \w+. Because the second word is matched inside the lookahead (?=...), it is not "consumed" and can start the next match.
Here is a link to a demo on Regex101
Note that for each match, the first word is in capture group 1, and the separators plus the second word are in capture group 2.
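A minimal Python sketch of the lookahead pattern, in case it helps to see the overlapping pairs come out. The function name is made up for illustration, and stripping the leading separators from group 2 is my own addition so that each pair reads as "first second".

```python
import re

PAIR = re.compile(r"(\w+)(?=(\W+\w+))")

def overlapping_pairs(text):
    """Return every pair of adjacent words, e.g. 'That was', 'was a', ..."""
    pairs = []
    for match in PAIR.finditer(text):
        first = match.group(1)
        # Group 2 holds the separators plus the next word; drop the separators.
        second = re.sub(r"^\W+", "", match.group(2))
        pairs.append(first + " " + second)
    return pairs

print(overlapping_pairs("That was a good movie."))
```

Only group 1 is consumed by each match, which is why the second word of one pair is still available as the first word of the next.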
Here is an example solution, (?=(\b[A-Za-z]+\s[A-Za-z]+)) inspired from this SO question.
My question sounds wrong once you understand that it is a problem of overlapping regex matches.
Could you please help me define a regex that would:
match the word r'(\d+_\d\d\d(?:_back)?)'
"word" means that it shouldn't be preceded or followed by anything except for the proper punctuation signs or beginning/end of string/line
work in multiline strings, anywhere in the strings, and in strings consisting only of this pattern and nothing else
not match in %96_175 and 44_5555 (because neither the % nor the 4th "5" are punctuation characters).
Examples:
Pass (12_345, 012_345, or 012_345_back is the found group):
['12_345',
'bla-bla 012_345',
'bla-bla 12_345 bla-bla',
'34\n012_345',
'012_345\n34',
'text—012_345—text',
'text--12_345, text',
'text. 012_345_back.']
Fail (no match here):
[
'text12_345',
'12_345text',
'12_3456',
'%12_345',
'!12_345',
'.12-345',
'12_345_front'
]
What I am trying to distinguish is the proper identifier of the form \d+_\d\d\d(?:_back)?, inserted by a user in a comment on my website, from the same string being part of another string. The simple regex worked until someone inserted a link to a Wikipedia article ending with "№_175", which was URL-encoded to %E2%84%96_175, so that "96_175" matched my pattern.
I've got stuck at trying to match the "proper punctuation signs" or the beginning or end of string or line in a string. And by then the regex was already so complex (I was listing all reasonable unicode punctuation characters I could think of) that I thought I was doing something wrong. I also have difficulties excluding extra digits but including possible end of line or string.
Depending on how you need to handle (or not handle) symbols that are neither letters nor proper punctuation, you can either rely on the word boundary \b in Python's re (as suggested by one of the answers) or enumerate the 'proper' punctuation marks in opening and closing non-capturing groups.
If your regex engine offers a punctuation class (written \p below), you could use
(?:\p*|^|\s)(\d+_\d\d\d)(_back)?(?:\n|\p|$|\s)
Python's built-in re module has no such class, so there
just replace \p with a character class built from string.punctuation, along the lines of
https://stackoverflow.com/a/37708340/5874981
As a start, assuming that only the full stop, comma, and hyphen count as sufficiently 'proper' punctuation, try
(?:^|\s|\.|,|-)(\d+_\d\d\d)(_back)?(?:$|\s|\.|,|-)
I'm not sure if I'm misunderstanding the question, but if the only problem you're having is matching a whole word and ignoring any surrounding characters other than the ones you want, I'd suggest trying the regex word boundary, \b.
So your regular expression would be \b\d+_\d\d\d(?:_back)?\b
Give it a try and tell me if that's what you need.
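One way to compare the enumerated-punctuation pattern and the \b pattern is to run both over the sample lists from the question. This is only a test-harness sketch: the constant and function names are made up, and both patterns are copied as written in the answers.

```python
import re

ENUMERATED = re.compile(r"(?:^|\s|\.|,|-)(\d+_\d\d\d)(_back)?(?:$|\s|\.|,|-)")
WORD_BOUNDARY = re.compile(r"\b\d+_\d\d\d(?:_back)?\b")

# Sample strings copied from the question.
PASS = ['12_345', 'bla-bla 012_345', 'bla-bla 12_345 bla-bla', '34\n012_345',
        '012_345\n34', 'text—012_345—text', 'text--12_345, text',
        'text. 012_345_back.']
FAIL = ['text12_345', '12_345text', '12_3456', '%12_345', '!12_345',
        '.12-345', '12_345_front']

def disagreements(pattern):
    """Return (missed, wrongly_matched) relative to the question's lists."""
    missed = [s for s in PASS if not pattern.search(s)]
    wrongly_matched = [s for s in FAIL if pattern.search(s)]
    return missed, wrongly_matched

print(disagreements(ENUMERATED))
print(disagreements(WORD_BOUNDARY))
```

On these samples the enumerated pattern misses only the em-dash case (that character is not in its punctuation list), while the \b pattern finds every "pass" sample but also matches '%12_345' and '!12_345', because % and ! are non-word characters.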
This is my RegEx:
"^[^\.]([\w-\!\#\$\%\&\'\*\+\-\/\=\`\{\|\}\~\?\^]+)([\.]{0,1})([\w-\!\#\$\%\&\'\*\+\-\/\=\`\{\|\}\~\?\^]+)[^\.]#((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([\w-]+\.)+))([a-zA-Z]{2,6}|[0-9]{1,3})(\]?)$"
I need to match only strings less than 255 characters.
I've tried adding a length lookahead at the start of the RegEx, but it fails:
"^(?=.{1,254})[^\.]([\w-\!\#\$\%\&\'\*\+\-\/\=\`\{\|\}\~\?\^]+)([\.]{0,1})([\w-\!\#\$\%\&\'\*\+\-\/\=\`\{\|\}\~\?\^]+)[^\.]#((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([\w-]+\.)+))([a-zA-Z]{2,6}|[0-9]{1,3})(\]?)$"
You need the $ in the lookahead to make sure it's only up to 254. Otherwise, the lookahead will match even when there are more than 254.
(?=.{1,254}$)
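To see the difference the $ makes, here is a small Python sketch. It assumes a deliberately simplified stand-in pattern (\w+) in place of the full address regex, so only the length check is being exercised; the variable names are hypothetical.

```python
import re

# Hypothetical stand-in for the long address pattern: one or more word chars.
UNANCHORED = re.compile(r"^(?=.{1,254})\w+$")   # lookahead without $
ANCHORED = re.compile(r"^(?=.{1,254}$)\w+$")    # lookahead with $

long_input = "a" * 300
short_input = "a" * 254

# Without $, the lookahead is satisfied by the first 254 characters,
# so the 300-character string still matches.
print(UNANCHORED.match(long_input) is not None)
# With $, the lookahead must reach the end of the string within 254 characters.
print(ANCHORED.match(long_input) is not None)
print(ANCHORED.match(short_input) is not None)
```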
Also, keep in mind that you can greatly simplify your regex because many characters that would usually need to be escaped do not need to when in a character class (square brackets).
"[\w-\!\#\$\%\&\'\*\+\-\/\=\`\{\|\}\~\?\^]"
is the same as this:
"[-\w!#$%&'*+/=`{|}~?^]"
Note that the dash must be first (or last) in the character class to be a literal dash without escaping, and the caret must not be first.
With some other simplifications, here is the complete string:
"^(?=.{1,254}$)[-\w!#$%&'*+/=`{|}~?^]+(\.[-\w!#$%&'*+/=`{|}~?^]+)*#((\d{1,3}\.){3}\d{1,3}|([-\w]+\.)+[a-zA-Z]{2,6})$"
Notes:
I removed the stipulation that the first char shouldn't be a period ([^.]) because the next character class doesn't match a period anyway, so it's redundant.
I removed many extraneous parens
I replaced [0-9] with \d
I replaced {0,1} with the shorthand "?"
After the # sign, it seemed that you were trying to match an IP address or text domain name, so I separated them more so it couldn't be a combination
I'm not sure what the optional square bracket at the end was for, so I removed it: "(]?)"
I tried it in Regex Hero, and it works. See if it works for you.
This depends on what language you are working in. In Python, for example, you can use a regex to split a text into separate strings, and then use len() to discard any string that exceeds the length limit you want.
I think this post will help. It shows how to limit certain patterns but I am not sure how you would add it to the entire regex.
I need a single regex that checks that the input is not empty and contains only alphanumeric characters.
I know the alphanumeric part, ^[\s+0-9a-zA-Z]+$, but I am not sure about the not-empty requirement.
I can only use a single expression and I can't use any language function.
Simply use this regex to match a non-empty alphanumeric string:
^[a-zA-Z0-9]+$
Details
^ - string start
[a-zA-Z0-9]+ - one or more letters or digits
$ - string end.
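The question rules out language functions, so the following is only a quick way to check the pattern's behaviour, not part of the solution. It is a minimal Python sketch with a made-up helper name.

```python
import re

NON_EMPTY_ALNUM = re.compile(r"^[a-zA-Z0-9]+$")

def is_valid(value):
    """True if value is one or more ASCII letters or digits."""
    return NON_EMPTY_ALNUM.match(value) is not None

print(is_valid("abc123"))   # letters and digits only
print(is_valid(""))         # empty: + needs at least one character
print(is_valid("abc 123"))  # the space is not in the character class
```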
I'm going to assume that by "not empty" you mean "not only whitespace"; otherwise you've already got the answer you want, since + means one or more.
^[a-zA-Z0-9][a-zA-Z0-9\s]*$
will make sure that the string has something other than white space in it.
Additionally if \s is valid then I assume \w is as well, meaning that this could more easily be said as
^\w(?:\w|\s)*$
The ?: in the ( ) makes it a non-capturing group. If you don't care about captures, it can be omitted, making the pattern very terse:
^\w(\w|\s)*$
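A short Python sketch for trying out the two whitespace-tolerant variants, with the first one written with a closing $ anchor. The constant names are hypothetical. One difference worth knowing: \w also matches the underscore, so the terse version accepts strings the explicit character class rejects.

```python
import re

ALNUM_WITH_SPACES = re.compile(r"^[a-zA-Z0-9][a-zA-Z0-9\s]*$")
WORD_WITH_SPACES = re.compile(r"^\w(\w|\s)*$")

for sample in ["abc 123", "   ", "", " abc", "a_b"]:
    print(repr(sample),
          ALNUM_WITH_SPACES.match(sample) is not None,
          WORD_WITH_SPACES.match(sample) is not None)
```

Both variants require the first character to be a non-space, so a string with leading whitespace is rejected as well.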
I need a regex that will match strings of letters that do not contain two consecutive dashes.
I came close with this regex that uses lookaround (I see no alternative):
([-a-z](?<!--))+
Which given the following as input:
qsdsdqf--sqdfqsdfazer--azerzaer-azerzear
Produces three matches:
qsdsdqf-
sqdfqsdfazer-
azerzaer-azerzear
What I want however is:
qsdsdqf-
-sqdfqsdfazer-
-azerzaer-azerzear
So my regex loses the first dash, which I don't want.
Who can give me a hint or a regex that can do this?
This should work:
-?([^-]-?)*
It makes sure that there is at least one non-dash character between every two dashes.
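Here is a minimal Python sketch of that pattern applied to the input from the question. The function name is hypothetical. Because every part of the pattern is optional, it can also match the empty string, so the sketch filters empty matches out.

```python
import re

NO_DOUBLE_DASH = re.compile(r"-?([^-]-?)*")

def dash_chunks(text):
    """Return the maximal runs of text that contain no two consecutive dashes."""
    # group(0) is the whole match; skip the empty match found at the end.
    return [m.group(0) for m in NO_DOUBLE_DASH.finditer(text) if m.group(0)]

print(dash_chunks("qsdsdqf--sqdfqsdfazer--azerzaer-azerzear"))
```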
Looks to me like you do want to match strings that contain double hyphens, but you want to break them into substrings that don't. Have you considered splitting it between pairs of hyphens? In other words, split on:
(?<=-)(?=-)
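In Python this split can be sketched as below. It assumes Python 3.7 or later, where re.split accepts a pattern that matches the empty string; the function name is made up for illustration.

```python
import re

def split_between_dashes(text):
    """Split at every position that sits between two adjacent hyphens."""
    # Both lookarounds are zero-width, so no characters are removed.
    return re.split(r"(?<=-)(?=-)", text)

print(split_between_dashes("qsdsdqf--sqdfqsdfazer--azerzaer-azerzear"))
```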
As for your regex, I think this is what you were getting at:
(?:[^-]+|-(?<!--)|\G-)+
The -(?<!--) will match one hyphen, but if the next character is also a hyphen the match ends. Next time around, \G- picks up the second hyphen because it's the next character; the only way that can happen (except at the beginning of the string) is if a previous match broke off at that point.
Be aware that this regex is more flavor dependent than most; I tested it in Java, but not all flavors support \G and lookbehinds.