For the strings:
text::handle:e#ma.il::text
text::chat_identifier:chat0123456789&text
I have the current regex:
m/(handle:|chat_identifier:)(.+?)(:{2}|&)/
And I am currently using $2 in order to obtain the value I wish (in the first string e#ma.il and in the second, chat0123456789).
Is there a better/faster/simpler way to solve this problem, though?
Whether it's "better" or not depends on the context, but you could take this approach: split the string on ":" and take the fourth element of the resulting list. That's arguably more readable than the regex and more robust if the third field can be something other than "handle" or "chat_identifier".
I think the speed would be very similar for either approach, and probably for almost any implementation in Perl. I'd want to show that speed was critical for this step before worrying about it...
For a regex solution, this one is slightly simpler and doesn't need to backtrack:
m/(handle|chat_identifier):([^:&]+)/
Note the slight difference: yours allows single colons within the value, mine doesn't (it stops at the first colon encountered). If that is not a problem, you can use my variant. Or as I mentioned in a comment, split at : and use the fourth element in the result.
An equivalent version that does only stop at double colons is this:
m/(handle|chat_identifier):((?:(?!::|&).)+)/
Not so beautiful, but it still avoids backtracking (the lookahead might make it slower, though... you will need to profile that, if speed matters at all).
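To make the difference between the three patterns concrete, here is a minimal sketch. It uses Python's re module purely for illustration (the thread itself is about Perl), the helper name value is made up, and the third sample string, with a single colon inside the value, is a hypothetical input that does not appear in the question.

```python
import re

# The three patterns discussed above, written for Python's re module.
ORIGINAL = re.compile(r"(handle:|chat_identifier:)(.+?)(:{2}|&)")
NEGATED_CLASS = re.compile(r"(handle|chat_identifier):([^:&]+)")
TEMPERED = re.compile(r"(handle|chat_identifier):((?:(?!::|&).)+)")


def value(pattern, s):
    """Return capture group 2 (the value), or None if nothing matches."""
    m = pattern.search(s)
    return m.group(2) if m else None


samples = [
    "text::handle:e#ma.il::text",
    "text::chat_identifier:chat0123456789&text",
    "text::handle:a:b::text",  # hypothetical: value contains a single colon
]

for s in samples:
    print(s, [value(p, s) for p in (ORIGINAL, NEGATED_CLASS, TEMPERED)])
```

On the two strings from the question all three patterns agree; only the negated-class variant cuts the hypothetical third value short at the first colon.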
Looks like you have a lot of good solutions here already. The split method seems like the simplest. But depending on your requirements, you could also use a more generic regex that breaks the string into its basic pieces. It will work for datatypes and property names other than those in your examples.
([^:]+)::([^:]+):([^:&]+)(?:::|&)\1
The capture groups are as follows:
Group 1: The datatype. (The keyword "text" from your examples.)
Group 2: The property name. (The keywords "handle" and "chat_identifier" from your examples.)
Group 3: The property value.
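As a rough illustration of how those three groups come out, here is the same generic pattern run through Python's re module (chosen only for the example; the function name parse is an assumption, not something from the thread).

```python
import re

# Generic pattern from above: datatype :: property : value, closed by '::' or '&'
# followed by the same datatype again (backreference \1).
GENERIC = re.compile(r"([^:]+)::([^:]+):([^:&]+)(?:::|&)\1")


def parse(s):
    """Return (datatype, property_name, value), or None if the string does not fit."""
    m = GENERIC.search(s)
    return m.groups() if m else None


print(parse("text::handle:e#ma.il::text"))
print(parse("text::chat_identifier:chat0123456789&text"))
```

Because of the \1 backreference, a string whose closing datatype differs from the opening one is rejected rather than partially matched.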
If the values you want are always in the same position and it's safe to split on : and &, then perhaps the following will work for you:
use Modern::Perl;
say +( split /[:&]+/ )[2] for <DATA>;
__DATA__
text::handle:e#ma.il::text
text::chat_identifier:chat0123456789&text
Output:
e#ma.il
chat0123456789
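For readers who do not use Perl, the same split idea can be sketched in Python. This is an illustrative translation, not the answerer's code, and third_field is a made-up name.

```python
import re


def third_field(line: str) -> str:
    """Split on runs of ':' or '&' and return the third piece (index 2)."""
    return re.split(r"[:&]+", line.strip())[2]


for line in ("text::handle:e#ma.il::text",
             "text::chat_identifier:chat0123456789&text"):
    print(third_field(line))
```

As in the Perl version, this only works if the value always sits in the same position and never contains ':' or '&' itself.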
Related
I have a lot of strings which have similar values.
I need to write a regex that will keep all values except those that start with a specific substring. Does anyone know how I can do this?
For example, assume my string values are:
foo_bar
foo_baz
foo_bar_baz
foo_baz_bar
bar_baz
bar_foo
I can write a regex that will capture all of the above strings easily:
(foo_.*|bar_.*)
But suppose I have reasons for dropping anything that starts with "foo_baz" and keeping all the others.
i.e. my results would be:
foo_bar
foo_bar_baz
bar_baz
bar_foo
Is there any easy way to achieve this without explicitly listing each of the strings I want to keep?
Thanks.
You can use a negative lookahead:
^(?!foo_baz).*$
See https://regex101.com/r/jBCSjR/1
Or, depending on your programming language, it could be easier to filter out values using startsWith() or any equivalent.
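Both options can be sketched side by side. The example below uses Python (an assumption, since the question does not name a language), where the startsWith() equivalent is str.startswith; the variable names are illustrative.

```python
import re

values = ["foo_bar", "foo_baz", "foo_bar_baz", "foo_baz_bar", "bar_baz", "bar_foo"]

# Negative lookahead: succeed only if the string does NOT begin with "foo_baz".
KEEP = re.compile(r"^(?!foo_baz).*$")

kept_by_regex = [v for v in values if KEEP.match(v)]
kept_by_prefix = [v for v in values if not v.startswith("foo_baz")]

print(kept_by_regex)
print(kept_by_prefix)
```

Both lists contain foo_bar, foo_bar_baz, bar_baz and bar_foo, in that order.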
I have lines of text similar to this:
value1|value2|value3 etc.
The values themselves are not interesting. The type of delimiter is also unimportant, as is the number of fields. There could be 100 columns, there could be only 5.
I would like to know what is the usual way to write a regexp which will put any given column's value into a capture group.
For example if I would like to get the content of the third field:
[^\|]+?\|[^\|]+?\|(?<capture_group>[^\|]+?)\|
Maybe a little bit nicer version:
(?:[^\|]+?\|){2}(?<capture_group>[^\|]+?)\|
But this could be the 7th, the 100th, the 1000th, it doesn't matter.
My problem is that after a while I run into catastrophic backtracking or simply extremely long running times.
What is the usual way to solve a problem like this?
Edit:
For further clarification: this is a use case where further string operations are simply not permitted. Workarounds are not possible. I would like to know if there's a way simply based on regex or not.
As you stated:
My problem is, that after a while I run into catastrophic backtracking or simply extremely low running times.
What is the usual way to solve a problem like this?
IMHO, you should prefer string operations when the string has a predefined structure (as in your case, where the | character is used as a separator), because string operations are faster than a regex, which is designed to find a pattern. A regex becomes necessary when, for example, the separators may vary and have to be identified first before splitting on them.
e.g.,
value1|value2;value3-value4
For your case, you can simply split the string on the separator character and access the respective index of the resulting array.
EDIT:
If Regex is your only option then try using this regex:
^((.+?)\|){200}
Here 200 is the position of the element I wish to access; this seems a bit less time-consuming than yours.
For example if I would like to get the content of the third field:
[^\|]+?\|[^\|]+?\|(?<capture_group>[^\|]+?)\|
Maybe a little bit nicer version:
(?:[^\|]+?\|){2}(?<capture_group>[^\|]+?)\|
But this could be the 7th, the 100th, the 1000th, it doesn't matter.
As a matter of "steps", using capture groups will cost more steps.
However, using capture groups will allow you to condense your pattern and use a curly bracketed quantifier.
In your first pattern above, you can get away with "greedy" negated character classes (remove the ?) because they will halt at the next |, and you don't need to escape the pipe inside the square brackets.
When you want to access a "much later" positioned substring in the input string, not using a quantifier is going to require a horrifically long pattern and make it very, very difficult to comprehend the exact point that will be matched. In these cases, it would be pretty silly not to use a capture group and a quantifier.
I agree with Toto's comment; accessing an array of split results is going to be a very sensible solution if possible.
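Putting the advice above together (greedy negated character classes, an unescaped pipe inside the brackets, and a quantifier for the fields to skip), a pattern for an arbitrary column can be generated. The sketch below is in Python, so the named group is written (?P<name>...) rather than (?<name>...). The function name nth_field_pattern, the ^ anchor, the non-capturing skip group and the single-character separator are all assumptions made for this example.

```python
import re


def nth_field_pattern(n: int, sep: str = "|") -> re.Pattern:
    """Build a regex whose 'capture_group' holds the n-th field (1-based).

    Assumes sep is a single character. Each skipped field is matched by a
    greedy negated class that cannot run past the separator, so the engine
    has no alternative ways to split the line and cannot backtrack
    catastrophically.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    esc = re.escape(sep)
    cls = "[^" + esc + "]"
    return re.compile(rf"^(?:{cls}*{esc}){{{n - 1}}}(?P<capture_group>{cls}*)")


line = "|".join(f"value{i}" for i in range(1, 1001))  # 1000 made-up fields

print(nth_field_pattern(3).match(line)["capture_group"])
print(nth_field_pattern(1000).match(line)["capture_group"])
```

Anchoring at ^ matters here: without it, a failed match would be retried from every later position in the line, which is one common source of the slow running times described in the question.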
I'm pulling car submodels from the DB and I'm building my regular expression on the fly.
Here is an example of a search string:
EX-L Sedan 4-Door
Here is my regular expression:
preg_match("/LX|EX|EX-L|LX-P|LX-S/Ui", $input_line, $output_array);
For some reason the output is EX and not EX-L as it is supposed to be. Can someone explain why?
Your pattern is unanchored, and the alternatives are tried from left to right, so the first alternative that matches at a given position makes the regex engine stop processing the whole group: EX succeeds before EX-L is ever tried. This is common behavior with NFA regexes.
Also, there are no quantifiers in your pattern, thus the /U modifier is redundant.
So, you can use
/EX-L|LX-P|LX-S|LX|EX/i
It is a readable form. However, best practice with regexes is to make sure no alternative branch can match at the same location as another. That means you can use
/EX(-L)?|LX(-[PS])?/i
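The effect of alternative order is easy to reproduce. The question uses PHP's preg_match; the sketch below uses Python's re module instead, which also tries alternatives left to right, so treat it as an illustration rather than PHP-specific behavior. The helper first_match is a made-up name.

```python
import re

line = "EX-L Sedan 4-Door"


def first_match(pattern: str) -> str:
    """Return the text of the first match of pattern in line (case-insensitive)."""
    return re.search(pattern, line, re.IGNORECASE).group(0)


original = first_match(r"LX|EX|EX-L|LX-P|LX-S")   # EX wins before EX-L is tried
reordered = first_match(r"EX-L|LX-P|LX-S|LX|EX")  # longer alternatives first
factored = first_match(r"EX(-L)?|LX(-[PS])?")     # no two branches start alike

print(original, reordered, factored)
```

This prints EX for the original order and EX-L for both corrected patterns.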
As others have pointed out, the reason for this undesired outcome is because the regex engine is happy to have the first alternative and run for the door since your pattern has no anchors (like: ^, $, and some other lesser known ones). This is the same short-circuiting behavior you'd see in php's if($x || $y) conditions; if $x is true there is no need to evaluate further. But enough about that...
I would like to offer some additional logic that I think is relevant to your case/question.
You say your regex is built on the fly, so I am assuming your method goes something like this:
A user identifies which substrings/keywords they want to search for.
$strings=array('LX','EX','EX-L','LX-P','LX-S');
// array of substrings in any order
As mentioned earlier, you need longer strings to precede shorter ones with identical starting characters.
rsort($strings);
// sort DESC, longer strings precede shorter strings when leading characters match
Pipe all strings into a single regex pattern with implode().
$piped_regex='/\b(?:'.implode('|',$strings).')\b/i';
// word boundaries ensure the string is not part of a larger word; remove if not desired
// pattern: /\b(?:LX-S|LX-P|LX|EX-L|EX)\b/i
While programmatically condensing your similar strings into a concise pattern as Wiktor recommended is possible, it's probably not worth the effort with your on-the-fly patterns.
Finally run preg_match() as normal.
$input_line='EX-L Sedan 4-Door';
if(preg_match($piped_regex,$input_line,$output_array)){
var_export($output_array);
}
// output: array(0=>'EX-L')
I hope stepping out this method is helpful to you and future SO readers.
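The same steps can be sketched in Python for comparison. Two things differ from the PHP above and are assumptions of this sketch: the keywords are sorted by length rather than with a reverse alphabetical sort, and each keyword is passed through re.escape in case a submodel name from the DB contains a regex metacharacter. The function name build_pattern is hypothetical.

```python
import re


def build_pattern(keywords):
    """Join keywords into one alternation, longest first, wrapped in word boundaries."""
    # Longest first, so "EX-L" is tried before its prefix "EX".
    ordered = sorted(keywords, key=len, reverse=True)
    body = "|".join(re.escape(k) for k in ordered)
    return re.compile(r"\b(?:" + body + r")\b", re.IGNORECASE)


pattern = build_pattern(["LX", "EX", "EX-L", "LX-P", "LX-S"])
match = pattern.search("EX-L Sedan 4-Door")
print(match.group(0) if match else None)
```

Sorting by length guarantees that a keyword never follows one of its own prefixes, which is the property the alternation needs.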
Conditions updated
There is often a situation where you want to extract a substring up to (immediately before) certain characters. For example, suppose you have a text that:
Does not start with a semicolon or a period,
Contains several sentences,
Does not contain any "\n", and
Ends with a period,
and you want to extract the sequence from the start up to the closest semicolon or period. Two strategies come to mind:
/[^;.]*/
/.*?[;.]/
I do either of these quite randomly, with a slight preference for the second strategy, and I also see both ways in other people's code. Which is the better way? Is there a clear reason to prefer one over the other, or are there better ways? I personally feel, efficiency aside, that negating something (as with [^]) is conceptually more complex than not doing it. But efficiency may also be a good reason to choose one over the other.
I came up with my answer. The two regexes in my question were actually not expressing the same thing. And the better approach depends on what you want.
If you want a match up to and including a certain character, then using
/.*?[;.]/
is simpler.
If you want a match up to right before (excluding) a certain character, then you should use:
/[^;.]*/
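A short check makes the difference visible. The sketch uses Python's re.match, which anchors at the start of the string; the sample text is made up but follows the stated conditions (no leading semicolon or period, no newline, ends with a period).

```python
import re

text = "First clause; second clause. Next sentence."

# Lazy dot followed by the delimiter: the delimiter is part of the match.
including = re.match(r".*?[;.]", text).group(0)

# Negated class: stops right before the first delimiter.
excluding = re.match(r"[^;.]*", text).group(0)

print(repr(including))  # 'First clause;'
print(repr(excluding))  # 'First clause'
```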
Well, the first way is probably more efficient, not that it's likely to matter. By the way, \z in a character class does not mean "end of input"--in fact, it's a syntax error in every flavor I know of. /[^;.]*/ is all you need anyway.
I personally prefer the first one because it does exactly as you would expect. Get all characters except ...
But it's mostly a matter of preference. There are nearly always multiple ways to write a regular expression and it's mostly style that matters.
For example... do you prefer [0-9], [[:digit:]] or \d? They all do exactly* the same.
* In case of Unicode the [[:digit:]] and \d classes match some other characters too.
You left out one other strategy: string split?
"my sentence; blahblah".split(/[;.]/,2)[0]
I think that it is mostly a matter of opinion as to which regular expression you use. On the note of efficiency, though, I think that adding \A to the beginning of a regular expression in this case would make the process faster because well designed regular expression engines should only try to match once in that case. For example:
/\A[^.;]*/m
Note the m option; it indicates that newline characters can also be matched. This is just a technicality I would add for generic examples, but may not apply to you.
Although adding more to the solution might be viewed as increasing complexity, it can also serve to clarify meaning.
I want to require the following:
Is greater than seven characters.
Contains at least two digits.
Contains at least two special (non-alphanumeric) characters.
...and I came up with this to do it:
(?=.{6,})(?=(.*\d){2,})(?=(.*\W){2,})
Now, I'd also like to make sure that no two sequential characters are the same. I'm having a heck of a time getting that to work though. Here's what I got that works by itself:
(\S)\1+
...but if I try to combine the two together, it fails.
I'm operating within the constraints of the application. Its default requirement is 1 character length, no regex, and no nonstandard characters.
Anyway...
Using this test harness, I would expect y90e5$ to match but y90e5$$ to not.
What am I missing?
This is a bad place for a regex. You're better off using simple validation.
Sometimes we cannot influence specifications and have to write the implementation regardless, i.e., when some ancient backoffice system has to be interfaced through the web but has certain restrictions on input, or just because your boss is asking you to.
EDIT: removed the regex that was based on the original regex of the asker.
altered original code to fit your description, as it didn't seem to really work:
EDIT: the q. was then updated to reflect another version. There are differences which I explain below:
My version: the two or more \W and \d can be interleaved with each other, but cannot appear next to each other (this was my incorrect assumption). I fixed it for length>7, which is slightly more efficient to place at the end as a typical "grab all" expression. (Here (\S) is the second capture group, hence the backreference \2.)
^(?!.*((\S)\2|\s))(?=.*(\d.+){2,})(?=.*(\W.+){2,}).{8,}
New version in original question: the two or more \W and the \d are allowed to appear next to each other. This version currently supports length>=6, not length>7 as is explained in the text.
The current answer, corrected, should be something like this, which takes the updated q. together with my comments on length>7 and optimizations: ^(?!.*((\S)\2|\s))(?=(.*\d){2,})(?=(.*\W){2,}).{8,}
Update: your original code doesn't seem to work, so I changed it a bit
Update: updated answer to reflect changes in question, spaces not allowed anymore
This may not be the most efficient but appears to work.
^(?!.*(\S)\1)(?=.{6,})(?=(.*\d){2,})(?=(.*\W){2,})
Test strings:
ad2f#we1$ //match valid.
adfwwe12#$ //No Match repeated ww.
y90e5$$ //No Match repeated $$.
y90e5$ //No Match too Short and only 1 \W class value.
One of the comments pointed out that the above regex allows spaces which are typically not used for password fields. While this doesn't appear to be a requirement of the original post, as pointed out a simple change will disallow spaces as well.
^(?!.*(\S)\1|.*\s)(?=.{6,})(?=(.*\d){2,})(?=(.*\W){2,})
Your regex engine may parse (?!.*(\S)\1|.*\s) differently. Just be aware and adjust accordingly.
All previous test results the same.
Test string with whitespace:
ad2f #we1$ //No match space in string.
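The test results listed above can be re-checked with a few lines of Python. This is only a sketch: Python's re module stands in for whatever engine the application uses, and as noted above other engines may parse the lookahead differently.

```python
import re

# The whitespace-rejecting variant from above, unchanged.
PASSWORD = re.compile(r"^(?!.*(\S)\1|.*\s)(?=.{6,})(?=(.*\d){2,})(?=(.*\W){2,})")

tests = ["ad2f#we1$", "adfwwe12#$", "y90e5$$", "y90e5$", "ad2f #we1$"]
results = {s: bool(PASSWORD.search(s)) for s in tests}

for s, ok in results.items():
    print(f"{s!r}: {'match' if ok else 'no match'}")
```

Only the first string matches, which agrees with the list of test strings above.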
If the rule was that passwords had to be two digits followed by three letters or some such, of course a regular expression would work very nicely. But I don't think regexes are really designed for the sort of rule you actually have. Even if you get it to work, it would be pretty cryptic to the poor sucker who has to maintain it later -- possibly you. I think it would be a lot simpler to just write a quick function that loops through the characters and counts how many total and how many of each type. Then at the end check the counts.
Just because you know how to use regexes doesn't mean you have to use them for everything. I have a cool cordless drill but I don't use it to put in nails.
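A counting function of the kind suggested above might look like the following Python sketch. It implements the four rules as worded in the question (more than seven characters, at least two digits, at least two non-alphanumeric characters, no two identical characters in a row). The function name is made up, and "special" is taken to mean "not str.isalnum()", which, unlike \W, also counts the underscore as special.

```python
def is_valid_password(pw: str) -> bool:
    """Loop over the characters once, counting digits and special characters."""
    digits = 0
    specials = 0
    previous = None
    for ch in pw:
        if ch == previous:       # two identical characters in a row
            return False
        if ch.isdigit():
            digits += 1
        elif not ch.isalnum():   # assumption: "special" means non-alphanumeric
            specials += 1
        previous = ch
    # Check the counts at the end, as suggested above.
    return len(pw) > 7 and digits >= 2 and specials >= 2


for candidate in ("ad2f#we1$", "adfwwe12#$", "y90e5$$", "y90e5$"):
    print(candidate, is_valid_password(candidate))
```

Each rule is a single readable line here, so changing a threshold later does not require re-deriving a chain of lookaheads.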