preg_match / php style regex to find repeating alphanumeric characters, comma delimited?


I'm trying to figure out a preg_match / PHP-style regex to find repeating groups of alphanumeric characters (of any length), separated by commas.
So if I have the string "c,b,a,xz,x,b,a,c,xz,x,x,b,a",
it would return the first series of two or more values that repeats. I think I need a recursive backreference, maybe something like
<?php
// lines removed for simplicity
// test string = "c,b,a,xz,x,b,a,c,xz,x,x,b,a"
$haystack = "c,b,a,xz,x,b,a,c,xz,x,x,b,a";
$answer = preg_match('/([A-z]{2,*}[\s]{1})([A-z \s]*)[\1]*/', $haystack );
echo $answer; // print the first occurrence of the repeating series of two or more
?>
I just need to find and echo out the first occurrence of a repeating series of two or more values. Is there a way to use a backreference recursively, or some better method?
edit: code vomit removed.

'/\b(\w+,\w+),(?:.*,)?\1\b/' should work. It'd match any sequence of two items, any amount of other stuff, and then the same sequence again.
Catch is, it will likely find the first sequence that has a duplicate, not the sequence that has the first duplicate, due to how regexes work. (The match that starts earliest, wins.) For example, if you have 'a,b,c,d,c,d,a,b,c', $matches[1] would probably be 'a,b', even though 'c,d' would match earlier.
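The leftmost-match behaviour described above can be checked outside PHP. This is a minimal sketch in Python rather than PHP, assuming ASCII input; the names PAIR_REPEATED and first_matching_pair are made up for illustration.

```python
import re

# Same pattern as above: a pair of items, optional filler, then the same pair.
PAIR_REPEATED = re.compile(r'\b(\w+,\w+),(?:.*,)?\1\b')

def first_matching_pair(haystack):
    """Return the pair captured by the leftmost match, or None."""
    m = PAIR_REPEATED.search(haystack)
    return m.group(1) if m else None

# 'c,d' is repeated sooner, but the match that starts earliest wins.
print(first_matching_pair('a,b,c,d,c,d,a,b,c'))            # a,b
print(first_matching_pair('c,b,a,xz,x,b,a,c,xz,x,x,b,a'))  # b,a
```

For the string in the question, nothing starting at 'c,b' is repeated, so the first pair that matches is 'b,a'.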
To find the first duplicate, you'd have to be able to match that and have a backreference to it in a lookbehind assertion. If that's even legal (which I doubt), it'd have to be fixed width before PHP would let it happen.
Edit:
Although, now that I think about it... if you reversed the string and then used '/.*\b(\w+,\w+),(?:.*?,)??\1\b/' on that, it might work. That dances around the constraint I'd mentioned; with the string reversed, the duplicate comes before the original, so now we can match the duplicate and then refer to it "later".
The .* at the beginning of the expression grabs as much as it can, so the match will start as close to the end of the reversed string (and therefore, as close to the beginning of the original string) as possible. And the extra ?s make their corresponding bits lazy, so they match as little as necessary. Of course, once you find the match in the reversed string, you'll need to reverse it in order to get the match in the original string.
And of course, this could break all to hell in the presence of UTF-8. (Then again, most regexes would.) If you're just dealing with ASCII, though, it should work.
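The reverse-then-match idea can be sketched like this. It is written in Python rather than PHP, assumes ASCII input as noted above, and the helper name first_duplicated_pair is hypothetical.

```python
import re

# Pattern from the edit above, meant to be run against the reversed string.
REVERSED_PAIR = re.compile(r'.*\b(\w+,\w+),(?:.*?,)??\1\b')

def first_duplicated_pair(haystack):
    """Return the pair whose second occurrence ends earliest, or None."""
    m = REVERSED_PAIR.search(haystack[::-1])
    # The capture is reversed too, so flip it back to the original order.
    return m.group(1)[::-1] if m else None

print(first_duplicated_pair('a,b,c,d,c,d,a,b,c'))  # c,d
```

On the same input the unreversed pattern reports 'a,b', so the two approaches really do answer different questions.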

Not a PHP expert, but I would think you could use this regex
~\b([a-zA-Z0-9]{2,})\b(?=.*\b\1\b)~ in a while loop.
In the body, you could track the results in a hash (PHP's associative array)
to print out the unique series and their positions. Capture group 1 holds the series.
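A rough sketch of that loop, in Python instead of PHP (the dictionary plays the role of the hash; the function name repeated_series is invented for the example):

```python
import re
from collections import defaultdict

# An item of two or more alphanumerics that appears again later in the string.
REPEATED_TOKEN = re.compile(r'\b([a-zA-Z0-9]{2,})\b(?=.*\b\1\b)')

def repeated_series(haystack):
    """Map each repeated item to the offsets where a repeat-followed copy starts."""
    positions = defaultdict(list)
    for m in REPEATED_TOKEN.finditer(haystack):
        positions[m.group(1)].append(m.start())
    return dict(positions)

print(repeated_series('c,b,a,xz,x,b,a,c,xz,x,x,b,a'))  # {'xz': [6]}
```

Note that, as written, this pattern finds single items of two or more characters that recur, not runs of several comma-separated values.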

Related

Regex taking too many characters

I need some help with building up my regex.
What I am trying to do is match a specific part of text with unpredictable parts in between the fixed words. An example is the sentence one gets when replying to an email:
On date at time person name has written:
The date, time and person name parts are variable; they might contain spaces, or a new line might start at that point.
To get this, I built up my regex as such: On[\s\S]+?at[\s\S]+?person[\s\S]+?has written:
Basically, the [\s\S]+? is supposed to fill in any letter, number, space or line break, as I am unable to predict what could be between the fixed words that I am sure will always be there.
Now comes the hard part: when I add the word "On" somewhere in the text above the sentence that I want to match, the regex matches a much bigger text than I want. This is due to the use of [\s\S]+.
How can I make my regex match as few characters as possible? Adding "?" after the "+" to make it lazy does not help.
Example is here with words "From - This - Point - Everything:". Cases are ignored.
Correct: https://regexr.com/3jdek.
Wrong because of added "From": https://regexr.com/3jdfc
The regex is to be used in VB.NET
A more real-life example, with HTML tags, can be found here. Here, I avoided using [\s\S]+? or (.+)?(\r)?(\n)?(.+?)
Correct: https://regexr.com/3jdd1
Wrong: https://regexr.com/3jdfu after adding certain parts of the regex to the text above. Although this is barely possible in HTML, as the user would never write the matching tag himself, I do want to make sure my regex is correct, just in case.
These things are certain: I know what the part of text starts with, no matter where it sits within the entire text; I know what the part of text ends with; and there are specific fixed words that might make the regex more reliable, but they can be omitted. Any text below the searched part is also allowed to be matched, but no text above it may be matched at all.
Another example where it goes wrong: https://regexr.com/3jdli. Basically, I have less to go on in this text, so the regex has fewer tokens to work with. Adding just the first < already makes the regex take too much.
In my own experience, most problems are avoided by making sure I do not use any [\s\S]+? before first using a (\r)?(\n)?

[\s\S] matches any character because it is the union of two complementary sets; it is like . with the /s option (dot matches newlines). Quantifiers are greedy by default, so the largest match is returned; and even a lazy quantifier only shortens the match on the right, because the engine still starts at the leftmost possible position.
Following the "correct" link, the token just after the shortest match must be geschreven. So another way to write this without lazy quantifiers, and one that is more flexible, is to put a negative lookahead in front of the repeated character set, inside the loop,
so
<blockquote type="cite" [^>]+?>[^O]+?Op[^h]+?heeft(.+?(?=geschreven))geschreven:
becomes
<blockquote type="cite" [^>]+?>[^O]+?Op[^h]+?heeft((?:(?!geschreven).)+)geschreven:
(?: ) is a non-capturing group, which just encapsulates the negative lookahead and the . (which can be replaced by [\s\S])
(?! ) inside is the negative lookahead, which ensures that the current position, before the next character, is not the beginning of the end token.
Following the comments, what must not appear in the repeating sequence can be stated explicitly:
From(?:(?!this)[\s\S])+this(?:(?!point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
or
From(?:(?!From|this)[\s\S])+this(?:(?!point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
or
From(?:(?!From|this)[\s\S])+this(?:(?!this|point)[\s\S])+point(?:(?!everything)[\s\S])+everything:
These show what the technique (?:(?!tokens)[\s\S])+ does:
in the first, this can't appear between From and this
in the second, From or this can't appear between From and this
in the third, this or point can't appear between this and point
etc.
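The difference between the lazy version and the lookahead-guarded version can be seen on a small made-up input. This is a sketch in Python rather than VB.NET (both engines support negative lookahead); the sample text is invented for the demonstration.

```python
import re

# Hypothetical input: an extra "From" appears above the part we want.
text = "From the header\nFrom this point everything: body"

# Lazy version: still starts at the first "From" it can use.
lazy = re.compile(r'From[\s\S]+?this[\s\S]+?point[\s\S]+?everything:')

# Guarded version: the filler after "From" may not contain "From" or "this".
tempered = re.compile(
    r'From(?:(?!From|this)[\s\S])+this'
    r'(?:(?!point)[\s\S])+point'
    r'(?:(?!everything)[\s\S])+everything:'
)

lazy_match = lazy.search(text).group(0)
tempered_match = tempered.search(text).group(0)

print(repr(lazy_match))      # starts at the first "From", spans both lines
print(repr(tempered_match))  # 'From this point everything:'
```

The guarded pattern fails at the first "From" because its filler would have to cross the second "From", so the engine moves on and matches only the wanted sentence.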

Matching all strings without 3 occurrences of/or final single character in RegEx

Trying to figure out the regex for the title,
i.e.,
foo
foo/bar/foo
foo/bar/foo/bar
foo/bar/d
I don't want it to match the 3rd or the 4th one but match the first two. In the 2nd option, the final foo can be anything but a single d.

You could use a regex but it will be more complicated than just counting the number of slashes and also checking the last character isn't a d. If you want to use a regex to check for the last part not being "/d" you could do something like check that it doesn't match ^.*/d$ but it may be clearer to just use code. (If counting slashes and checking string doesn't end in "/d" isn't exactly what you mean then it will help to have more examples)
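The "just use code" route might look like the following sketch. It is in Python, and the rule it encodes (fewer than three slashes, and no final "/d" segment) is only an assumption drawn from the four examples in the question; is_wanted is a made-up name.

```python
def is_wanted(path):
    """Assumed rule: fewer than three slashes and no trailing '/d' segment."""
    return path.count('/') < 3 and not path.endswith('/d')

for candidate in ['foo', 'foo/bar/foo', 'foo/bar/foo/bar', 'foo/bar/d']:
    print(candidate, is_wanted(candidate))
```

If the real rule differs (for example, if the first segment must be literally "foo"), the check would need adjusting.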

Figured it out. See below if anyone is interested.
(^foo/?$)|(^foo/[^/]+/(([^d][^/]*)|(d[^/]+))/?$)
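A quick check of this pattern against the four strings from the question, sketched in Python (the extra sample 'foo/bar/dx' is added here only to show that a last segment merely starting with d is still accepted):

```python
import re

PATTERN = re.compile(r'(^foo/?$)|(^foo/[^/]+/(([^d][^/]*)|(d[^/]+))/?$)')

samples = ['foo', 'foo/bar/foo', 'foo/bar/foo/bar', 'foo/bar/d', 'foo/bar/dx']
results = {s: bool(PATTERN.search(s)) for s in samples}

for s, ok in results.items():
    print(s, ok)
```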

positive look ahead and replace

Recently I've been writing/testing regexps on https://regex101.com/.
My question is: is it possible to do a positive look-ahead AND a replacement in the same "replacement"? Or is only a limited kind of replacement possible?
The input is several lines with phone numbers. Let's say a phone number is correct when it has 11 digits, no matter how the digits are divided/grouped with - / characters, and no matter whether it starts with + or 00 or the prefix is omitted.
Some example lines:
+48301234567
+48/30/1234567
+48-30-12-345-67
+483011223344556677
0048301234567
+(48)30/1234567
The positive look-ahead is able to check that from the beginning to the end of the line there are only 11 digits, regardless of how many of the characters specified above separate them. This works perfectly.
Where the positive look-ahead check passes, I would like to delete every character except the digits. The replacement works fine as long as I don't involve the look-ahead.
Checking the regexp itself works perfectly ("gm" modes):
^(?:\+|00)?(?:[\-\/\(\)]?\d){11}$
Checking the replace part works perfectly (replace to nothing):
[^\d\n]
Putting the check into a look-ahead, followed by the deletion of non-newline, non-digit characters from the matching lines:
(?=^(?:\+|00)?(?:[\-\/\(\)]?\d){11}$)[^\d\n]
Even though I put the ^ and $ into the look-ahead, the replacement seems to work only from the beginning of each line up to the very first digit.
I know that in real life the replacement and the check should/would go separate ways; however, I'm curious whether I could mix look-ahead/look-behind with string operations like replace, delete, or taking the string apart and putting it together as I like.
UPDATE: This is what would do the trick, however I find it a bit "ugly". Is there a prettier solution?
https://regex101.com/r/yT5dA4/2
Or the version I asked for originally, where only digits remain: regex101.com/r/yT5dA4/3

You cannot replace/delete text with a regex alone. A regex is just a tool for matching certain strings; the surrounding code then takes some action depending on the matched text, e.g. performing a substitution or retrieving the second capture group.
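One way to picture "match first, then act on the match" is a substitution with a callback. The sketch below uses Python's re.sub with a function argument; it is not something regex101's plain replacement box can do, and strip_valid_numbers is a hypothetical helper. The validation pattern is the one from the question.

```python
import re

# Validation pattern from the question, applied line by line.
VALID_LINE = re.compile(r'^(?:\+|00)?(?:[\-\/\(\)]?\d){11}$', re.MULTILINE)

def strip_valid_numbers(text):
    """Reduce every valid line to its digits; leave other lines untouched."""
    return VALID_LINE.sub(lambda m: re.sub(r'\D', '', m.group(0)), text)

lines = "+48/30/1234567\n+483011223344556677\n+(48)30/1234567"
print(strip_valid_numbers(lines))
```

The regex only decides which lines qualify; the deletion of non-digits is a second, separate step run on each matched line.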
However it is possible to perform certain decisions within a regex engine, by using conditionals. The common syntax for this, with a lookahead assertion, is (?(?=regex)then|else).
With conditionals you can change the behaviour depending on how the text matches the regex. For your example you could do something like:
^(\+)?(?(1)\(|\d)
If the phone number starts with a plus it must be followed by a bracket, else it should start with a digit. Although in your situation, this is not very useful.
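The conditional above can be tried in Python, whose re module accepts the group-number form (?(1)yes|no) used in this example. A minimal sketch, with starts_ok as an invented name:

```python
import re

# If group 1 (the plus) matched, require "(" next; otherwise require a digit.
CONDITIONAL = re.compile(r'^(\+)?(?(1)\(|\d)')

def starts_ok(number):
    return CONDITIONAL.match(number) is not None

print(starts_ok('+(48)30/1234567'))  # True
print(starts_ok('+48301234567'))     # False
print(starts_ok('0048301234567'))    # True
```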
If you want to read up more on conditionals in regex you can do so here.

Regular Expression to match most explicit string

I have some experience with regular expressions but I am far from expert level and need a way to match the record with the most explicit string in a file where each record begins with a unique 1-5 digit integer and is padded with various other characters when it is shorter than 5 digits. For example, my file has records that begin with:
32000
3201X
32014
320xy
In this example, the non-numeric characters represent wildcards. I thought the following regex examples would work but rather than match the record with the MOST explicit number, they always match the record with the LEAST explicit number. Remember, I do not know what is in the file so I need to test all possibilities to locate the MOST explicit match.
If I need to search for 32000, the regex looks something like:
/^3\D{4}|^32\D{3}|^320\D{2}|^3200\D|^32000/
It should match 32000 but it matches 320xy
If I need to search for 32014, the regex looks something like:
/^3\D{4}|^32\D{3}|^320\D{2}|^3201\D|^32014/
It should match 32014 but it matches 320xy
If I need to search for 32015, the regex looks something like:
/^3\D{4}|^32\D{3}|^320\D{2}|^3201\D|^32015/
It should match 3201X but it matches 320xy
In each case, the matched result is the LEAST specific numeric value. I also tried reversing the regex as follows but still get the same results:
/^32014|^3201\D|^320\D{2}|^32\D{3}|^3\D{4}/
Any help is much appreciated.

Okay, if you want to match a string literally then use anchors, and then specify the string you want matched. For instance, to match '123456xyz' where the xyz can be anything except digits, use:
'^123456[^0-9]{3}$'
If you prefer specific letters to match at the end, if they will always be x y or z then use:
'^123456[xyz]{3}$'
Note that the ^ and $ anchor the string to start with 123456 and end with three letters that are x, y or z.
Good luck!

Ok, I did quite some tinkering here. I am 99% sure that this is pretty much impossible (if we don't cheat and interpolate code into the regex). The reason is you will need a negative lookbehind with variable length at some point.
However, I came up with two alternatives. One is if you want just to find the "most exact match", the second one is if you want to replace it with something. Here we go:
/(32000)|\A(?!.*32000).*(3200\D)|\A(?!.*3200[0\D]).*(320\D\D)|\A(?!.*320[0\D][0\D]).*(32\D\D\D)|\A(?!.*32[0\D][0\D][0\D]).*(3\D\D\D\D)/m
Question:
So what is my "most exact match" here?
Answer:
The concatenation of the 5 matched groups - \1\2\3\4\5. In fact always only one of them will match, the other 4 will be empty.
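To try the first regex outside Ruby, here is a sketch in Python. It assumes that re.DOTALL is the right stand-in for the Ruby flag used above (so that . can cross line breaks), and most_exact is a made-up helper that concatenates the five groups.

```python
import re

MOST_EXACT = re.compile(
    r'(32000)'
    r'|\A(?!.*32000).*(3200\D)'
    r'|\A(?!.*3200[0\D]).*(320\D\D)'
    r'|\A(?!.*320[0\D][0\D]).*(32\D\D\D)'
    r'|\A(?!.*32[0\D][0\D][0\D]).*(3\D\D\D\D)',
    re.DOTALL,  # let "." match newlines so ".*" can scan the whole text
)

def most_exact(text):
    """Concatenate the five groups; at most one of them is non-empty."""
    m = MOST_EXACT.search(text)
    return ''.join(g or '' for g in m.groups()) if m else None

print(most_exact("32000\n3201X\n32014\n320xy"))  # 32000
print(most_exact("3201X\n32014\n320xy"))         # 320xy
```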
/(32000)|\A(?!.*32000)(.*)(3200\D)|\A(?!.*3200[0\D])(.*)(320\D\D)|\A(?!.*320[0\D][0\D])(.*)(32\D\D\D)|\A(?!.*32[0\D][0\D][0\D])(.*)(3\D\D\D\D)/m
Question:
How can I use this to replace my "most exact match"?
Answer:
In this case your "most exact match" will be the concatenation of \1\3\5\7\9, but we will have also matched some other things before that, namely \2\4\6\8 (again, only one of these can be non empty). Therefore if you want to replace your "most exact match" with fubar you can match with the above regex and replace with \2\4\6\8fubar
Another way you can think about it (and might be helpful) is that your "most exact match" will be the last matched line of either of the two regexes.
Two things to note here:
I used Ruby-style regexes: \A means the beginning of the string (not the beginning of a line, which is ^), and the /m flag means multi-line mode (in Ruby, this lets . match newlines). You should be able to find syntax for the same things in your language/technology as long as it uses some flavor of PCRE.
This can be slow. If we don't find an exact match, we might possibly have to match and replace the entire string (if the non-exact match can be found at the end of the string).
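If putting the loop in code rather than in the regex is acceptable (the "cheat" mentioned above), a much simpler route is to try one pattern per specificity level, most specific first. This is a different technique from the single-regex answer, sketched in Python; most_explicit and the in-memory list of records are assumptions for illustration.

```python
import re

def most_explicit(records, target):
    """Return the record matching the longest numeric prefix of target."""
    # Try 5 digits, then 4 digits + 1 wildcard, then 3 digits + 2 wildcards, ...
    for kept in range(len(target), 0, -1):
        wildcards = len(target) - kept
        pattern = re.compile(re.escape(target[:kept]) + r'\D{%d}' % wildcards)
        for record in records:
            if pattern.match(record):
                return record
    return None

records = ['32000', '3201X', '32014', '320xy']
print(most_explicit(records, '32000'))  # 32000
print(most_explicit(records, '32014'))  # 32014
print(most_explicit(records, '32015'))  # 3201X
```

Because the most specific pattern is tried against every record before falling back to a looser one, the order of records in the file no longer affects which one wins.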

Insertion syntax for regex in Notepad++ or Perl

Shortform: searching:
"{,[0-9][0-9]," inserting Space+00... getting replaced string segment:
"{,SPACE00[0-9][0-9]," or other so-garbaged data for found [0-9][0-9] sequence ... so how do I search with a regex and insert in the middle???
Longform question:
I'm trying to do a series of simple character insertions -- digits actually -- in a series of mixed model CSV profiling data (five files each with different model parameters, several hundred lines each).
I'm visually challenged and want to insert padding characters to columnize the data, so I can focus on tweaking key values rather than keeping my place from data file to data file.
The CSV data lines have this format:
*Variable_symbolic-name*,{##,##,* ... ('Set of CSV Numerical Data lists' ...},\n*
an actual data line:
61,parameter17,{,70,6,1,-1,3, 00,0,0,0,0,},,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
to be morphed to:
61,parameter17,\t\t{, 0070,6,1,-1,3, 00,0,0,0,0,},,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Give or take a tab character to align all the { numeric field starts...
I've found searching: "{,[0-9][0-9]," failed but "\{,[0-9][0-9]," succeeds for the find part of the search and replace operation... but have hit a proverbial brick wall in how to do the actual replace (with an insert) of such a short length. (Obviously with so many parameters and files, I'm moving cautiously!)
However, This Perl Help tutorial leaves me in the dark as to how to keep the found ranges and insert padding before (Space, zero, zero to be specific if positive, '-00' if negative) In short, I need to know how to insert 2-3 places in the replace field in Notepad++... and retain the original data without prejudicing it!
Articles herein have cited replacing paragraphs and lines, adding newlines, etc. but this simple insertion alteration seems too simple for you all. But it's been several hours of frustration for me!
Thanks! // Frank
Resolved:
Good news: ({,)([0-9][0-9],) and \1 xx\2 works fine as does ({,)(#[0-9][0-9],) and replacing with \1 xx#\2 ... whether or not tabs are utilized. Obviously the key was ([0-9][0-9],) which included the discrimination of the comma... though I have no idea why that seemed to fail an hour ago with trials made using Sobrinho's help. Must have not tried the sequence. Thanks all!

Try to type this in the search box:
(.+)(\{,[0-9][0-9].*)
And in the replace:
\1\t\t\2
When you have things between parentheses, they are "stored" by Notepad++ and can be reused in the replace box.
The groups are numbered starting from one and are accessed as \1, \2, ...

You tagged it as Perl, so here is how you do it in Perl ...
I prefer to use lookahead assertions rather than backreferences
s/(?= {,[0-9][0-9], ) /\t\t/x
Alternatively, $& contains the matched string ($0 is something different)
s/ {,[0-9][0-9], /\t\t$&/x
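The lookahead form translates almost directly to other engines. As a sketch in Python (3.7 or later, where re.sub replaces zero-width matches like this one), using the sample data line from the question:

```python
import re

line = '61,parameter17,{,70,6,1,-1,3, 00,0,0,0,0,},'

# The lookahead matches the empty string just before "{,NN,",
# so the substitution inserts the tabs without consuming anything.
padded = re.sub(r'(?=\{,[0-9][0-9],)', '\t\t', line)

print(repr(padded))
```

Since nothing is consumed, there is nothing to put back, which is why no backreference is needed in the replacement.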

You will need a backreference here, meaning something which, in the replace part, will be equal to what you have matched.
Usually, the whole matched part is stored in the $0 backreference. (You can get $1 with a capture group too, and up to $2 with two capture groups, etc)
Back to your question, you could try this:
Find:
(\{,)([0-9][0-9],)
Replace by:
\t\t$1 00$2
This will insert two tab characters before the part that matched \{,[0-9][0-9], (or in other words, replace the part that matched by 2 tab characters and what you matched), then put the first captured part ({,) and then the space and double 0's and then the second captured part, the two digits and following comma.
regex101 demo
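The same find/replace pair can be checked in a script before running it over several files. This is a sketch in Python, where groups are written \1 and \2 in the replacement instead of $1 and $2; the input is the sample data line from the question.

```python
import re

line = '61,parameter17,{,70,6,1,-1,3, 00,0,0,0,0,},'

# Group 1 is "{," and group 2 is the two digits plus the comma.
padded = re.sub(r'(\{,)([0-9][0-9],)', r'\t\t\1 00\2', line)

print(repr(padded))
```

Only the first two-digit field after "{," is touched; the rest of the line, including the later " 00," field, is left as it was.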
