Regex - Skip characters to match - regex

I'm having an issue with Regex.
I'm trying to match T0000001 (2, 3 and so on).
However, some of the lines it searches has what I can describe as positioners. These are shown as a question mark, followed by 2 digits, such as ?21.
These positioners describe a new position if the document were to be printed off the website.
Example:
T123?214567
T?211234567
I need to disregard ?21 and match T1234567.
From what I can see, this is not possible.
I have looked everywhere and tried numerous attempts.
All we have to work off is the linked image. The creators cant even confirm the flavour of Regex it is - they believe its Python but I'm unsure.
Regex Image
Update
Unfortunately none of the codes below have worked so far. I thought to test each code in live (Rather than via regex thinking may work different but unfortunately still didn't work)
There is no replace feature, and as mentioned before I'm not sure if it is Python. Appreciate your help.

Do two regex operations
First do the regex replace to replace the positioners with an empty string.
(\?[0-9]{2})
Then do the regex match
T[0-9]{7}

If there's only one occurrence of the 'positioners' in each match, something like this should work: (T.*?)\?\d{2}(.*)
This can be tested here: https://regex101.com/r/XhQXkh/2
Basically, match two capture groups before and after the '?21' sequence. You'll need to concatenate these two matches.

At first, match the ?21 and repace it with a distinctive character, #, etc
\?21
Demo
and you may try this regex to find what you want
(T(?:\d{7}|[\#\d]{8}))\s
Demo,,, in which target string is captured to group 1 (or \1).
Finally, replace # with ?21 or something you like.
Python script may be like this
ss="""T123?214567
T?211234567
T1234567
T1234434?21
T5435433"""
rexpre= re.compile(r'\?21')
regx= re.compile(r'(T(?:\d{7}|[\#\d]{8}))\s')
for m in regx.findall(rexpre.sub('#',ss)):
print(m)
print()
for m in regx.findall(rexpre.sub('#',ss)):
print(re.sub('#',r'?21', m))
Output is
T123#4567
T#1234567
T1234567
T1234434#
T123?214567
T?211234567
T1234567
T1234434?21

If using a replace functionality is an option for you then this might be an approach to match T0000001 or T123?214567:
Capture a T followed by zero or more digits before the optional part in group 1 (T\d*)
Make the question mark followed by 2 digits part optional (?:\?\d{2})?
Capture one or more digits after in group 2 (\d+).
Then in the replacement you could use group1group2 \1\2.
Using word boundaries \b (Or use assertions for the start and the end of the line ^ $) this could look like:
\b(T\d*)(?:\?\d{2})?(\d+)\b
Example Python

Is the below what you want?
Use RegExReplace with multiline tag (m) and enable replace all occurrences!
Pattern = (T\d*)\?\d{2}(\d*)
replace = $1$2
Usage Example:

Related

Regex capture into group everything from string except part of string

I'm trying to create a regex, which will capture everything from a string, except for specific parts of the string. The he best place to start seems to be using groups.
For example, I want to capture everything except for "production" and "public" from a string.
Sample input:
california-public-local-card-production
production-nevada-public
Would give output
california-local-card
nevada
On https://regex101.com/ I can extract the strings I don't want with
(production|public)\g
But how to capture the things I want instead?
The following will kind of get me the word from between production and public, but not anything before or after https://regex101.com/r/f5xLLr/2 :
(production|public)-?(\w*)\g
Flipping it and going for \s\S actually gives me what I need in two separate subgroups (group2 in both matches) https://regex101.com/r/ItlXk5/1 :
(([\s\S]*?)(production|public))\g
But how to combine the results? Ideally I would like to extract them as a separate named group , this is where I've gotten to https://regex101.com/r/scWxh5/1 :
(([\s\S]*?)(production|public))(?P<app>\2)\g
But this breaks the group2 matchings and gets me empty strings. What else should I try?
Edit: This question boils down to this: How to merge regex group matches?
Which seems to be impossible to solve in regex.
A regexp match is always a continuous range of the sample string. Thus, the anwswer is "No, you cannot write a regexp which matches a series of concatenated substrings as described in the question".
But, this popular kind of task is being solved very easily by replacing unnecessary words by empty strings. Like
s/-production|production-|-public|public-//g
(Or an equivalent in a language you're using)
Note. Provided that \b is supported, it would be more correct to spell it as
s/-production\b|\bproduction-|-public\b|\bpublic-//g
(to avoid matching words like 'subproduction' or 'publication')
Your regex is nearly there:
([\s\S]*?)(?>production|public)
But this results in multiple matches
Match 1
Full match 0-17 `california-public`
Group 1. 0-11 `california-`
Match 2
Full match 17-39 `-local-card-production`
Group 1. 17-29 `-local-card-`
So You have to match multiple times to retrieve the result.

Regex to MATCH number string (with optional text) in a sentence

I am trying to write a regex that matches only strings like this:
89-72
10-123
109-12
122-311(a)
22-311(a)(1)(d)(4)
These strings are embedded in sentences and sometimes there are 2 potential matches in the sentence like this:
In section 10-123 which references section 122-311(a) there is a phone number 456-234-2222
I do not want to match the phone. Here is my current working regex
\d{2,3}\-\d{2,3}(\([a-zA-Z0-9]\))*
see DEMO
I've been looking on Stack and have not found anything yet. Any help would be appreciated. Will be using this in a google sheet and potentially postgres.
Based on regex, suggested by #Wiktor Stribiżew:
=REGEXEXTRACT(A1,REPT("\b(\d{2,3}-\d{2,3}\b(?:\([A-Za-z0-9]\))*)(?:[^-]|$)(?:.*)",LEN(REGEXREPLACE(REGEXREPLACE(A1,"\b(\d{2,3}-\d{2,3}\b(?:\([A-Za-z0-9]\))*)(?:[^-]|$)", char (9)),"[^"&char(9)&"]",""))))
The formula will return all matches.
String:
A
In 22-311(a)(1)(d)(4) section 10-123 which ... 122-311(a) ... number 456-234-2222
Output:
B C D
22-311(a)(1)(d)(4) 10-123 122-311(a)
Solution
To extract all matches from a string, use this pattern:
=REGEXEXTRACT(A1,
REPT(basic_regex & "(?:.*)",
LEN(REGEXREPLACE(REGEXREPLACE(A1,basic_regex, char (9)),"[^"&char(9)&"]",""))))
The tail of a function:
LEN(REGEXREPLACE(REGEXREPLACE(A1,basic_regex, char (9)),"[^"&char(9)&"]","")))
is just for finding number 3 -- how many entries of a pattern in a string.
To not match the phone number you have to indicate that the match must neither be preceded nor followed by \d or -. Google spreadsheet uses RE2 which does not support look around assertion (see the list of supported feature) so as far as I can tell, the only solution is to add a character before and after the match, or the string boundary:
(?:^|[^-\d])\d{2,3}\-\d{2,3}(\([a-zA-Z0-9]\))*(?:$|[^-\d])
(?:^|[^-\d]) means either the start of a line (^) or a character that is not - or \d (you might want to change that, and forbid all letters as well). $ is the end of a line. ^ and $ only do what you want with the /m flag though
As you can see here this finds the correct strings, but with additional spaces around some of the matches.

Get last characters up to specific character

Lets say I have a string something-123.
I need to get last 5 (or less) characters of it but only up to - if there is one in the string, so the result would be like thing, but if string has no - in it, like something123 then the result would be ng123, and if string is like 123 then the result would be 123.
I know how to mach last 5 characters:
/.{5}$/
I know how to mach everything up to first -:
/[^-]*/
But I can not figure out how to combine them, and to make things worse I need to get the match without extracting it from specific groups and similar advanced regex stuff because I want to use it in SQL Anywhere, please help.
Tank you all for the help, but looks like a complete regex solution is going to be too complicated for my problem, so I did it very simple: SELECT right(regexp_substr('something-123', '[^-]*'), 4).
One option is to group the result:
(.{4})-
Now you have captured the result but without the -.
Or using lookarounds you can:
.{4}(?=-)
which matches any 4 characters that appears before "-".
You can use:
.{5}(?=(?:-[^-]*)?$)
See the regex demo
We match 5 symbols other than a newline only before the last - in the string or at the very end of the string ((?=(?:-[^-]*)?$)). You only need to collect the matches, no need checking groups/submatches.
UPDATE
To match any 1 to 5 characters other than a hyphen before the first hyphen (if present in the string), you can use
([^-]{1,5})(?:(?:-[^-]*)*)?$
See demo. We rely on a lookahead here, that checks if there are -+non-hyphen sequences are after the expected substring.
An faster alternative:
^[^-]*?([^-]{1,5})(?:-|$)
This regex will search for any characters other than - up to 1 to 5 such characters.
Note that here, the value we need is in Group 1.
How about:
(.{5})(?:-[^-]+)?$
The result is in group 1
Try this regex:
(.{1,5})(?:-.*|$)
Group 1 has the result you need
demo

Regexp: replacing all [[??]] with {{param|??}}

I'm hoping some regexp guru to help me out with this:
I have strings such as [[AB]], [[ABC]] and [[BEC]], and I want to replace them with string {{param|AB}}, {{param|ABC}} and {{param|BEC}} respectively.
All source strings are inside [[]] and have 2 or 3 upper case letters. The idea is to transfer the letters inside brackets to the new format. It's fine if I need two different regexps for 2 and 3 letter long cases.
(if curious, this is for replacing large number of links with templates in a Mediawiki based page).
Thanks in advance!
You can replace the result of following regex :
/\[\[([A-Z]{2,3})\]\]/
with :
{{/param\|\1/}}
Not that some regex engines use $ for capture group so you may need to use {{/param\|$1/}}
If you want to exclude some words you can use a negative look ahead :
/^\[\[((?!AAA|BBB|CCC)[A-Z]{2,3})]]$/gm
But note that since that preceding regex use anchors if you are dealing with a multiline string you need to use m flag (multiline flag).
See demo https://regex101.com/r/cR8zG6/1
You can search using this regex:
\[\[(\w+)\]\]
and replace using:
{{param|$1}}
RegEx Demo

Interesting easy looking Regex

I am re-phrasing my question to clear confusions!
I want to match if a string has certain letters for this I use the character class:
[ACD]
and it works perfectly!
but I want to match if the string has those letter(s) 2 or more times either repeated or 2 separate letters
For example:
[AKL] should match:
ABCVL
AAGHF
KKUI
AKL
But the above should not match the following:
ABCD
KHID
LOVE
because those are there but only once!
that's why I was trying to use:
[ACD]{2,}
But it's not working, probably it's not the right Regex.. can somebody a Regex guru can help me solve this puzzle?
Thanks
PS: I will use it on MYSQL - a differnt approach can also welcome! but I like to use regex for smarter and shorter query!
To ensure that a string contains at least two occurencies in a set of letters (lets say A K L as in your example), you can write something like this:
[AKL].*[AKL]
Since the MySQL regex engine is a DFA, there is no need to use a negated character class like [^AKL] in place of the dot to avoid backtracking, or a lazy quantifier that is not supported at all.
example:
SELECT 'KKUI' REGEXP '[AKL].*[AKL]';
will return 1
You can follow this link that speaks on the particular subject of the LIKE and the REGEXP features in MySQL.
If I understood you correctly, this is quite simple:
[A-Z].*?[A-Z]
This looks for your something in your set, [A-Z], and then lazily matches characters until it (potentially) comes across the set, [A-Z], again.
As #Enigmadan pointed out, a lazy match is not necessary here: [A-Z].*[A-Z]
The expression you are using searches for characters between 2 and unlimited times with these characters ACDFGHIJKMNOPQRSTUVWXZ.
However, your RegEx expression is excluding Y (UVWXZ])) therefore Z cannot be found since it is not surrounded by another character in your expression and the same principle applies to B ([ACD) also excluded in you RegEx expression. For example Z and A would match in an expression like ZABCDEFGHIJKLMNOPQRSTUVWXYZA
If those were not excluded on purpose probably better can be to use ranges like [A-Z]
If you want 2 or more of a match on [AKL], then you may use just [AKL] and may have match >= 2.
I am not good at SQL regex, but may be something like this?
check (dbo.RegexMatch( ['ABCVL'], '[AKL]' ) >= 2)
To put it in simple English, use [AKL] as your regex, and check the match on the string to be greater than 2. Here's how I would do in Java:
private boolean search2orMore(String string) {
Matcher matcher = Pattern.compile("[ACD]").matcher(string);
int counter = 0;
while (matcher.find())
{
counter++;
}
return (counter >= 2);
}
You can't use [ACD]{2,} because it always wants to match 2 or more of each characters and will fail if you have 2 or more matching single characters.
your question is not very clear, but here is my trial pattern
\b(\S*[AKL]\S*[AKL]\S*)\b
Demo
pretty sure this should work in any case
(?<l>[^AKL\n]*[AKL]+[^AKL\n]*[AKL]+[^AKL\n]*)[\n\r]
replace AKL for letters you need can be done very easily dynamicly tell me if you need it
Is this what you are looking for?
".*(.*[AKL].*){2,}.*" (without quotes)
It matches if there are at least two occurences of your charactes sorrounded by anything.
It is .NET regex, but should be same for anything else
Edit
Overall, MySQL regular expression support is pretty weak.
If you only need to match your capture group a minimum of two times, then you can simply use:
select * from ... where ... regexp('([ACD].*){2,}') #could be `2,` or just `2`
If you need to match your capture group more than two times, then just change the number:
select * from ... where ... regexp('([ACD].*){3}')
#This number should match the number of matches you need
If you needed a minimum of 7 matches and you were using your previous capture group [ACDF-KM-XZ]
e.g.
select * from ... where ... regexp('([ACDF-KM-XZ].*){7,}')
Response before edit:
Your regex is trying to find at least two characters from the set[ACDFGHIJKMNOPQRSTUVWXZ].
([ACDFGHIJKMNOPQRSTUVWXZ]){2,}
The reason A and Z are not being matched in your example string (ABCDEFGHIJKLMNOPQRSTUVWXYZ) is because you are looking for two or more characters that are together that match your set. A is a single character followed by a character that does not match your set. Thus, A is not matched.
Similarly, Z is a single character preceded by a character that does not match your set. Thus, Z is not matched.
The bolded characters below do not match your set
ABCDEFGHIJKLMNOPQRSTUVWXYZ
If you were to do a global search in the string, only the italicized characters would be matched:
ABCDEFGHIJKLMNOPQRSTUVWXYZ