Regex find characters in string - regex

Given the following string
2010-01-01XD2010-01-02XX2010-01-03NX2010-01-04XD2010-01-05DN
I am trying to find all instances of a date followed by one or two characters, e.g. 2010-01-01XD, but not where the characters are XX.
I have tried
(2010-01-02[^X]{2})|(2010-01-08[^X]{2})|(2010-01-07[^X]{2})|(2010-01-05[^X]{2})|(2010-01-15[^X]{2})
This works if neither character is X. I have also tried
(2010-01-02[^X]{1,2})|(2010-01-08[^X]{1,2})|(2010-01-07[^X]{1,2})|(2010-01-05[^X]{1,2})|(2010-01-15[^X]{1,2})
This works for DX but not XD.
To be a little clearer:
2010-01-01XD
2010-01-01DX
2010-01-01ND
All above should be picked up
2010-01-01XX
And this ignored

You can use this regex based on negative lookahead:
(20[0-9]{2}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12][0-9]|3[01])(?!XX)[A-Z]{2})
RegEx Demo
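As a quick check, here is a minimal Python sketch that runs this lookahead pattern over the sample string from the question. The variable names are illustrative, and the outer capture group is dropped so that re.findall returns whole matches.

```python
import re

# Sample string from the question.
text = "2010-01-01XD2010-01-02XX2010-01-03NX2010-01-04XD2010-01-05DN"

# A date, then two uppercase letters that are not exactly "XX".
pattern = re.compile(
    r"20[0-9]{2}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12][0-9]|3[01])(?!XX)[A-Z]{2}"
)

# No capture groups, so findall returns the full matched substrings.
matches = pattern.findall(text)
print(matches)
```

The entry ending in XX is skipped because the lookahead fails right after the date, before any letters are consumed.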

Easiest way is to use a lookahead assertion (if available).
# (2010-01-01|2010-01-02|2010-01-08|2010-01-07|2010-01-05|2010-01-15)(?!XX)(?i:([a-z]){1,2})
( # (1 start), One of these dates
2010-01-01
| 2010-01-02
| 2010-01-08
| 2010-01-07
| 2010-01-05
| 2010-01-15
) # (1 end)
(?! XX ) # Look ahead assertion, cannot match XX here
(?i: # 1 or 2 of any U/L case letter
( [a-z] ){1,2} # (2)
)

You could likely use a simple pattern with a negative lookahead such as this:
\d{4}-\d{2}-\d{2}(?!XX)[A-Z]{1,2}
example: http://regex101.com/r/dI1nW4/2
To allow Unicode characters (with the exception of XX) you could use:
\d{4}-\d{2}-\d{2}(?!XX)\D{1,2}
example: http://regex101.com/r/yB5fI0/1

20[0-9]{2}-[01][0-9]-[0-3][0-9]([A-Z][A-WYZ]|[A-WYZ][A-Z])
See it in action.
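This pattern avoids lookahead by listing the allowed letter pairs with character classes. A small Python sketch, assuming only two-letter uppercase suffixes matter, can confirm by brute force that the two classes reject exactly XX and agree with the lookahead form on every pair. The names below are illustrative.

```python
import re
from itertools import product
from string import ascii_uppercase

# Suffix part only: lookahead form versus the character-class form.
lookahead = re.compile(r"(?!XX)[A-Z]{2}")
classes = re.compile(r"[A-Z][A-WYZ]|[A-WYZ][A-Z]")

# All 26 * 26 two-letter uppercase suffixes.
pairs = ["".join(p) for p in product(ascii_uppercase, repeat=2)]

# Pairs on which the two patterns give different answers.
disagreements = [s for s in pairs
                 if bool(lookahead.fullmatch(s)) != bool(classes.fullmatch(s))]
# Pairs rejected by the character-class form.
rejected = [s for s in pairs if not classes.fullmatch(s)]
print(disagreements, rejected)
```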

A negative look ahead is the easiest way to assert the letters not being XX, but there are some simplifications you can make to the alternation by recognising the parts of the date shared by all dates you're trying to match, making this shorter regex:
2010-01-(02|08|07|05|15)(?!XX)[A-Z]{1,2}
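A minimal Python sketch of the shortened pattern, checked against the four suffix cases from the question. Two things here are my own choices: the cases use 2010-01-02 rather than 2010-01-01, because 01 is not in this pattern's day list, and the day group is made non-capturing.

```python
import re

# Shared "2010-01-" prefix, alternation only over the day part.
pattern = re.compile(r"2010-01-(?:02|08|07|05|15)(?!XX)[A-Z]{1,2}")

cases = ["2010-01-02XD", "2010-01-02DX", "2010-01-02ND", "2010-01-02XX"]

# Map each case to whether the whole string matches.
results = {s: bool(pattern.fullmatch(s)) for s in cases}
print(results)
```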

Regex pattern for mm/dd/yyyy and mmddyyyy in Scala

I have date in my .txt file which comes like either of the below:
mmddyyyy
OR
mm/dd/yyyy
Below is the regex which works fine for mm/dd/yyyy.
^02\/(?:[01]\d|2\d)\/(?:19|20)(?:0[048]|[13579][26]|[2468][048])|(?:0[13578]|10|12)\/(?:[0-2]\d|3[01])\/(?:19|20)\d{2}|(?:0[469]|11)\/(?:[0-2]\d|30)\/(?:19|20)\d{2}|02\/(?:[0-1]\d|2[0-8])\/(?:19|20)\d{2}$
However, I am unable to build a regex for mmddyyyy. Is there a generic regex that would work for both cases?
Why use regex for this? Seems like a case of "Now you have two problems"
It would be more effective (and easier to understand) to use a DateTimeFormatter (assuming you are on the JVM and not using scala-js)
The format patterns support using [] to surround optional sections, such as the /, and the formatters inherently perform input validation so if you plug in a month or day that can't exist, it'll throw an exception.
import java.time.format.DateTimeFormatter
import java.time.LocalDate
val mdy = DateTimeFormatter.ofPattern("MM[/]dd[/]yyyy")
def parse(rawDate: String) = LocalDate.parse(rawDate, mdy)
scala> parse("12252022")
res7: java.time.LocalDate = 2022-12-25
scala> parse("12/25/2022")
res8: java.time.LocalDate = 2022-12-25
scala> parse("25/12/2022")
java.time.format.DateTimeParseException: Text '25/12/2022' could not be parsed: Invalid value for MonthOfYear (valid values 1 - 12): 25
scala> parse("abc123")
java.time.format.DateTimeParseException: Text 'abc123' could not be parsed at index 0
If you want to match all those variations with either 2 forward slashes or only digits, you can use a positive lookahead to assert either only digits or 2 forward slashes surrounded by digits.
Then in the pattern itself you can make matching the / optional.
Note that you don't have to escape the forward slash (\/ can be written as /).
^(?=\d+(?:/\d+/\d+)?$)(?:02/?(?:[01]\d|2\d)/?(?:19|20)(?:0[048]|[13579][26]|[2468][048])|(?:0[13578]|10|12)/?(?:[0-2]\d|3[01])/?(?:19|20)\d{2}|(?:0[469]|11)/?(?:[0-2]\d|30)/?(?:19|20)\d{2}|02/?(?:[0-1]\d|2[0-8])/?(?:19|20)\d{2})$
Regex demo
Another option is to write an alternation | matching the same pattern without the / in it.
First of all, there is a tiny shortcoming in your regex: the ^ anchor only applies to the first part of your regex, not to the other alternatives that are separated by |. Similarly the final $ applies only to the final alternative. You should put all alternatives in a non-capturing group, like ^(?: | | | )$
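To illustrate the anchoring point with a toy pattern (not the date regex itself), here is a small Python sketch. The patterns and inputs are made up for the demonstration.

```python
import re

# Parsed as (^ab) | (cd$): each anchor binds to one alternative only.
unanchored = re.compile(r"^ab|cd$")
# The group makes both anchors apply to every alternative.
grouped = re.compile(r"^(?:ab|cd)$")

# Trailing text slips through the first alternative of the unanchored form.
loose_prefix = bool(unanchored.search("abXXX"))
strict_prefix = bool(grouped.search("abXXX"))
# Leading text slips through the last alternative of the unanchored form.
loose_suffix = bool(unanchored.search("XXXcd"))
strict_exact = bool(grouped.search("cd"))
print(loose_prefix, strict_prefix, loose_suffix, strict_exact)
```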
Then for the question itself, you could make the forward slash that follows the month optional and put it in a capture group. Then what comes between the day and the year could be a backreference to that capture group. So (\/?) and \1.
^(?:02(\/?)(?:[01]\d|2\d)\1(?:19|20)(?:0[048]|[13579][26]|[2468][048])|(?:0[13578]|10|12)(\/?)(?:[0-2]\d|3[01])\2(?:19|20)\d{2}|(?:0[469]|11)(\/?)(?:[0-2]\d|30)\3(?:19|20)\d{2}|02(\/?)(?:[0-1]\d|2[0-8])\4(?:19|20)\d{2})$
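A minimal sketch to try the backreference version. It is written in Python rather than Scala only to keep it short; the pattern is copied as-is and the test inputs are my own. Because each backreference must repeat whatever the optional slash group captured, mixed forms such as 12/252022 are rejected.

```python
import re

# The backreference pattern from above, split across lines for readability.
pattern = re.compile(
    r"^(?:02(\/?)(?:[01]\d|2\d)\1(?:19|20)(?:0[048]|[13579][26]|[2468][048])"
    r"|(?:0[13578]|10|12)(\/?)(?:[0-2]\d|3[01])\2(?:19|20)\d{2}"
    r"|(?:0[469]|11)(\/?)(?:[0-2]\d|30)\3(?:19|20)\d{2}"
    r"|02(\/?)(?:[0-1]\d|2[0-8])\4(?:19|20)\d{2})$"
)

cases = ["12/25/2022", "12252022", "12/252022", "1225/2022",
         "02/29/2020", "02/29/2019", "25/12/2022"]

# Map each input to whether it is accepted as a date.
results = {s: bool(pattern.match(s)) for s in cases}
print(results)
```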

Make sure regex does not match empty string - but with a few caveats

There is a problem that I need to do, but there are some caveats that make it hard.
Problem: Match on all non-empty strings over the alphabet {abc} that contain at most one a.
Examples
a
abc
bbca
bbcabb
Nonexample
aa
bbaa
Caveats: You cannot use a lookahead/lookbehind.
What I have is this:
^[bc]*a?[bc]*$
but it matches empty strings. Maybe a hint? Idk anything would help
(And if it matters, I'm using python).
As I understand your question, the only problem is that your current pattern matches empty strings. To prevent this, you can use a word boundary \b to require at least one word character.
^\b[bc]*a?[bc]*$
See demo at regex101
Another option would be to alternate in a group. Match an a surrounded by any amount of [bc] or one or more [bc] from start to end which could look like: ^(?:[bc]*a[bc]*|[bc]+)$
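Both variants can be checked in Python against the examples and non-examples from the question, plus the empty string. This is a minimal sketch; the helper name accepted is made up for illustration.

```python
import re

# Word-boundary variant and alternation variant from this answer.
boundary = re.compile(r"^\b[bc]*a?[bc]*$")
alternation = re.compile(r"^(?:[bc]*a[bc]*|[bc]+)$")

candidates = ["a", "abc", "bbca", "bbcabb", "", "aa", "bbaa"]

def accepted(pattern):
    """Return the candidates that the given pattern matches."""
    return [s for s in candidates if pattern.match(s)]

print(accepted(boundary))
print(accepted(alternation))
```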
The way I understood the issue was that any character in the alphabet should match, just only one a character.
Match on all non-empty strings over the alphabet... at most one a
^[b-z]*a?[b-z]*$
If spaces can be included:
^([b-z]*\s?)*a?([b-z]*\s?)*$
You do not even need a regex here, you might as well use .count() and a list comprehension:
data = """a,abc,bbca,bbcabb,aa,bbaa,something without the bespoken letter,ooo"""
def filter(string, char):
    return [word
            for word in string.split(",")
            for c in [word.count(char)]
            if c in [0,1]]
print(filter(data, 'a'))
Yielding
['a', 'abc', 'bbca', 'bbcabb', 'something without the bespoken letter', 'ooo']
You've got to positively match something excluding the empty string,
using only a, b, or c letters. But can't use assertions.
Here is what you do.
The regex ^(?:[bc]*a[bc]*|[bc]+)$
The explanation
^ # BOS
(?: # Cluster choice
[bc]* a [bc]* # only 1 [a] allowed, arbitrary [bc]'s
| # or,
[bc]+ # no [a]'s only [bc]'s ( so must be some )
) # End cluster
$ # EOS
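To gain some confidence in the alternation above, a short Python sketch can compare it against the plain-language rule (non-empty, at most one a) for every string over {a, b, c} up to length 5. The length limit and the names are arbitrary choices for the check.

```python
import re
from itertools import product

pattern = re.compile(r"^(?:[bc]*a[bc]*|[bc]+)$")

def expected(s):
    """The rule from the question: non-empty and at most one 'a'."""
    return len(s) > 0 and s.count("a") <= 1

# Every string over the alphabet {a, b, c} of length 0 through 5.
words = ["".join(p) for n in range(0, 6) for p in product("abc", repeat=n)]

# Strings where the regex and the rule disagree.
mismatches = [w for w in words if bool(pattern.match(w)) != expected(w)]
print(len(words), mismatches)
```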

Regular Expression to match many coordinate formats

I am working on a regex that will match many different types of location coordinates. So far it matches about 90% of the formats:
([SNsn][\\s]*)?((?:[\\+-]?[0-9]*[\\.,][0-9]+)|(?:[\\+-]?[0-9]+))(?:(?:[^ms'′""″,\\.\\dNEWnew]?)|(?:[^ms'′""″,\\.\\dNEWnew]+((?:[\\+-]?[0-9]*[\\.,][0-9]+)|(?:[\\+-]?[0-9]+))(?:(?:[^ds°""″,\\.\\dNEWnew]?)|(?:[^ds°""″,\\.\\dNEWnew]+((?:[\\+-]?[0-9]*[\\.,][0-9]+)|(?:[\\+-]?[0-9]+))[^dm°'′,\\.\\dNEWnew]*))))([SNsn]?)[^\\dSNsnEWew]+([EWew][\\s]*)?((?:[\\+-]?[0-9]*[\\.,][0-9]+)|(?:[\\+-]?[0-9]+))(?:(?:[^ms'′""″,\\.\\dNEWnew]?)|(?:[^ms'′""″,\\.\\dNEWnew]+((?:[\\+-]?[0-9]*[\\.,][0-9]+)|(?:[\\+-]?[0-9]+))(?:(?:[^ds°""″,\\.\\dNEWnew]?)|(?:[^ds°""″,\\.\\dNEWnew]+((?:[\\+-]?[0-9]*[\\.,][0-9]+)|(?:[\\+-]?[0-9]+))[^dm°'′,\\.\\dNEWnew]*))))([EWew]?)
Testing the formats:
N 45° 55.732 W 122° 29.882
N 047° 38.938', W 122° 20.887'
40.123, -74.123
40.123° N 74.123° W
40° 7´ 22.8" N 74° 7´ 22.8" W
40° 7.38’ , -74° 7.38’
N40°7’22.8, W74°7’22.8"
40°7’22.8"N, 74°7’22.8"W
40 7 22.8, -74 7 22.8
40.123 -74.123
40.123°,-74.123°
144442800, -266842800
40.123N74.123W
4007.38N7407.38W
40°7’22.8"N, 74°7’22.8"W
400722.8N740722.8W
N 40 7.38 W 74 7.38
40:7:23N,74:7:23W
40:7:22.8N 74:7:22.8W
40°7’23"N 74°7’23"W
40°7’23" -74°7’23"
40d 7’ 23" N 74d 7’ 23" W
40.123N 74.123W
40° 7.38, -74° 7.38
Testing if it works: https://regexr.com/3ivu2
As you can see there are issues with the spaces and commas that are causing the regex to not match some of these formats.
I am trying to match the coordinate strings so that they can be highlighted in my iOS app and allow the user to tap them.
What can I do to update the regex and fix the matching issues?
Overview
I'm sure there are many ways to go about this. Since you haven't specified a regex engine or programming language, I'll post one that works in PCRE and one that should work in most engines. The PCRE regex is much easier to understand than the non-PCRE regex, but both use the exact same logic.
The patterns defined below match each string you've presented in your question and properly separate each part of the coordinate (x, y).
Code
PCRE
This method uses the DEFINE construct to pre-define patterns. The beauty of this construct is that you can define reusable parts of your regex in one location, thus, you can edit most of the regex just by editing these subpatterns.
See regex in use here
(?(DEFINE)
(?<ns>[ns])
(?<ew>[ew])
(?<d>[°´’'"d:])
(?<n>[+-]?\d+(?:\.\d+)?)
)
(
(?&ns)?
(?:\ ?(?&n)(?&d)?){1,3}
\ ?(?&ns)?
)
\ ?,?\ ?
(
(?&ew)?
(?:\ ?(?&n)(?&d)?){1,3}
\ ?(?&ew)?
)
Flags: gix
Non-PCRE
See regex in use here
(
[ns]?
(?:\ ?[+-]?\d+(?:\.\d+)?[°´’'"d:]?){1,3}
\ ?[ns]?
)
\ ?,?\ ?
(
[ew]?
(?:\ ?[+-]?\d+(?:\.\d+)?[°´’'"d:]?){1,3}
\ ?[ew]?
)
Flags: gix.
Some engines don't have the x flag. For those engines you can use the following one-liner (as seen here):
([ns]?(?: ?[+-]?\d+(?:\.\d+)?[°´’'"d:]?){1,3} ?[ns]?) ?,? ?([ew]?(?: ?[+-]?\d+(?:\.\d+)?[°´’'"d:]?){1,3} ?[ew]?)
Explanation
Since both patterns are essentially the same (non-PCRE is just an expanded version of the PCRE), I'll define the PCRE regex pattern since it's easier to grasp.
Note that the patterns that use x have escaped spaces since they would otherwise be ignored (x ignores whitespace within the pattern). The i flag allows us to match text regardless of case (i makes our pattern case-insensitive).
DEFINE
(?(DEFINE)...) The DEFINE group never matches anything by itself. It works like a set of variable definitions (name=value): each named subpattern can be recalled by name elsewhere in the regex.
(?<ns>[ns]) The group ns matches any character in the set nsNS
(?<ew>[ew]) The group ew matches any character in the set ewEW
(?<d>[°´’'"d:]) The group d matches any character in the set °´’'"d:
(?<n>[+-]?\d+(?:\.\d+)?) The group n matches any number that matches the following structure
[+-]? Optionally match any character in the set +-
\d+ Match one or more digits
(?:\.\d+)? Optionally match a decimal point followed by one or more digits
Pattern
The pattern is composed of 3 larger parts. The first and last are capture groups (the coordinates themselves) and the second is what separates the two.
Capture 1:
(?&ns)? Optionally match the group ns
(?:\ ?(?&n)(?&d)?){1,3} Matches [an optional space, followed by the group n then optionally group d] between one and three times
\ ?(?&ns)? Optionally match a space, optionally match the group ns
\ ?,?\ ? Match an optional space, comma and space (this separates each coordinate part)
Capture 2: This is the same as Capture 1 but replaces the group ns with the group ew
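Python's standard re module has no DEFINE construct, so a rough equivalent is to build the one-line pattern from named string constants. This is only a sketch: the constant names and the helper split_coordinate are made up, and the captured groups are stripped because the optional spaces can leave trailing whitespace in the first capture.

```python
import re

# Reusable pieces, playing the role of the DEFINE subpatterns n and d.
NUM = r"[+-]?\d+(?:\.\d+)?"
DELIM = r"""[°´’'"d:]"""
# One to three "number plus optional delimiter" parts, each with an optional space.
PART = rf"(?: ?{NUM}{DELIM}?){{1,3}}"

COORD = re.compile(
    rf"([ns]?{PART} ?[ns]?) ?,? ?([ew]?{PART} ?[ew]?)",
    re.IGNORECASE,
)

def split_coordinate(text):
    """Return the two coordinate halves as a tuple, or None if nothing matches."""
    m = COORD.search(text)
    if m is None:
        return None
    return m.group(1).strip(), m.group(2).strip()

for sample in ["40.123, -74.123", "40:7:23N,74:7:23W",
               "N 45° 55.732 W 122° 29.882", '40°7’23"N 74°7’23"W']:
    print(split_coordinate(sample))
```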
This simplified regex literally matches all the patterns you've given:
^((?:[NW]? ?(?:[-\d.d]+[NW:°´’'",]?[ NW]?)+[, ]*)+[NW]?)$
I'm not an expert on coordinates, but you can modify it easily if I didn't take some specifics into account.
A full test is here.

Regular Expression issue with * laziness

Sorry in advance that this might be a little challenging to read...
I'm trying to parse a line (actually a subject line from an IMAP server) that looks like this:
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?=
It's a little hard to see, but there are two =?/?= pairs in the above line. (There will always be one pair; there can theoretically be many.) In each of those =?/?= pairs, I want the third argument (as defined by a ? delimiter) extracted. (In the first pair, it's "Here is som", and in the second it's "e text.")
Here's the regex I'm using:
=\?(.+)\?.\?(.*?)\?=
I want it to return two matches, one for each =?/?= pair. Instead, it's returning the entire line as a single match. I would have thought that the ? in the (.*?), to make the * operator lazy, would have kept this from happening, but obviously it doesn't.
Any suggestions?
EDIT: Per suggestions below to replace ".*?" with "[^(\?=)]*?" I'm now trying to do:
=\?(.+)\?.\?([^(\?=)]*?)\?=
...but it's not working, either. (I'm unsure whether [^(\?=)]*? is the proper way to test for exclusion of a two-character sequence like "?=". Is it correct?)
Try this:
\=\?([^?]+)\?.\?(.*?)\?\=
I changed the .+ to [^?]+, which means "everything except ?"
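A minimal Python sketch comparing the original pattern with this one on the subject line from the question; the variable names are illustrative. The greedy .+ runs to the last ?Q? in the line, so the original finds a single match, while [^?]+ stops at the first question mark.

```python
import re

subject = "=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?="

# Original pattern: the greedy (.+) swallows the first encoded word.
greedy = re.compile(r"=\?(.+)\?.\?(.*?)\?=")
# Suggested pattern: the charset part cannot contain a question mark.
restricted = re.compile(r"=\?([^?]+)\?.\?(.*?)\?=")

greedy_matches = greedy.findall(subject)
restricted_matches = restricted.findall(subject)
print(greedy_matches)
print(restricted_matches)
```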
A good practice in my experience is not to use .*? but instead do use the * without the ?, but refine the character class. In this case [^?]* to match a sequence of non-question mark characters.
You can also match more complex endmarkers this way, for instance, in this case your end-limiter is ?=, so you want to match nonquestionmarks, and questionmarks followed by non-equals:
([^?]*\?[^=])*[^?]*
At this point it becomes harder to choose though. I like that this solution is stricter, but readability decreases in this case.
One solution:
=\?(.*?)\?=\s*=\?(.*?)\?=
Explanation:
=\? # Literal characters '=?'
(.*?) # Lazily match characters until the next part of the regex ('?=') can match.
\?= # Literal characters '?='
\s* # Match spaces.
=\? # Literal characters '=?'
(.*?) # Lazily match characters until the next part of the regex ('?=') can match.
\?= # Literal characters '?='
Test in a 'perl' program:
use warnings;
use strict;
while ( <DATA> ) {
printf qq[Group 1 -> %s\nGroup 2 -> %s\n], $1, $2 if m/=\?(.*?)\?=\s*=\?(.*?)\?=/;
}
__DATA__
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?=
Running:
perl script.pl
Results:
Group 1 -> utf-8?Q?Here is som
Group 2 -> utf-8?Q?e text.
EDIT to comment:
I would use the global modifier /.../g. Regular expression would be:
/=\?(?:[^?]*\?){2}([^?]*)/g
Explanation:
=\? # Literal characters '=?'
(?:[^?]*\?){2} # Any number of characters except '?' with a '?' after them. This is repeated twice to skip the string 'utf-8?Q?'
([^?]*) # Capture the following characters up to the next '?'
/g # Repeat this process until the end of the string.
Tested in a Perl script:
use warnings;
use strict;
while ( <DATA> ) {
printf qq[Group -> %s\n], $1 while m/=\?(?:[^?]*\?){2}([^?]*)/g;
}
__DATA__
=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?= =?utf-8?Q?more text?=
Running and results:
Group -> Here is som
Group -> e text.
Group -> more text
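For comparison, roughly the same global match in Python: re.findall plays the part of Perl's /g loop. This is a minimal sketch using the same test line.

```python
import re

line = "=?utf-8?Q?Here is som?= =?utf-8?Q?e text.?= =?utf-8?Q?more text?="

# Skip two '?'-terminated fields (charset and encoding), then capture the text.
pattern = re.compile(r"=\?(?:[^?]*\?){2}([^?]*)")

# With a single capture group, findall returns the captured text of every match.
parts = pattern.findall(line)
print(parts)
```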
Thanks for everyone's answers! The simplest expression that solved my issue was this:
=\?(.*?)\?.\?(.*?)\?=
The only difference between this and my originally-posted expression was the addition of a ? (non-greedy) operator on the first ".*". Critical, and I'd forgotten it.

Regex with lookahead

I can't seem to make this regex work.
The input is as follows. It's really on one row, but I have inserted line breaks after each \r\n so that it's easier to see, so no check for whitespace characters is needed.
01-03\r\n
01-04\r\n
TEXTONE\r\n
STOCKHOLM\r\n
350,00\r\n ---- 350,00 should be the last value in the first match
12-29\r\n
01-03\r\n
TEXTTWO\r\n
COPENHAGEN\r\n
10,80\r\n
This could go on with another 01-31 and 02-01, marking another new match (these are dates).
I would like to have a total of 2 matches for this input.
My problem is that I can't figure out how to look ahead and match the start of a new match (two consecutive dates) without including those dates in the first match. They should belong to the second match.
It's hard to explain, but I hope someone will get me.
This is what I've got so far, but it's not even close:
(.*?)((?<=\\d{2}-\\d{2}))
The matches I want are:
1: 01-03\r\n01-04\r\nTEXTONE\r\nSTOCKHOLM\r\n350,00\r\n
2: 12-29\r\n01-03\r\nTEXTTWO\r\nCOPENHAGEN\r\n10,80\r\n
After that I can easily separate the columns with \r\n.
Can this more explicit pattern work for you?
(\d{2}-\d{2})\r\n(\d{2}-\d{2})\r\n(.*)\r\n(.*)\r\n(\d+(?:,?\d+))
Here's another option for you to try:
(.+?)(?=\d{2}-\d{2}\\r\\n\d{2}-\d{2}|$)
Rubular
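A minimal Python sketch of the same idea: lazy matching that stops at a lookahead for the next pair of dates or the end of the input. It assumes the input contains real CR LF characters rather than the literal text \r\n, and it anchors each record on its two leading dates, which is a small variation on the pattern above.

```python
import re

# Assumption: real carriage-return/line-feed characters in the input.
data = ("01-03\r\n01-04\r\nTEXTONE\r\nSTOCKHOLM\r\n350,00\r\n"
        "12-29\r\n01-03\r\nTEXTTWO\r\nCOPENHAGEN\r\n10,80\r\n")

# Two date lines, then as little as possible up to the next two date lines or the end.
record = re.compile(
    r"(?:\d{2}-\d{2}\r\n){2}.*?(?=(?:\d{2}-\d{2}\r\n){2}|\Z)",
    re.DOTALL,
)

matches = record.findall(data)
for m in matches:
    print(repr(m))
```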
/
\G
(
(?:
[0-9]{2}-[0-9]{2}\r\n
){2}
(?:
(?! [0-9]{2}-[0-9]{2}\r\n ) [^\n]*\n
)*
)
/xg
Why do so much work?
$string = q(01-03\r\n01-04\r\nTEXTONE\r\nSTOCKHOLM\r\n350,00\r\n12-29\r\n01-03\r\nTEXTTWO\r\nCOPENHAGEN\r\n10,80\r\n);
for (split /(?=(?:\d{2}-\d{2}\\r\\n){2})/, $string) {
print join( "\t", split /\\r\\n/), "\n"
}
Output:
01-03 01-04 TEXTONE STOCKHOLM 350,00
12-29 01-03 TEXTTWO COPENHAGEN 10,80
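The same split-on-lookahead approach can be sketched in Python. Two assumptions: the input uses real CR LF characters, and the interpreter is Python 3.7 or later, where re.split can split on zero-width matches. The empty leading chunk produced by the match at position 0 is filtered out.

```python
import re

# Assumption: real carriage-return/line-feed characters in the input.
data = ("01-03\r\n01-04\r\nTEXTONE\r\nSTOCKHOLM\r\n350,00\r\n"
        "12-29\r\n01-03\r\nTEXTTWO\r\nCOPENHAGEN\r\n10,80\r\n")

# Zero-width split point: just before two consecutive date lines.
boundary = re.compile(r"(?=(?:\d{2}-\d{2}\r\n){2})")

# Split into records, drop empty chunks, then split each record into columns.
rows = [chunk.rstrip("\r\n").split("\r\n")
        for chunk in boundary.split(data) if chunk]

for row in rows:
    print("\t".join(row))
```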