Match Regular Expressions patterns if exist, else - regex

Here is what I am trying to achieve. Given a certain set of data I am trying to get the entire row that contains the matching regular expressions that I have.
Essentially, given a data set such as this
AFAM 002A AFAM & DEV AM HIS/GV 03 46493 3 LEC D2 70 P 20/15 W 1800-2045 08/24/16-12/12/16 WSQ 207 K WHITE
AFAM 102 AFRO-AMER MUSIC 01 47200 3 LEC P 5/30 W 1800-2045 08/24/16-12/12/16 MUS 250 V GROCE-ROBERTS
AFAM 125 THE BLACK FAMILY 01 47198 3 LEC P 16/40 M 1800-2045 08/24/16-12/12/16 CCB 101 S MILLNER
AFAM 152 THE BLACK WOMAN 01 47199 3 LEC P 8/40 T 1800-2045 08/24/16-12/12/16 CL 111 R WILSON
AFAM 159 ECON ISSUES BLKCM 01 47197 3 LEC P 11/40 MW 1330-1445 08/24/16-12/12/16 CL 234 R WILSON
AFAM 180 INDIVIDUAL STUDIES 01 46982 3 SUP P 0/10 TBA TBA 08/24/16-12/12/16
The regex that I have created basically groups the following into..
Course ID eg. AFAM 002A
Course Name eg. AFRO-AMER MUSIC
Start date
end date
Professor Name (This is the value that I want to be optional)
The problem I am having is with the optional value: I want the regex to check whether it exists and, if it does not, leave the group empty. If someone could show me the correct way to do this, I would greatly appreciate it.
Essentially, this part of my regular expression, ([A-Z][\s][A-Z]+[-]*[A-Z]+)?, needs to be included if it exists. I understand that this is how the ? operator is supposed to work, but I can't seem to find the right keyword for this question, so here I am.
([A-Z]+[\s][0-9]+[A-Z]*)(.+)[\s][0-9]+[\s][0-9]+.+(\d\d\/\d\d\/\d\d)-(\d\d\/\d\d\/\d\d)[\s]([A-Z][\s][A-Z]+[-]*[A-Z]+)?
The Expected results for this dataset for the last two rows should be
{ [ (AFAM 159), (ECON ISSUES BLKCM), (08/24/16), (12/12/16), (R WILSON)],
[(AFAM 180), (INDIVIDUAL STUDIES), (08/24/16), (12/12/16), ()]
}

Your regex does not match CL 234 in the last but one line; you need to consume it. However, just adding .*? won't work, because a lazy .*? in front of an optional group is satisfied by matching nothing. You need to make your optional pattern obligatory (remove ?) and wrap .*?([A-Z]\s[A-Z]+-*[A-Z]+) in an optional non-capturing group (?:....).
([A-Z]+\s\d+[A-Z]*)(.+?)\s\d+\s\d+.+?(\d\d\/\d\d\/\d\d)-(\d\d\/\d\d\/\d\d)\s(?:.*?([A-Z]\s[A-Z]+-*[A-Z]+))?
See the regex demo.
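As a rough illustration, the pattern above can be tried in Python on the sample rows. This is a sketch, not the answerer's code: the function name parse_row is made up for the example, and it makes one adjustment that is an assumption of this sketch, namely moving the whitespace after the end date inside the optional group, so that a row ending right after the date still matches when each line is processed on its own.

```python
import re

# The answer's pattern with one adjustment (an assumption of this sketch):
# the whitespace after the end date sits inside the optional group, so rows
# that end right after the date still match when handled line by line.
ROW = re.compile(
    r"([A-Z]+\s\d+[A-Z]*)(.+?)\s\d+\s\d+.+?"
    r"(\d\d/\d\d/\d\d)-(\d\d/\d\d/\d\d)"
    r"(?:\s.*?([A-Z]\s[A-Z]+-*[A-Z]+))?"
)


def parse_row(line):
    """Return (course_id, name, start, end, professor), or None if no match."""
    m = ROW.search(line)
    if m is None:
        return None
    course_id, name, start, end, professor = m.groups()
    # (.+?) begins right after the course ID, so it carries a leading space.
    # An unmatched optional group is None in Python; map it to "".
    return (course_id, name.strip(), start, end, professor or "")


rows = [
    "AFAM 159 ECON ISSUES BLKCM 01 47197 3 LEC P 11/40 MW 1330-1445 08/24/16-12/12/16 CL 234 R WILSON",
    "AFAM 180 INDIVIDUAL STUDIES 01 46982 3 SUP P 0/10 TBA TBA 08/24/16-12/12/16",
]
for row in rows:
    print(parse_row(row))
```

The optional group is greedy, so the engine first tries to find a professor name and only falls back to an empty group when none is present.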

Regex matching issue to Test-String

I have a problem and I don't understand it.
My Regex:
My Test-String:
I have two issues and one general question :)
As you can see in my Test-String, the very last (German) phone number (the big yellow one in the Test-String attachment) does not match my regex pattern correctly. What is the problem here? The "0049" fits Group 5 but should fit Group 2. Why is that?
My second problem: how can I get rid of the spaces before and after every match? (The 7 small yellow circles in the Test-String attachment.)
For copy/paste purposes, here is the Regex and Test-String again:
Regex:
((\+\d{2}|00\d{2})?([ ])?(\()?(\d{2,4})(\))?([-| |/])?(\d{3,})([ ])?(\d+)?([ ])?(\d+)?)
Test-String:
Vorwahl 089, die E.123 ebenfalls , also (089) 1234567. Die DIN 5008, also +49 89 1234567 respectivly 0049 89 1234567. Die E.123 empfiehlt, also +49 89 123456 0 respectivly 0049 89 123456 0 oder +49 89 123456 789. Also +49 89 123 456 789. Klammern 089/1234567 und 0151 19406041. Test +49 151 123 456 789 respectivly 0049 151 123 456 789
Last but not least, my general question:
Is it a good approach to group each logical part as I did in my example?
One last piece of information: I validate my regex with https://regex101.com/ and use it in Python with the re module.
What makes the behaviour unpredictable is the large number of optional groups (..)?.
As a first step, I recommend replacing ([ ])?(\d+)? with a coupled expression ([ ]?\d+)?, which avoids spaces at the end of the match (your point #2).
As a second step, I recommend coupling the first optional space with the expression for the international dialling prefix: ((\+|00)\d{2}([ ])?)?. This solves both the space at the beginning and the recognition of the whole number, because there are fewer possible ways to match.
The new expression now looks like this:
(((\+|00)\d{2}([ ])?)?(\()?(\d{2,4})(\))?([-| |/])?(\d{3,})([ ]?\d+)?([ ]?\d+)?)
I now recommend simplifying the last part, if you don't need the individual group values:
(((\+|00)\d{2}([ ])?)?(\()?(\d{2,4})(\))?([-| |/])?(\d{3,})([ ]?\d+){0,2})
For better performance, I suggest removing the parentheses/groups where possible, or marking them as non-capturing, if you don't need the specific group values.
In some programming languages you will not need the outermost parentheses, as the whole match is always available as group 0.
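Since the question mentions Python's re module, here is a minimal sketch of the simplified expression with every group made non-capturing. It assumes the individual group values are not needed. One deliberate deviation, which is my assumption about the intent: the separator class is written [- /] instead of [-| |/], because inside a character class | is a literal pipe rather than an alternation.

```python
import re

PHONE = re.compile(
    r"(?:(?:\+|00)\d{2} ?)?"  # optional international prefix plus its space
    r"\(?\d{2,4}\)?"          # area code, optionally in parentheses
    r"[- /]?"                 # optional separator
    r"\d{3,}"                 # first block of the subscriber number
    r"(?: ?\d+){0,2}"         # up to two further blocks, each with its space
)

TEXT = (
    "Vorwahl 089, die E.123 ebenfalls , also (089) 1234567. "
    "Die DIN 5008, also +49 89 1234567 respectivly 0049 89 1234567. "
    "Die E.123 empfiehlt, also +49 89 123456 0 respectivly 0049 89 123456 0 "
    "oder +49 89 123456 789. Also +49 89 123 456 789. "
    "Klammern 089/1234567 und 0151 19406041. "
    "Test +49 151 123 456 789 respectivly 0049 151 123 456 789"
)

# With no capturing groups, findall returns the full text of each match.
matches = PHONE.findall(TEXT)
for number in matches:
    print(repr(number))
```

Because each optional space is tied to the digits that follow it, no match starts or ends with a space, and the final 0049 number is picked up as a single match.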

How can I find values which do not contain characters and spaces?

Below is sample data from the list I am working with:
74
7491
75
75010
75013
78
8081
84
8400 Winterthu
852
9000 Aalborg
974
A
A CORUÑA
aa
Aalborg
Aargau
Aarhus
aas
AAT
AB
ABERC
Abu Dhabi
Abuja
AC
ACT
AD
Using [^\p{L}-] I can get a list but it also includes the following values which I do not want in the list
Abu Dhabi
Puerto Rico
Hong Kong
How can I do this?
You want to find multiple items, so you must use the g option.
You will be checking each line separately. The usual way to construct the pattern for such a case is ^...$, but here ^ and $ should match the beginning and end of each line, not of the whole string, so you must also use the m option.
The last point is what content a candidate line may have, i.e. what should be between ^ and $: any non-empty sequence of letters in any language or a literal minus, i.e. [\p{L}-]+.
So, to sum up, the whole regex should be:
/^[\p{L}-]+$/gm
This way names containing a space (e.g. Puerto Rico) will not be matched (as you specified).
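For readers working in Python, a hedged sketch of the same idea follows. Python's standard re module does not support \p{L}, so this example substitutes [^\W\d_] (a word character that is neither a digit nor an underscore), which only approximates "any letter"; the third-party regex module accepts \p{L} directly. The function name letter_only_lines is made up for the example.

```python
import re

# [^\W\d_] stands in for \p{L} here (an approximation, see above).
# re.MULTILINE plays the role of the m flag; findall replaces the g flag.
LETTERS_ONLY = re.compile(r"^(?:[^\W\d_]|-)+$", re.MULTILINE)


def letter_only_lines(text):
    """Return the lines that consist solely of letters and hyphens."""
    return LETTERS_ONLY.findall(text)


sample = "74\n8400 Winterthu\nA\nA CORUÑA\naa\nAalborg\nAbu Dhabi\nAC"
print(letter_only_lines(sample))
```

Lines containing a digit or a space, such as 8400 Winterthu or Abu Dhabi, fail the ^...$ anchoring and are left out.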
Say your file is test.dat.
A simple one-line grep will give what you want:
grep -o -P "[0-9]+$" test.dat
Output:
74
7491
75
75010
75013
78
8081
84
852
974

C# regex get just number in specific condition

I want to get the number at a specific position in a string, and I can't manage it.
example:
STRING:
180 MATTHEW SANDLER DON 30.00 1.361,67 00
181 JOHN 30.00 5.987,00 99
182 LUCY P. 30.00 3.888,98 71
I want to return on each line just the numbers:
1.361,67
5.987,00
3.888,98
Unfortunately the name has a variable number of spaces; otherwise it would be a simple string.Split(' ') problem.
Does anyone know how to do it, please?
The following pattern should match the values in your example:
\b\S*,\d+\b
Example:
http://rextester.com/LZVQN62207
If we conceptually define the term you want to match as being the last term before the final two (or more?) numbers at the end of each line, then we can use the following regex pattern:
(\d+\.\d+,\d+) \d+$
The quantity in parentheses will be captured and made available after the regex has run in C#.
string input = "180 MATTHEW SANDLER DON 30.00 1.361,67 00";
var groups = Regex.Match(input, @"(\d+\.\d+,\d+) \d+$").Groups;
var x1 = groups[1].Value;
Console.WriteLine(x1);
Demo here:
Rextester
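To see the same pattern applied to all three sample lines at once, here is a small Python sketch (Python is used only for illustration; the original answer targets C#). It assumes the lines arrive as one newline-separated string, so re.MULTILINE is added to let $ match at the end of every line.

```python
import re

# Same pattern as above; MULTILINE lets $ match at the end of each line.
AMOUNT = re.compile(r"(\d+\.\d+,\d+) \d+$", re.MULTILINE)

text = (
    "180 MATTHEW SANDLER DON 30.00 1.361,67 00\n"
    "181 JOHN 30.00 5.987,00 99\n"
    "182 LUCY P. 30.00 3.888,98 71"
)

# With a single capturing group, findall returns just the captured amounts.
amounts = AMOUNT.findall(text)
print(amounts)
```

The 30.00 column is skipped because the pattern requires a comma after the digits that follow the dot.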

Regex to remove multiple white-spaces stops at first space

I am trying to find the correct regex to remove all white spaces for different formats of strings like:
A 41 FR 38 ( should become A41FR38)
DGT 4687 P ( should become DGT4687P)
POL 789 EU ( should become POL789EU )
I have tried:
[^\s]+
[^\d]+
and many others, none seem to work, they would only stop at the first space? For example POL 789 EU would become POL, and W 85 EU would become W
https://regex101.com/r/kA1sW4/1
Is this possible?
- EDIT -
I have just discovered that the correct different strings would be HTML outputs. Such as :
.html">W 45 B 1 A 401 L</a>
so I have just tried: html">([^<]*) and it outputs:
W 45 B 1 A 401 L
(still with spaces) What should I add to remove the spaces?
demo (still with spaces) https://regex101.com/r/kA1sW4/2
Even simpler: just use str_replace
echo str_replace(' ','','A 41 FR 38');
Results in:
A41FR38
([^\s]+)/g
The g flag indicates that the regular expression should be tested against all possible matches in a string.
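A pattern such as [^\s]+ only finds the non-space pieces; to end up with a single string, the spaces have to be replaced or the pieces joined. The sketch below shows this in Python (chosen only for illustration; the answers above use PHP and a /g regex). The helper names strip_spaces and code_from_anchor are made up for the example, and the anchor pattern is the one from the question's edit.

```python
import re


def strip_spaces(s):
    """Remove every run of whitespace, not just the first one."""
    return re.sub(r"\s+", "", s)


def code_from_anchor(html):
    """Pull the text after html"> out of a link and remove its spaces."""
    m = re.search(r'html">([^<]*)', html)
    return strip_spaces(m.group(1)) if m else None


print(strip_spaces("A 41 FR 38"))
print(code_from_anchor('.html">W 45 B 1 A 401 L</a>'))
```

The capture and the space removal are two separate steps: a single match cannot skip characters in the middle of what it captures.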

how to put captured group inside a character class to negate it?

I'm trying to parse my poker hand history to determine the number of hands I played 7-2 off suit (that is, the 7 is of one suit, and the 2 is of another).
I can get the hands where I played 77 or 22
$ grep -E "Dealt to .* \[([7|2])[s|c|h|d]\s\1" ~/poker/handhistory/*/* | wc -l
15
And the hands where I played 72 of the same suit.
$ grep -E "Dealt to .* \[([7|2])([s|c|h|d])\s[7|2]\2" ~/poker/handhistory/GMulligan/* | wc -l
9
I've captured the rank of the first card. What I'd like to do is have a character class that contains [7] if the first capture group was 2 and [2] if the first capture group was 7.
Can anyone help here?
Update:
Sorry, some sample data would obviously help here.
Every hand that player1 is involved in has a line like this:
Dealt to player1 [4c Ac]
I'm looking specifically for all of the following within the "[" and "]":
7h 2c
7h 2d
7h 2s
7c 2h
7c 2d
7c 2s
7d 2h
7d 2c
7d 2s
7s 2h
7s 2c
7s 2d
You may be able to use negative lookaheads to achieve what you're trying to do.
https://regex101.com/r/yK4oC7/2 (* denotes matches)
Dealt to player1 []
Dealt to player1 [7c 2c]
Dealt to player1 [7c 2h] *
Dealt to player1 [7d 7c]
Here's a breakdown of the regex \[([72])([sdch]) (?\!\1)([72])(?\!\2)([sdch]). (In bash, ! is a special character and must be escaped, so while many languages write a negative lookahead as (?!....), bash appears to need (?\!....).)
\[ - match a literal [
([72]) - match 7 or 2 and capture it as \1
([sdch]) - match s, d, c or h and capture it as \2
(?!\1)([72]) - after a literal space, match a 7 or 2 that is not the same as \1
(?!\2)([sdch]) - match s, d, c or h where it is not the same as the suit captured as \2
Edit: I don't use bash, so I'm unfamiliar with the nuances, but the two answers to How to use regex negative lookahead should be useful in devising the proper syntax. Note that lookaheads need a PCRE-capable engine, e.g. grep -P rather than grep -E.
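To check the lookahead logic independently of shell quoting, the same pattern can be run in Python. This is a minimal sketch: the name count_72_offsuit is hypothetical, and a closing \] is added here (my assumption) so that exactly two cards must sit between the brackets.

```python
import re

# (?!\1) rejects a second rank equal to the first (no 77 or 22);
# (?!\2) rejects a second suit equal to the first (no suited 7-2).
OFFSUIT_72 = re.compile(r"\[([72])([sdch]) (?!\1)([72])(?!\2)([sdch])\]")


def count_72_offsuit(lines):
    """Count the lines that show a 7-2 (or 2-7) hand in two different suits."""
    return sum(1 for line in lines if OFFSUIT_72.search(line))


hands = [
    "Dealt to player1 [7c 2c]",  # same suit: rejected
    "Dealt to player1 [7c 2h]",  # off suit: counted
    "Dealt to player1 [7d 7c]",  # pair: rejected
    "Dealt to player1 [2s 7h]",  # off suit, 2 first: counted
    "Dealt to player1 [4c Ac]",  # other ranks: rejected
]
print(count_72_offsuit(hands))
```

Each lookahead checks the next character without consuming it, which is how the pattern negates a captured group without needing a character class built from that group.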