Regex possible Whitespaces and trailing characters - regex

I have texts similar to the following (whitespaces intended), which i run a RegEx on line-by-line:
Smith-Petersen X1l
Jonas Henry
Foord. 82a 221.
12345 Somewhere
I now want to use the RegEx to capture anything before 3 or more whitespaces occur (which might or might not occur) in the first match group. The allowed chars:
[a-zA-Z0-9,. '\-AÖÜäöüß]
What I want is : Smith-Petersen, Jonas Henry, Foord. 82a and 12345 Somewhere.
After trying desperately, I hope to find help with this here...I just can't get it to work since my expression grabs the blanks and what follows and puts it into the first group as well. Is there a ways to reverse the way the RegEx? Can anyone help me with this?

Assuming by "may or may not occur" you mean the line may end before 3 spaces are encountered:
^\s*([-a-zA-Z0-9,\.'AÖÜäöüß ]+?)(?=\s{3}|\s{0,2}$)
This regex is using a positive look ahead to assert that either there's 3 spaces following or there's up to 2 spaces then end-of-input.
The anchor to start of input avoids matching the junk at the end of the longer lines.
Your target is in group 1.
See a live demo on rubular

Here is my approach.
^ *([a-zA-Z0-9,.'AÖÜäöüß-]+(?: {1,2}[a-zA-Z0-9,.'AÖÜäöüß-]+)*)
What you want is in match group 1. This regex uses only greedy operators and works on all four cases found in your sample text.
Basically it matches all words at the beginning of a line that are separated from one another by no more than two spaces. Once more than 2 spaces are found, the match is completed.

Related

Regex for a string that contains anything but consecutive spaces

I think it's easiest to start off with an example line
3432543 name that % might 7 include pretty . much 433 anything 545231 4522
I'm looking for an expression that matches the name that % might 7 include pretty . much 433 anything portion so that it includes everything until it encounters 2 spaces, and will not match if it had to include 2 or more spaces.
So for example I want this pattern
^\d+ +(pattern) +\d+$
to end up not matching, and not ending up including the 545231 as part of the name.
Please keep in mind that this is just a simple example to illustrate the problem, it will be included in much more complex expressions matching more complex strings.
You may use this regex to match substring that is only single space separated:
\S+(?:\s\S+)*
RegEx Demo
This match will start with 1+ non-whitespace text followed 0+ of such words separated by a single space only.
For your desired pattern match use:
^\d+\s+\S+(?:\s\S+)*\s+\d+$

Detect multiple periods in Regex and kill entire match

I'm trying to detect a price in regex with this:
^\-?[0-9]+(,[0-9]+)?(\.[0-9]+)?
This covers:
12
12.5
12.50
12,500
12,500.00
But if I pass it
12..50 or 12.5.0 or 12.0.
it still returns a match on the 12 . I want it to negate the entire string and return no match at all if there is more than one period in the entire string.
I've been trying to get my head around negative lookaheads for an hour and have searched on Stack Overflow but can't seem to find the right answer. How do I do this?
What you are looking for, is this:
^\d+(,\d{3})*(\.\d{1,2})?$
What it does:
^ Start of Line
\d+ one or more Digits followed by
(,\d{3})* zero, one or more times a , followed by three Digits followed by
(\.\d{1,2})? one or zero . followed by one or two Digits followed by
$ End of Line
This will only match valid Prices. The Comma (,) is not obligatory in this Regex, but it will be matched.
Look here: http://www.regextester.com/?fam=98001
If you work with Prices and want to store them in a Database I recommend saving them as INT. So 1,234,56 becomes 123456 or 1,234 becomes 123400. After you matched the valid price, all you have to do is to remove the ,s, split the Value by the Dot, and fill the Value of [1] with str_pad() (STR_PAD_RIGHT) with Zeros. This makes Calculations easier, in special when you work with Javascript or other different Languages.
Your regex:
^\-?[0-9]+(,[0-9]+)?(\.[0-9]+)?
Note: The regex you provided does not seem to work for 12 (without "."). Since you didn't add a quantifier after \., it tries to match that pattern literally (.).
While there are multiple ways to solve this and the most "correct" answer will depend on your specific requirements, here's a regex that will not match 12..1, but will match 12.1:
(^\-?[0-9]+(?:,[0-9]+)?(?:\.[0-9]+))+
I surrounded the entire regex you provided in a capturing group (...), and added a one or more quantifier + at the end, so that the entire regex will fail if it does not satisfy that pattern.
Also (this may or may not be what you want), I modified the inner groups into non-capturing groups (?: ... ) so that it does not return unnecessary groups.
This site offers a deconstruction of regexes and explains them:
For the regex provided: https://regex101.com/r/EDimzu/2
Unit tests: https://regex101.com/r/EDimzu/2/tests (Note the 12 one's failure for multiple languages).
You can limit it by requiring there is only 0 or 1 periods like this:
^[0-9,]+[\.]{0,1}?[0-9,]+$

Regex: Match 'no characters' between strings

I have to verify that strings match the following format before the first whitespace (if there is one):
Up to 3 leading letters
At least 4 consecutive digits
Up to 3 trailing letters
To give examples, the following are valid:
1234
Abc123456DeF
1234 blah+
XyZ01234
I'm having trouble avoiding this case however: 123a+b blah
So far I have (^\w{0,3}\d{4}\w{0,3})\s* but the problem lies in making sure a non-letter isn't caught in the first section.
I can see a couple solutions:
Run regex twice, first getting the string up to the first whitespace ([^\s]+) then apply regex again to that making sure it ends in up to 3 letters (^\w{0,3}\d{4}\w{0,3}$). This is what I do now, but surely there's a way to do this in one expression - I just can't figure out how
Make sure no non-letters exist between the (potential) 3 trailing letters and the (potential) whitespace. (^\w{0,3}\d{4}\w{0,3}no non-letters)\s*
I've tried negative lookahead (?!.*) but that doesn't seem to do anything.
This regex satisfy your specifications.
Regex: ^\w{0,3}\d{4,}\w{0,3}\s?$
Explanation:
According to your specifications.
\w{0,3}? Up to 3 leading letters
\d{4,} At least 4 consecutive digits
\w{0,3}? Up to 3 trailing letters
I have to verify that strings match the following format before the first whitespace (if there is one):
\s? hence an optional space.
Regex101 Demo
Note:- I am keeping this as stroked out because there were many shortcomings pointed out in comments. So to maintain the context of comments.
Solution:
Like I said in my comment.
#JCK: Problem is . . even whitespace is optional. Thus making it difficult to differentiate between first and second part.
Now employing a lookahead solves this problem. Complete regex goes like this.
Regex: ^(?=.*[0-9]{4,}[A-Za-z]{0,3}(?:\s|$))[A-Za-z]{0,3}[0-9]{4,}[A-Za-z]{0,3}\s*?(?:\S*\s*)*$
Explanation:
(?=.*[0-9]{4,}[A-Za-z]{0,3}(?:\s|$)) This positive lookahead makes sure that the first part defined by your specifications is matched. It looks for mentioned specs and either a \s or $ i.e end of string. Thus matching the first part.
[A-Za-z]{0,3}[0-9]{4,}[A-Za-z]{0,3}\s*?(?:\S*\s*)* Rest of the regex is as per the specifications.
Check by entering strings one by one.
Regex: (^[A-Za-z]{0,3}\d{4,}[A-Za-z]{0,3})(?:$|\s+)
\w is same as [A-Za-z0-9_], so to match just letters you should use [A-Za-z].
(?:$|\s+) matches end of string or at least one whitespace (hence ignoring the rest of the string).

Capture number between two whitespaces (RegEx)

I have the following data:
SOMEDATA .test 01/45/12 2.50 THIS IS DATA
and I want to extract the number 2.50 out of this. I have managed to do this with the following RegEx:
(?<=\d{2}\/\d{2}\/\d{2} )\d+.\d+
However that doesn't work for input like this:
SOMEDATA .test 01/45/12 2500 THIS IS DATA
In this case, I want to extract the number 2500.
I can't seem to figure out a regex rule for that. Is there a way to extract something between two spaces ? So extract the text/number after the date until the next whitespace ? All I know is that the date will always have the same format and there will always be a space after the text and then a space after the number I want to extract.
Can someone help me out on this ?
Capture number between two whitespaces
A whitespace is matched with \s, and non-whitespace with \S.
So, what you can use is:
\d{2}\/\d{2}\/\d{2} +(\S+)
^^^
See the regex demo
The 1+ non-whitespace symbols are captured into Group 1.
If - for some reason - you need to only get the value as a whole match, use your lookbehind approach:
(?<=\d{2}\/\d{2}\/\d{2} )\S+
Or - if you are using PCRE - you may leverage the match reset operator \K:
\d{2}\/\d{2}\/\d{2} +\K\S+
^^
See another demo
NOTE: the \K and a capture group approaches allow 1 or more spaces after the date and are thus more flexible.
I see some people helped you already, but if you would want an alternative working one for some reason, here's what works too :)
.+ \d+\/\d+\/\d+ (\d+[\.\d]*)
So the .+ matches anything plus the first space
then the \d+/\d+/\d+ is the date parsing plus a space
the capturing group is the number, as you can see I made the last part optional, so both floating point values and normal values can be matched. Hope this helped!
Proof: https://regex101.com/r/fY3nJ2/1
Just make the fractal part optional:
(?<=\d{2}\/\d{2}\/\d{2} )\d+(?:\.\d+)?
Demo: https://regex101.com/r/jH3pU7/1
Update following clarifications in comments:
To match anything (but space) surrounded by spaces and prepended by date use:
(?<=\d{2}\/\d{2}\/\d{2} )\S+
Demo: https://regex101.com/r/jH3pU7/3
Rather than capture, you can make your entire match be the target text by using a look behind:
(?<=\d\d(\/\d\d){2} )\S+
This matches the first series of non-whitespace that follows a "date like" part.
Note also the reduction in the length of the "date like" pattern. You may consider using this part of the regex in whatever solution you use.

How to match everything up to the second occurrence of a character?

So my string looks like this:
Basic information, advanced information, super information, no information
I would like to capture everything up to second comma so I get:
Basic information, advanced information
What would be the regex for that?
I tried: (.*,.*), but I get
Basic information, advanced information, super information,
This will capture up to but not including the second comma:
[^,]*,[^,]*
English translation:
[^,]* = as many non-comma characters as possible
, = a comma
[^,]* = as many non-comma characters as possible
[...] is a character class. [abc] means "a or b or c", and [^abc] means anything but a or b or c.
You could try ^(.*?,.*?),
The problem is that .* is greedy and matches maximum amount of characters. The ? behind * changes the behaviour to non-greedy.
You could also put the parenthesis around each .*? segment to capture the strings separately if you want.
I would take a DRY approach, like this:
^([^,]*,){1}[^,]*
This way you can match everything until the n occurrence of a character without repeating yourself except for the last pattern.
Although in the case of the original poster, the group and repetition of the group is useless I think this will help others that need to match more than 2 times the pattern.
Explanation:
^ From the start of the line
([^,]*,) Create a group matching everything except the comma character until it meet a comma.
{1} Count the above pattern (the number of time you need)-1. So if you need 2 put 1, if you need 20 put 19.
[^,]* Repeat the pattern one last time without the tailing comma.
Try this approach:
(.*?,.*?),.*
Link to the solution