Extract with regex when the same special character is used

I've been trying to use Regex tools online, but none seem to be working. I am close but not sure what I'm missing.
Here is the Text:
Valencia, Los Angeles, California - Map
I want to extract the first 2 letters of the state (so between "," and "-"), in this case "CA"
What I've done so far is:
[,/](.*)[-/]
$1
The output is:
Los Angeles, California
If anything I thought I would at least just get the state.

,\s*(\w\w)[^,]*-
will capture Ca in group 1.
, comma
\s* whitespace
(\w\w) capture the first two characters
[^,]* make sure there's no comma up to the next dash
-
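For illustration, here is a minimal sketch of that pattern in Python. The question does not say which language or tool is in use, so Python and the helper name state_abbrev are assumptions. Because the pattern captures "Ca", the sketch upper-cases the result to get "CA".

```python
import re

# Pattern from the answer above: a comma, optional whitespace, two word
# characters (captured), then no further comma before the dash.
STATE_RE = re.compile(r",\s*(\w\w)[^,]*-")

def state_abbrev(text):
    """Return the two captured letters upper-cased, or None if nothing matches."""
    match = STATE_RE.search(text)
    return match.group(1).upper() if match else None

print(state_abbrev("Valencia, Los Angeles, California - Map"))  # CA
```

The [^,]* part is what rejects the first comma: after "Lo" the engine cannot reach a dash without crossing another comma, so the match only succeeds from the second comma.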

,\s*(\S{2})[^,]*-
You're going to want to take just the first match.

I assume you use JavaScript.
Your regex fails this particular case because there are two commas in your input.
One possible fix is to modify the middle capture from . (any character) to [^,] (any character except comma). This will force the regex to match California only.
So, try [,/]([^,]*)[-/]. Here's a demo of how it works.
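To make the difference visible, here is a small sketch comparing the original pattern with the fixed one. It uses Python rather than JavaScript purely for illustration; the variable names are made up for this example.

```python
import re

text = "Valencia, Los Angeles, California - Map"

# Original pattern: the greedy .* starts at the FIRST comma and runs to the dash.
greedy = re.search(r"[,/](.*)[-/]", text).group(1)

# Fixed pattern: [^,]* cannot cross a comma, so only the last segment matches.
restricted = re.search(r"[,/]([^,]*)[-/]", text).group(1)

# The capture still includes the surrounding spaces, so trim before slicing.
state = restricted.strip()[:2].upper()

print(repr(greedy))      # ' Los Angeles, California '
print(repr(restricted))  # ' California '
print(state)             # CA
```

Note that the capture keeps the spaces around "California", so some trimming is needed before taking the first two letters.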

You can use this regex:
.*?,\s(\w\w)[^,]*-
$1 is the first two letters you're looking for.

Related

Make regex match only the capturing group

Due to the technology I'm currently working with (PySpark API), I need to adjust a regex so that the full match corresponds to the capturing group.
I want to use it as a delimiter pattern in a split function
This function splits an input string according to the matched substring, not the capturing group.
That is why I need to match the \s+ characters (which I currently only capture).
Here is a regex101 example or here: (\s)+(?:\d*\s*)(?=RUE|BOULEVARD|AVENUE)
I tried to extend the positive lookahead to cover the possibility that a \d+\s+ may be present before the street type, and therefore match a different \s. It hasn't worked so far.
The split function's output I wish to obtain is the following:
[7 BOULEVARD LAPIN BLANC,AVENUE MR LIEVRE,18 RUE PIERRE LAPIN]
I don't know PySpark, but I guess it supports these constructs. Split on whitespace that is not preceded by a digit but is followed by an optional number and then the type of street:
(?<!\d)\s+(?=(?:\d+\s)?(?:RUE|BOULEVARD|AVENUE))
In the demo I use a substitution with \n that simulates the split.
Demo & explanation
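As a rough check outside PySpark, the same pattern can be tried with Python's re.split. The input string below is an assumption reconstructed from the expected output in the question (the original input is not shown), and PySpark's own split function may behave differently since it uses a different regex engine.

```python
import re

# Split on whitespace that is NOT preceded by a digit and IS followed by
# an optional house number plus a street type.
SPLIT_RE = re.compile(r"(?<!\d)\s+(?=(?:\d+\s)?(?:RUE|BOULEVARD|AVENUE))")

# Hypothetical input, reconstructed from the desired output in the question.
address = "7 BOULEVARD LAPIN BLANC AVENUE MR LIEVRE 18 RUE PIERRE LAPIN"

parts = SPLIT_RE.split(address)
print(parts)
# ['7 BOULEVARD LAPIN BLANC', 'AVENUE MR LIEVRE', '18 RUE PIERRE LAPIN']
```

Because the lookbehind and lookahead are zero-width, the whole match is just the whitespace, which is what a split function needs as a delimiter.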

Capture number between two whitespaces (RegEx)

I have the following data:
SOMEDATA .test 01/45/12 2.50 THIS IS DATA
and I want to extract the number 2.50 out of this. I have managed to do this with the following RegEx:
(?<=\d{2}\/\d{2}\/\d{2} )\d+.\d+
However that doesn't work for input like this:
SOMEDATA .test 01/45/12 2500 THIS IS DATA
In this case, I want to extract the number 2500.
I can't seem to figure out a regex rule for that. Is there a way to extract something between two spaces? That is, extract the text/number after the date until the next whitespace? All I know is that the date will always have the same format, and there will always be a space after the date and then a space after the number I want to extract.
Can someone help me out with this?
Capture number between two whitespaces
A whitespace is matched with \s, and non-whitespace with \S.
So, what you can use is:
\d{2}\/\d{2}\/\d{2} +(\S+)
                      ^^^
See the regex demo
The 1+ non-whitespace symbols are captured into Group 1.
If - for some reason - you need to only get the value as a whole match, use your lookbehind approach:
(?<=\d{2}\/\d{2}\/\d{2} )\S+
Or - if you are using PCRE - you may leverage the match reset operator \K:
\d{2}\/\d{2}\/\d{2} +\K\S+
                     ^^
See another demo
NOTE: the \K and a capture group approaches allow 1 or more spaces after the date and are thus more flexible.
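A short sketch of the capture-group and lookbehind variants in Python, assuming that is the language in use (the question does not say). The function names are made up for illustration. Python's built-in re module does not support \K, so that variant is left out, and the forward slashes need no escaping there.

```python
import re

# Capture-group variant: one or more spaces after the date, value in group 1.
DATE_THEN_VALUE = re.compile(r"\d{2}/\d{2}/\d{2} +(\S+)")

# Lookbehind variant: the whole match is the value (exactly one space allowed).
AFTER_DATE = re.compile(r"(?<=\d{2}/\d{2}/\d{2} )\S+")

def value_by_group(line):
    match = DATE_THEN_VALUE.search(line)
    return match.group(1) if match else None

def value_by_lookbehind(line):
    match = AFTER_DATE.search(line)
    return match.group(0) if match else None

for line in ("SOMEDATA .test 01/45/12 2.50 THIS IS DATA",
             "SOMEDATA .test 01/45/12 2500 THIS IS DATA"):
    print(value_by_group(line), value_by_lookbehind(line))
# 2.50 2.50
# 2500 2500
```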
I see some people helped you already, but if you would want an alternative working one for some reason, here's what works too :)
.+ \d+\/\d+\/\d+ (\d+[\.\d]*)
The .+ matches everything up to the space before the date,
then \d+\/\d+\/\d+ matches the date, followed by a space.
The capturing group is the number. The [\.\d]* part is optional, so both floating-point values and whole numbers are matched. Hope this helps!
Proof: https://regex101.com/r/fY3nJ2/1
Just make the fractional part optional:
(?<=\d{2}\/\d{2}\/\d{2} )\d+(?:\.\d+)?
Demo: https://regex101.com/r/jH3pU7/1
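As a sketch of how the optional fractional part behaves, here it is in Python (an assumed language; the helper name number_after_date is hypothetical). Unlike a plain \S+, this version only accepts digits with an optional decimal part.

```python
import re

# Digits, optionally followed by a dot and more digits, right after the date.
NUMBER_AFTER_DATE = re.compile(r"(?<=\d{2}/\d{2}/\d{2} )\d+(?:\.\d+)?")

def number_after_date(line):
    match = NUMBER_AFTER_DATE.search(line)
    return match.group(0) if match else None

print(number_after_date("SOMEDATA .test 01/45/12 2.50 THIS IS DATA"))  # 2.50
print(number_after_date("SOMEDATA .test 01/45/12 2500 THIS IS DATA"))  # 2500
print(number_after_date("SOMEDATA .test 01/45/12 abc THIS IS DATA"))   # None
```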
Update following clarifications in comments:
To match anything (except spaces) that is surrounded by spaces and preceded by a date, use:
(?<=\d{2}\/\d{2}\/\d{2} )\S+
Demo: https://regex101.com/r/jH3pU7/3
Rather than capture, you can make your entire match be the target text by using a look behind:
(?<=\d\d(\/\d\d){2} )\S+
This matches the first series of non-whitespace that follows a "date like" part.
Note also the reduction in the length of the "date like" pattern. You may consider using this part of the regex in whatever solution you use.

Trying to capture two lines before a found regex

I'm trying to figure out how to capture a regex and the two lines previous to it.
Example:
Santa Claus
North Pole, North Pole
H0H 0H0
The regex I have is for the Postal Code [a-z]{1}\d{1}[a-z]{1}\s\d{1}[a-z]{1}\d{1}
I want to be able to capture that result and the previous two lines as well using one regular expression.
Does anyone have any ideas?
Thank you in advance.
You could use the following:
(.*\n.*\n[a-z]\d[a-z]\s\d[a-z]\d)
Example Here
.*\n.*\n will match all characters on the previous two lines.
[a-z]\d[a-z]\s\d[a-z]\d - I removed {1} after each character class (since only one will be matched by default, this is redundant).
You may also need to add the case-insensitive i flag since [a-z] will only match lowercase characters. Otherwise that should be replaced with [A-Za-z] to catch the capital letters in the postal codes.
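Here is a small sketch of that in Python, which is an assumption since the question does not name a language. It shows both the case-insensitive flag and what happens without it on the sample text.

```python
import re

text = "Santa Claus\nNorth Pole, North Pole\nH0H 0H0"

# Two full lines, each ended by \n, followed by the postal code.
pattern = r"(.*\n.*\n[a-z]\d[a-z]\s\d[a-z]\d)"

# With the case-insensitive flag, [a-z] also matches the capital letters.
match = re.search(pattern, text, re.IGNORECASE)
lines = match.group(1).splitlines()
print(lines)  # ['Santa Claus', 'North Pole, North Pole', 'H0H 0H0']

# Without the flag the upper-case postal code is not matched at all.
no_flag = re.search(pattern, text)
print(no_flag)  # None
```

Since . does not match a newline by default, each .* stays on its own line, which is what limits the capture to exactly two preceding lines.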

Regex possible Whitespaces and trailing characters

I have texts similar to the following (whitespaces intended), which I run a RegEx on line-by-line:
Smith-Petersen      X1l
Jonas Henry
Foord. 82a      221.
12345 Somewhere
I now want to use the RegEx to capture anything before 3 or more whitespaces occur (which might or might not occur) in the first match group. The allowed chars:
[a-zA-Z0-9,. '\-AÖÜäöüß]
What I want is : Smith-Petersen, Jonas Henry, Foord. 82a and 12345 Somewhere.
After trying desperately, I hope to find help with this here. I just can't get it to work, since my expression grabs the blanks and what follows and puts them into the first group as well. Is there a way to reverse how the RegEx matches? Can anyone help me with this?
Assuming by "may or may not occur" you mean the line may end before 3 spaces are encountered:
^\s*([-a-zA-Z0-9,\.'AÖÜäöüß ]+?)(?=\s{3}|\s{0,2}$)
This regex is using a positive look ahead to assert that either there's 3 spaces following or there's up to 2 spaces then end-of-input.
The anchor to start of input avoids matching the junk at the end of the longer lines.
Your target is in group 1.
See a live demo on rubular
Here is my approach.
^ *([a-zA-Z0-9,.'AÖÜäöüß-]+(?: {1,2}[a-zA-Z0-9,.'AÖÜäöüß-]+)*)
What you want is in match group 1. This regex uses only greedy operators and works on all four cases found in your sample text.
Basically it matches all words at the beginning of a line that are separated from one another by no more than two spaces. Once more than 2 spaces are found, the match is completed.
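Both patterns from the answers above can be checked side by side with a short Python sketch. The sample lines below assume that the gaps before "X1l" and "221." are three spaces wide, and the function names are made up for illustration.

```python
import re

# Lazy match plus lookahead (first answer).
LAZY_RE = re.compile(r"^\s*([-a-zA-Z0-9,\.'AÖÜäöüß ]+?)(?=\s{3}|\s{0,2}$)")

# Greedy words separated by at most two spaces (second answer).
GREEDY_RE = re.compile(
    r"^ *([a-zA-Z0-9,.'AÖÜäöüß-]+(?: {1,2}[a-zA-Z0-9,.'AÖÜäöüß-]+)*)"
)

def leading_text(pattern, line):
    """Return group 1 of the pattern for one line, or None if it does not match."""
    match = pattern.search(line)
    return match.group(1) if match else None

# Assumed sample: three spaces before the trailing junk.
sample = [
    "Smith-Petersen   X1l",
    "Jonas Henry",
    "Foord. 82a   221.",
    "12345 Somewhere",
]

lazy_results = [leading_text(LAZY_RE, line) for line in sample]
greedy_results = [leading_text(GREEDY_RE, line) for line in sample]
print(lazy_results)
print(greedy_results)
# Both print: ['Smith-Petersen', 'Jonas Henry', 'Foord. 82a', '12345 Somewhere']
```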

remove repeated character between words

I am trying out the quiz from Regex 101
In Task 6, the question is
Oh no! It seems my friends spilled beer all over my keyboard last night and my keys are super sticky now. Some of the time when I press a key, I get two duplicates. Can you pppllleaaaseee help me fix this? Content in bold should be removed.
I have tried this regex
([a-z])(\1{2})
But couldn't get the solution.
The solution for the riddle on that website is:
/(.)\1{2}/g
Since any key on the keyboard can get stuck, we need to use . (any character) rather than [a-z].
\1 in the regex means match whatever the 1st capturing group (.) matches.
Replacement is $1 or \1.
The rest of your regex is correct, just that there are unnecessary capturing groups.
Your regex is correct if you want to match exactly three characters. If you want to match at least three, that is
([a-z])(\1{2,})
or
([a-z])(\1\1+)
Since you don't need to capture anything but the first occurrence, these are slightly better:
([a-z])\1{2} # your original regex (exactly three occurrences)
([a-z])\1{2,}
([a-z])\1\1+
Now, the replacement should be exactly one occurrence of the character, and nothing more:
\1
Replace:
(.)\1+
with:
\1
This of course requires that your regex engine supports backreferences. Also, depending on the regex engine, \1 in the replacement may have to be written as $1.
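To compare the two replacements discussed in these answers, here is a minimal Python sketch (the function names are hypothetical). Matching exactly three repeats fixes the sticky keys, while (.)\1+ also collapses legitimate double letters.

```python
import re

def fix_sticky_keys(text):
    """Replace each run of exactly three identical characters with one."""
    return re.sub(r"(.)\1{2}", r"\1", text)

def collapse_all_repeats(text):
    """Replace ANY run of two or more identical characters with one."""
    return re.sub(r"(.)\1+", r"\1", text)

print(fix_sticky_keys("Can you pppllleaaaseee help me fix this?"))
# Can you please help me fix this?

print(fix_sticky_keys("beer"))       # beer
print(collapse_all_repeats("beer"))  # ber
```

The second function shows why (.)\1+ is too aggressive for the quiz text: ordinary words with double letters, such as "beer", would be damaged.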
I'd do it with (\w)(\1+)? but can't find out how to "remove" within the given site.
The best way would be to replace the results of the second match with empty strings.