Simple Regex with Date - regex

I'm using Dreamweaver to replace about 1,000 instances of page titles that have a similar format:
5 5 2016 Nice tasty halibut
5 19 2016 A good king salmon and halibut day
...
I'd like the date to be formatted like:
5-5-2016 Nice tasty halibut
5-19-2016 A good king salmon and halibut day
I tried several ways of using Regular Expressions to fix this, but couldn't get the replaced value with the desired format. Can anyone help me out here?

I suggest using ([0-9]{1,2})\s+([0-9]{1,2})\s+([0-9]{4}).
This is a stricter regex. Essentially you are capturing 3 groups of numbers, where:
Group 1 is 1 or 2 digits.
Group 2 is likewise 1 or 2 digits.
Group 3 is exactly 4 digits, for the year.
The rest of the string is not part of the match, so it is left unchanged.
And \s+ means 1 or more whitespace characters.
Then replace with $1-$2-$3 to join the 3 groups back together with hyphens.
See:
https://regex101.com/r/wO3wD6/1
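To try the pattern outside Dreamweaver, here is a minimal Python sketch. It is an illustration under assumptions, not Dreamweaver syntax: Python's re.sub writes backreferences as \1 rather than $1, and the helper name fix_title_date is made up for this example.

```python
import re

# Two 1-2 digit numbers and a 4-digit year, separated by whitespace.
DATE_WITH_SPACES = re.compile(r"([0-9]{1,2})\s+([0-9]{1,2})\s+([0-9]{4})")

def fix_title_date(title: str) -> str:
    """Join the three captured numbers with hyphens (hypothetical helper)."""
    # Python spells backreferences \1, \2, \3 where Dreamweaver uses $1, $2, $3.
    # Text after the year is not matched, so it passes through unchanged.
    return DATE_WITH_SPACES.sub(r"\1-\2-\3", title, count=1)

print(fix_title_date("5 5 2016 Nice tasty halibut"))
# 5-5-2016 Nice tasty halibut
print(fix_title_date("5 19 2016 A good king salmon and halibut day"))
# 5-19-2016 A good king salmon and halibut day
```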

Search for ([0-9]+) ([0-9]+) (.*) and replace it with $1-$2-$3.
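A quick way to see how permissive this shorter pattern is: the Python sketch below applies it to one of the titles from the question and to a made-up title that contains two numbers but no date. The second title and the helper name loose_replace are assumptions for illustration only, and \1 is Python's spelling of $1.

```python
import re

# Any two space-separated numbers, then the rest of the line.
LOOSE = re.compile(r"([0-9]+) ([0-9]+) (.*)")

def loose_replace(title: str) -> str:
    """Apply the shorter search/replace once (hypothetical helper)."""
    return LOOSE.sub(r"\1-\2-\3", title, count=1)

print(loose_replace("5 5 2016 Nice tasty halibut"))
# 5-5-2016 Nice tasty halibut
print(loose_replace("Caught 12 20 pound halibut"))
# Caught 12-20-pound halibut  (no date, but still rewritten)
```

If every title really starts with a date, the shorter pattern is enough; otherwise a pattern that insists on a 4-digit year is the safer choice.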

Related

How to regex extract only numbers up to the first comma or after a specific keyword?

I'm having trouble trying to regex extract the 'positions' from the following types of strings:
6 red players position 5, button 2
earn $50 pos3, up to $1,000
earn $50 pos 2, up to $500
table button 4, before Jan 21
I want to get the number that comes after 'pos' or 'position', and if there's no such keyword, get the last number before the first comma. The position value can be a number between 1 and 100. So 'position' for each of the previous rows would be:
Input text                            Desired match (position)
6 red players position 5, button 2    5
earn $50 pos3, up to $1,000           3
earn $50 pos 2, up to $500            2
table button 4, before Jan 21         4
I have a big data set (in BigQuery) populated with basically those 4 types of strings.
I've already searched for this type of problem but found no solution or point to start from.
I've tried .+?(?=,) which extracts everything up to the first comma (,), but then I'm not sure how to go about extracting only the numbers from this.
I've tried (?:position|pos)\s?(\d) which extracts what I want in group 1 (using a non-capturing group for the keyword), but doesn't solve the 4th type of string.
I feel like there's a way to combine these two, but I just don't know how to get there yet.
And so, after the two things I've tried, I have two questions:
Is this possible with only regex? If so, how?
What would I need to do in SQL to make my life easier at getting these values?
I'd appreciate the help/guidance with this. Thanks a ton!
You can use
^(?:[^,]*[^0-9,])?(\d+),
The pattern is RE2-compatible. Details:
^ - start of string
(?:[^,]*[^0-9,])? - an optional sequence of:
  [^,]* - zero or more chars other than a comma
  [^0-9,] - a char other than a digit and a comma
(\d+) - Group 1: one or more digits
, - a comma
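As a sanity check, the pattern can be run against the four sample strings from the question. This is a minimal sketch using Python's re module rather than BigQuery; the function name extract_position is hypothetical. In BigQuery the same pattern would presumably be passed to a function such as REGEXP_EXTRACT, which is not shown here.

```python
import re
from typing import Optional

# Last run of digits that sits directly before the first comma.
LAST_NUMBER_BEFORE_COMMA = re.compile(r"^(?:[^,]*[^0-9,])?(\d+),")

def extract_position(text: str) -> Optional[int]:
    """Return the captured number, or None if nothing matches."""
    m = LAST_NUMBER_BEFORE_COMMA.match(text)
    return int(m.group(1)) if m else None

samples = [
    "6 red players position 5, button 2",
    "earn $50 pos3, up to $1,000",
    "earn $50 pos 2, up to $500",
    "table button 4, before Jan 21",
]
for s in samples:
    print(s, "->", extract_position(s))  # 5, 3, 2, 4 respectively
```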
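Alternatively, in an engine that supports lookarounds (RE2, which BigQuery uses, does not), use a lookahead for a comma with a lookbehind requiring the previous char to be a space or a lowercase letter, which prevents matching the “1” in “$1,000”: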
(?<=[ a-z])(\d+)(?=,)
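The lookaround version can be tried in any engine that supports lookbehind and lookahead. Below is a small sketch with Python's re module, which does; the function name extract_position_lookaround is made up for this example.

```python
import re
from typing import Optional

# Digits preceded by a space or lowercase letter and followed by a comma.
NUMBER_BEFORE_COMMA = re.compile(r"(?<=[ a-z])(\d+)(?=,)")

def extract_position_lookaround(text: str) -> Optional[int]:
    """Return the first number that satisfies both lookarounds, else None."""
    m = NUMBER_BEFORE_COMMA.search(text)
    return int(m.group(1)) if m else None

print(extract_position_lookaround("earn $50 pos3, up to $1,000"))  # 3
print(extract_position_lookaround("up to $1,000"))                 # None
```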

Match street number from different formats without suffixes

We have a "street_number" field that has been filled in freely over the years and that we now want to normalize. Using regular expressions, we'd like to extract the real "street_number" and the "street_number_suffix".
Ex: for 17 b, "street_number" would be 17 and "street_number_suffix" would be b.
As there are a dozen different patterns, I'm having trouble tuning the regular expression correctly. I'm considering using 2 different regexes, one to extract the "street_number" and another to extract the "street_number_suffix".
Here's an exhaustive set of the patterns we'd like to format, with the expected output:
# Extract street_number using PCRE
input    street_number    street_number_suffix
19-21    19               null
2 G      2                G
A        null             A
1 bis    1                bis
3 C      3                C
N°10     10               null
17 b     17               b
76 B     76               B
7 ter    7                ter
9/11     9                null
21.3     21               3
42       42               null
I know I could use an expression that matches any digits up to a hyphen using \d+(?=\-).
It could be extended to match up to a hyphen OR a slash using \d+(?=\-|\/), though once I add \s to this pattern, 21 from 19-21 will match. Adding conditions may not be that simple, which is why I'm asking for your help.
Could anyone give me a hand with this? If it helps, here's a draft: https://regex101.com/r/jGK5Sa/4
Edit: at the time I'm editing, here's the closest regex I could find:
(?:(N°|(?<!\-|\/|\.|[a-z]|.{1})))\d+
Though the full match for N°10 isn't 10 but N°10 (and our ETL doesn't support capturing groups, so I can't use /......(\d+)/).
To get the street numbers, you could update the pattern to:
(?<![-/.a-z\d])\d+
Explanation
(?<! Negative lookbehind: assert that the char directly to the left is not
[-/.a-z\d] any of the listed chars (a character class)
) Close the negative lookbehind
\d+ Match 1+ digits
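To check the pattern against the inputs listed in the question, here is a minimal sketch using Python's re module (the question targets PCRE; Python accepts this lookbehind as well because the character class has a fixed width). The helper name street_number is hypothetical, and only the full match is used, with no capturing group.

```python
import re
from typing import Optional

# Digits not preceded by -, /, ., a lowercase letter, or another digit.
STREET_NUMBER = re.compile(r"(?<![-/.a-z\d])\d+")

def street_number(raw: str) -> Optional[str]:
    """Return the first full match, or None when there is no number."""
    m = STREET_NUMBER.search(raw)
    return m.group(0) if m else None

inputs = ["19-21", "2 G", "A", "1 bis", "3 C", "N°10",
          "17 b", "76 B", "7 ter", "9/11", "21.3", "42"]
for raw in inputs:
    print(repr(raw), "->", street_number(raw))
```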

Obtain groups only in regex

Hello, I'm trying to get only the time groups, such as these 3 strings:
'6 hrs'
'16 mins'
'5 hrs 30 mins'
from the text:
6 hrs blah blah 16 mins blah
blah 5 hrs 30 mins xyz
I tried /(\d+ hrs)?( \d+ mins)?/gm,
but I get 32 matches instead. Please see https://regex101.com/r/xmoMKz/4
How can I get only the groups I want, without the other matches? I don't even know what the other matches are. Positions?
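If you make the groups non-capturing with ?:, you should be able to get what you want. Because both groups in your pattern are optional, it can also match the empty string at every position, which is where the extra matches come from. So rather than have both subgroups be optional, it’s better to make the first one required and let it accept either hrs or mins. So try: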
(?:\d+ (?:hrs|mins))(?: \d+ mins)?
Regex101: https://regex101.com/r/nQbrb0/1
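The same pattern can be checked quickly in Python. This is only a sketch with the sample text from the question; because the pattern has no capturing groups, re.findall returns the full matches.

```python
import re

# A required "N hrs" or "N mins", optionally followed by " N mins".
TIME_SPAN = re.compile(r"(?:\d+ (?:hrs|mins))(?: \d+ mins)?")

text = "6 hrs blah blah 16 mins blah\nblah 5 hrs 30 mins xyz"

matches = TIME_SPAN.findall(text)
print(matches)  # ['6 hrs', '16 mins', '5 hrs 30 mins']
```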

Trouble with extracting times from schedule using regex

I am programming a function that will extract the different times from a schedule using regular expressions in Python. Here is an example of the schedule that I got from a website using BeautifulSoup:
Interactive talk with discussion17:00-18:00 Documentary ‘Occupy Gezi’
We present to you Taksim Gezi park boycott with all ways; day and
night, with good sides and bad sides18.00 - 19:00 Poet Maria van
Daalen ‘Haitian Vodoo’, poet from Querido publishers19:00
Food20:30-22:30
As shown above, the input text has starting times with and without ending times. There is also inconsistency with using either “:” or “.” when separating the hours from the minutes.
Using regex101, I have made the following (very ugly) regular expression that seems to work on all different times: \d\d[:|.]\d\d(\s*.\s*\d\d[:|.]\d\d)?
To search the text in Python I use the following code:
import re

def extract_times(string):
    list_of_times = re.findall(r'\d\d[:|.]\d\d(\s*.\s*\d\d[:|.]\d\d)?', string)
    return list_of_times
However, when I put the example text from above in this function, it returns this:
['-18:00', ' - 19:00', '', '-22:30']
I expected something like ['17:00-18:00'], ['19:00'].
What have I done wrong?
Use this one: \d{1,2}[:.]([\d\s-]+[:.])?\d{2}
Explanation
\d{1,2} one or two digits, to match both 1:00 and 01:00
[:.] either separator, to match both 18:00 and 18.00
([\d\s-]+[:.])? an optional group that bridges to the end time:
  [\d\s-]+ one or more digits, whitespace chars, or dashes
  [:.] the separator of the end time
\d{2} 2 digits (the minutes of the last time matched)
In your sample text, the following will match (use the full match, not the group):
Match 1 17:00-18:00
Match 2 18.00 - 19:00
Match 3 19:00
Match 4 20:30-22:30
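In Python, re.findall returns the contents of a capturing group when the pattern has one, which is what produced the partial results in the question. A minimal sketch that gets the full matches instead makes the optional group non-capturing with (?:...). The SCHEDULE string below is an abbreviated version of the sample text, shortened for illustration.

```python
import re

# Start time, optionally followed by "-end time"; (?:...) keeps findall
# returning the whole match instead of only the optional part.
TIME_OR_RANGE = re.compile(r"\d{1,2}[:.](?:[\d\s-]+[:.])?\d{2}")

def extract_times(text: str) -> list:
    """Return every time or time range found in the text."""
    return TIME_OR_RANGE.findall(text)

# Abbreviated sample schedule (assumption: line breaks as in the question).
SCHEDULE = (
    "Interactive talk with discussion17:00-18:00 Documentary\n"
    "night, with good sides and bad sides18.00 - 19:00 Poet Maria van\n"
    "Daalen, poet from Querido publishers19:00\n"
    "Food20:30-22:30"
)

print(extract_times(SCHEDULE))
# ['17:00-18:00', '18.00 - 19:00', '19:00', '20:30-22:30']
```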

Regex to match some dates matching non-dates

I'm using some Regex to find date strings of the form Jan 12, 2015 or Feb 3, 1999.
The regex I'm using is \w+\s\d{1,2},\s\d{4} and it matches the dates correctly, but the file also contains strings of the form
Weg 58, 4047 or Strasse 1, 4482, and those match as well.
How can I avoid those non-date matches? My approach is:
The first string (the one of the month, Jan, Feb, etc.) has to have always length 3.
The year has to start with 1 or 2.
The thing is that I don't know how to add these two conditions to my regex. Any help, please?
You can make the test right here: https://regex101.com/r/bN2pO0/1
Thanks in advance.
Since the month names won't change (i.e., they are always January through December), we can list their first 3 characters explicitly.
We can then use the OR operator | to select years starting with 1 or 2:
/((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{1,2},\s(1|2)\d{3})/ig
https://regex101.com/r/bN2pO0/3
Just as you used \d{1,2} to match a digit 1 or 2 times and \d{4} to match a digit 4 times, you can use \w{3} to match a word character 3 times.
For the year, you can use the pipe "or" operator |.
\w{3}\s\d{1,2},\s(?:1|2)\d{3}
However, this will also match non-dates of the form Abc xy, 1xyz (where x, y, and z are digits).
If you want, you can go with a brute-force approach, or just drop the regex and use code to capture the dates.
Brute force:
(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s[0-3]?[0-9],\s[12]\d{3}
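A short Python sketch of the month-list approach, run against the date and non-date strings from the question. Two details here are assumptions for illustration: the day part uses [0-3]? for the optional leading digit so that 30 and 31 are accepted (a leading digit limited to 0-2 would reject them), and "Dec 31, 2020" is an extra sample that does not appear in the question.

```python
import re

# Three-letter month, 1-2 digit day, comma, year starting with 1 or 2.
DATE = re.compile(
    r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
    r"\s[0-3]?[0-9],\s[12]\d{3}"
)

text = "Jan 12, 2015 Weg 58, 4047 Feb 3, 1999 Strasse 1, 4482 Dec 31, 2020"

dates = DATE.findall(text)
print(dates)  # ['Jan 12, 2015', 'Feb 3, 1999', 'Dec 31, 2020']
```

Listing the months costs a longer pattern but rules out arbitrary three-letter words, which a generic \w{3} cannot do.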