Error trapping with regex

I have the following dataframe
ColumnA=c("Kuala Lumpur Sector 2 new","old Jakarta Sector31", "Sector 9, 7 Hong Kong","Jakarta new Sector22")
and am extracting the Sector number to a separate column
gsub(".*Sector ?([0-9]+).*","\\1",ColumnA)
Is there a more elegant way to capture errors if 'Sector' does not appear on one line than an if else statement?
If the word 'Sector' does not appear on one line I simply want to set the value of that row to blank.
I thought of using str_detect first to see if 'Sector' was there TRUE/FALSE, but this is quite an ugly solution.
Thanks for any help.

If the word 'Sector' does not appear on one line I simply want to set the value of that row to blank.
To achieve that, use the alternation operator |:
ColumnA=c("Kuala Lumpur 2 new","old Jakarta Sector31", "Sector 9, 7 Hong Kong","Jakarta new Sector22")
gsub("^(?:.*Sector ?([0-9]+).*|.*)$","\\1",ColumnA)
Result: [1] "" "31" "9" "22" (as Kuala Lumpur 2 new has no Sector, the second part with no capturing group matched the whole string).
See IDEONE demo
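For readers outside R, the same alternation trick can be sketched in Python. This is only an illustration under assumptions: the helper name sector_number is made up here, and each value is assumed to be a single line of text.

```python
import re

# Same sample values as in the answer above (the first one has no "Sector").
column_a = [
    "Kuala Lumpur 2 new",
    "old Jakarta Sector31",
    "Sector 9, 7 Hong Kong",
    "Jakarta new Sector22",
]

# First alternative captures the digits after "Sector"; the fallback
# alternative (.*) matches any other line without setting group 1.
SECTOR_RE = re.compile(r"^(?:.*Sector ?([0-9]+).*|.*)$")


def sector_number(text):
    """Return the sector digits, or "" when "Sector" is absent.

    Hypothetical helper; assumes `text` contains no line breaks.
    """
    match = SECTOR_RE.match(text)
    # group(1) is None when only the fallback alternative matched.
    return match.group(1) or ""


sectors = [sector_number(value) for value in column_a]
print(sectors)
```

The fallback branch is what removes the need for a separate if/else or str_detect step: every line matches, and lines without "Sector" simply leave the capture group empty.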

library(stringr)
as.vector(sapply(str_extract(ColumnA, "(?<=Sector\\s{0,10})([0-9]+)"),function(x) replace(x,is.na(x),'')))
I think this is what you need.

Related

How to regex extract only numbers up to the first comma or after a specific keyword?

I'm having trouble trying to regex extract the 'positions' from the following types of strings:
6 red players position 5, button 2
earn $50 pos3, up to $1,000
earn $50 pos 2, up to $500
table button 4, before Jan 21
I want to get the number that comes after 'pos' or 'position', and if there's no such keyword, get the last number before the first comma. The position value can be a number between 1 and 100. So 'position' for each of the previous rows would be:
Input text | Desired match (position)
6 red players position 5, button 2 | 5
earn $50 pos3, up to $1,000 | 3
earn $50 pos 2, up to $500 | 2
table button 4, before Jan 21 | 4
I have a big data set (in BigQuery) populated with basically those 4 types of strings.
I've already searched for this type of problem but found no solution or point to start from.
I've tried .+?(?=,) which extracts everything up to the first comma (,), but then I'm not sure how to go about extracting only the numbers from this.
I've tried (?:position|pos)\s?(\d) which extracts what I want as group 1 (by using non-capturing groups), but doesn't solve the 4th type of string.
I feel like there's a way to combine these two, but I just don't know how to get there yet.
And so, after the two things I've tried, I have two questions:
Is this possible with only regex? If so, how?
What would I need to do in SQL to make my life easier at getting these values?
I'd appreciate the help/guidance with this. Thanks a ton!

You can use
^(?:[^,]*[^0-9,])?(\d+),
See the RE2 regex demo. Details:
^ - start of string
(?:[^,]*[^0-9,])? - an optional sequence of:
[^,]* - zero or more chars other than comma
[^0-9,] - a char other than a digit and comma
(\d+) - Group 1: one or more digits
, - a comma
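As a quick sanity check, the pattern can be tried against the four sample strings from the question. This is a Python sketch, not BigQuery SQL; the helper name position is an assumption made for the example.

```python
import re

rows = [
    "6 red players position 5, button 2",
    "earn $50 pos3, up to $1,000",
    "earn $50 pos 2, up to $500",
    "table button 4, before Jan 21",
]

# Group 1 is the last run of digits directly before the first comma.
LAST_NUMBER_BEFORE_COMMA = re.compile(r"^(?:[^,]*[^0-9,])?(\d+),")


def position(text):
    """Return the captured number as an int, or None if nothing matches (hypothetical helper)."""
    match = LAST_NUMBER_BEFORE_COMMA.search(text)
    return int(match.group(1)) if match else None


positions = [position(row) for row in rows]
print(positions)
```

The pattern uses no lookarounds, which is why it is also usable in RE2-based engines.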

Use a lookahead for a comma, with a lookbehind requiring the previous char to be a space or a letter to prevent matching the “1” in “$1,000” (this needs an engine with lookaround support; RE2, which BigQuery uses, has none):
(?<=[ a-z])(\d+)(?=,)
See live demo.

Regex for values that are in between spaces

I am new to regex and having difficulty obtaining values that are caught in between spaces.
I am trying to get the values "field 1" "abc/def try" from the sample data below using just regex.
Currently I'm using (^.{18}\s+) to skip the first 18 characters, but am at a loss as to how to grab values with spaces between them.
A1234567890 field 1 abc/def try
02021051812 12 test test 12 pass
3333G132021 no test test cancel
any help/pointers will be appreciated.

If this text has fixed-width columns, you can match and trim the column values knowing the amount of chars between start of string and the column text.
For example, this regex will work for the text you posted:
^(.*?)\s*(?<=.{19})(.*?)\s*(?<=^.{34})(.*?)\s*(?<=^.{46})
See the regex demo.
So, Column 2 starts at Position 19, Column 3 starts at Position 34 and Column 4 (end of string here) is at Position 46.
However, this regex is not that efficient, and it would be really great if the data format is fixed on the provider's side.

Since it is not known whether the data always has the same length, I created the following, which gives you one group per column you might want to use:
^((\s{0,1}\S{1,})*)(\s{2,})((\s{0,1}\S{1,})*)(\s{2,})((\s{0,1}\S{1,})*)
Regex demo
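If the columns really are separated by runs of two or more spaces (which is what the pattern above relies on), splitting on those runs may be simpler than one big regex. A minimal Python sketch; the spacing in the sample line is an assumption, since the original column gaps are not visible in the post.

```python
import re

# Assumed spacing: the runs of spaces below are a guess for illustration only.
line = "A1234567890       field 1        abc/def try"

# Split wherever two or more whitespace characters occur in a row;
# single spaces inside a value ("field 1") are left alone.
columns = re.split(r"\s{2,}", line.strip())
print(columns)
```

This breaks down if a column is empty or if two values are ever separated by a single space, in which case the fixed-width approach is the safer one.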

Regex to match some dates matching non-dates

I'm using some Regex to find date strings of the form Jan 12, 2015 or Feb 3, 1999.
The regex I'm using is \w+\s\d{1,2},\s\d{4} and it's working correctly, but the thing is that on the file are also some strings with the form:
Weg 58, 4047 or Strasse 1, 4482 and I also match them.
How can I avoid those non-date matches? My approach is:
The first string (the one of the month, Jan, Feb, etc.) has to have always length 3.
The year has to start with 1 or 2.
The thing is that I don't know how to add these two conditions to my regex. Any help, please?
You can make the test right here: https://regex101.com/r/bN2pO0/1
Thanks in advance.

Since the month names won't change (i.e. they are always January to December), we can list their first 3 characters.
We can then use an OR operator | to select years starting with 1 or 2:
/((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{1,2},\s(1|2)\d{3})/ig
https://regex101.com/r/bN2pO0/3

Just as you used \d{1,2} to match a digit 1 or 2 times and \d{4} to match a digit 4 times, you can use \w{3} to match a word character 3 times.
For the year, you can use the pipe "or" operator |.
\w{3}\s\d{1,2},\s(?:1|2)\d{3}
Although, this will also match non-dates of form Abc xy, 1xyz
If you want, you can go with a brute-force approach, or drop the regex and use code to capture the dates.
Brute force:
(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s[0-3]?[0-9],\s[12]\d{3}
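To see the month-name alternation reject the address-like strings, here is a small Python sketch. The sample text is stitched together from the strings in the question, and the \b anchors are an addition of this sketch, not part of the answers.

```python
import re

text = "Jan 12, 2015 and Feb 3, 1999, but also Weg 58, 4047 or Strasse 1, 4482"

# Month abbreviation, 1-2 digit day, comma, 4-digit year starting with 1 or 2.
# The \b anchors are an extra safeguard added for this sketch.
DATE_RE = re.compile(
    r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{1,2},\s[12]\d{3}\b"
)

dates = DATE_RE.findall(text)
print(dates)
```

Because every group is non-capturing, findall returns the whole matched date strings rather than tuples of groups.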

Regex to add leading zero in date record

Question - what is the shortest form of regex to add a leading zero into single digit in date record?
So I want to convert 8/8/2014 8:04:34 to 08/08/2014 8:04:34 - add leading zero when only one digit is presented.
A record can have two single-digit entries, one, or none. Some records can look like 25/06/2014 19:50:18 or 9/06/2014 8:27:35; in other words, some of them may already be normalized, and the regex needs to fix only the single-digit entries.
Not a regex user by any means. Your help is appreciated.

How about:
Ctrl+H
Find what: \b(\d)(?=/)
Replace with: 0$1
Replace all
This will change 8/8/2014 8:04:34 into 08/08/2014 8:04:34
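The same find/replace pair can be checked outside the editor. A Python sketch, where the function name pad_date_parts is made up for the example and the inputs are the records quoted in the question:

```python
import re


def pad_date_parts(record):
    """Prefix a 0 to any lone digit that is directly followed by a slash (hypothetical helper)."""
    # \b ensures the digit is not the tail of a longer number such as "25".
    return re.sub(r"\b(\d)(?=/)", r"0\1", record)


print(pad_date_parts("8/8/2014 8:04:34"))
print(pad_date_parts("25/06/2014 19:50:18"))
print(pad_date_parts("9/06/2014 8:27:35"))
```

The time part is left untouched because its digits are followed by a colon, not a slash.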

Use the following regex to find:
(\d)(\d)?/(\d)(\d)?/(.*)
Then use the following to replace:
(?{2}\1\2:0\1)/(?{4}\3\4:0\3)/\5
What we are using here is called conditionals in regex terms. Refer to this answer for an explanation.
Make sure you have unselected the checkbox which says ". matches newline".

First of all, let's do some test-driven development and write the test cases. We can ignore the time and concentrate on the date alone. Also, the year is not important. We have to find all the possible cases for the day and the month. For each of them, we can have:
A single digit
Two digits, the first of which is already a 0
Two digits, the first of which is not a 0
Two digits, the second of which is a 0 (probably not needed, but just in case).
The case where we have to do something is only the first one, and the last 3 could be joined into a single one, but I prefer to keep them separated. We need to test 16 combinations:
8/8/2014
8/08/2014
8/12/2014
8/10/2014
08/8/2014
08/08/2014
08/12/2014
08/10/2014
12/8/2014
12/08/2014
12/12/2014
12/10/2014
10/8/2014
10/08/2014
10/12/2014
10/10/2014
Of all of these, only 1, 2, 3, 4, 5, 9, 13 must be changed. I don't know how to do it with a single regex, but with 2 regexes it's easy:
First regex, for the day:
(?<!\d)(\d/\d{1,2}/\d+)
replace with:
0\1
It matches a date where the day has only one digit, followed by a month with either 1 or 2 digits, followed by a year with any number of digits, and it simply adds a 0 at the beginning.
Second regex, for the month:
(\d{2}/)(\d/\d+)
replace with:
\10\2
This one assumes that the first one has already been run, and thus the day has 2 digits. It finds dates where the month has a single digit, and adds a 0 before it. Please note that \10\2 means: the first group that matched, followed by a 0, followed by the second group. It doesn't mean: the tenth group, followed by the second. So the digits 1 and 0 are logically separated.
Run the first one, then the second one, and it gives the correct result:
08/08/2014
08/08/2014
08/12/2014
08/10/2014
08/08/2014
08/08/2014
08/12/2014
08/10/2014
12/08/2014
12/08/2014
12/12/2014
12/10/2014
10/08/2014
10/08/2014
10/12/2014
10/10/2014
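The two passes and the 16 test cases can be run as an actual test. A Python sketch under one assumption about syntax: Python reads \10 as group 10, so the second replacement is written \g<1>0\2 to keep the group number and the literal 0 apart. The function name normalize is hypothetical.

```python
import re

cases = [
    "8/8/2014", "8/08/2014", "8/12/2014", "8/10/2014",
    "08/8/2014", "08/08/2014", "08/12/2014", "08/10/2014",
    "12/8/2014", "12/08/2014", "12/12/2014", "12/10/2014",
    "10/8/2014", "10/08/2014", "10/12/2014", "10/10/2014",
]


def normalize(date):
    """Apply the first-field regex, then the second-field regex (hypothetical helper)."""
    # Pass 1: a single-digit first field gets a leading 0.
    date = re.sub(r"(?<!\d)(\d/\d{1,2}/\d+)", r"0\1", date)
    # Pass 2: a single-digit second field gets a leading 0.
    # \g<1>0 means "group 1, then a literal 0", not group 10.
    return re.sub(r"(\d{2}/)(\d/\d+)", r"\g<1>0\2", date)


results = [normalize(case) for case in cases]
for before, after in zip(cases, results):
    print(before, "->", after)
```

The order matters: the second pattern expects a two-digit first field, which only holds once the first pass has run.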

Thanks to this recent answer I finally can give you an (hopefully) correct answer ;)
Replace
\b(?:(\d\d)|(\d))/(?:(\d\d)|(\d))/(\d\d)
with
(?{1}\1:0$2)/(?{3}\3:0\4)/\5
It uses Notepad++ conditionals (which I didn't know of until I stumbled over the mentioned question) to handle the cases where only one or the other is a single digit.
The regex matches a word boundary \b followed by two digits, captured in group 1, or one digit, captured in group 2, followed by a /. Then the same logic is repeated for the second number, which is captured in group 3 (2 digits) or 4 (1 digit). Finally it checks that a year follows (at least two digits).
The conditional replace is explained in the linked answer. Simply put, (?{1} tests whether group 1 matched; if it did, the expression before the : is used as the replacement, otherwise the one after.
Hope this helps.
Regards

If you had a date like (ISO format)
2017-9-5
This
replace(/(\D)(\d)(?!\d)/g, '$10$2')
will turn it into
2017-09-05
and will preserve two digits in dates like
2017-11-11 or 2017-9-05
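For comparison, a rough Python equivalent of that JavaScript replace. The name pad_iso is an assumption for the example, and \g<1>0 is used because Python would otherwise read \10 as group 10.

```python
import re


def pad_iso(date):
    """Pad any single digit that follows a non-digit and is not followed by another digit."""
    return re.sub(r"(\D)(\d)(?!\d)", r"\g<1>0\2", date)


print(pad_iso("2017-9-5"))
print(pad_iso("2017-11-11"))
print(pad_iso("2017-9-05"))
```

Since the pattern needs a non-digit in front of the digit, a single digit at the very start of the string would not be padded.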

A general approach is to search for (in this case, 5-digit numbers):
(\d)??(\d)??(\d)??(\d)??(\d)
Replace with
(?1\1:0)(?2\2:0)(?3\3:0)(?4\4:0)\5

You can use /^\d\/|(?<=\/)\d\/\d/g to select text, then add 0 before selected text, it should work for all your conditions.

Regex parse with alteryx

One of the columns has the data as below and I only need the suburb name, not the state or postcode.
I'm using Alteryx and tried regex (\<\w+\>)\s\<\w+\> but only get a few records to the new column.
Input:
CABRAMATTA
CANLEY HEIGHTS
ST JOHNS PARK
Parramatta NSW 2150
Claymore 2559
CASULA
Output
CABRAMATTA
CANLEY HEIGHTS
ST JOHNS PARK
Parramatta
Claymore
CASULA

This regex matches all letter-words up to but not including an Australian state abbreviation (since the addresses are clearly Australian):
( ?(?!(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)\b)\b[a-zA-Z]+)+
See demo
The negative look ahead includes a word boundary to allow suburbs that start with a state abbreviation (see demo).
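The pattern can be tried on the sample input from the question. This is a Python sketch rather than an Alteryx formula, and the helper name suburb is made up for the example.

```python
import re

addresses = [
    "CABRAMATTA",
    "CANLEY HEIGHTS",
    "ST JOHNS PARK",
    "Parramatta NSW 2150",
    "Claymore 2559",
    "CASULA",
]

# Consecutive letter-only words, stopping before a state abbreviation.
SUBURB_RE = re.compile(r"( ?(?!(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)\b)\b[a-zA-Z]+)+")


def suburb(address):
    """Return the leading suburb words, or "" if the string does not start with one."""
    match = SUBURB_RE.match(address)
    # group(0) is the whole match, i.e. every word the repeated group consumed.
    return match.group(0) if match else ""


suburbs = [suburb(address) for address in addresses]
print(suburbs)
```

The postcode is dropped without any extra rule because digits are not in [a-zA-Z], so the repetition stops at "2559" on its own.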

Expanding on Bohemian's answer, you can use capture groups with REGEX_Replace in Alteryx. So:
REGEX_Replace([Field1], "(.*)(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)+(\s*\d+)" , "\1")
This keeps whatever the first group matched (the suburb). The second and third groups match the state and the postcode. Not a perfect regex, but it should get you most of the way there.

I think this workflow will help you :
