Open Refine regex for alphabets


I want to extract only the alphabetic characters from my cell.
What I have done:
value.match(/.*?(\^[a-zA-Z]*$).*?/)
but it returns null.
I am trying to clean the address column in my data set. The following are sample addresses:
H3656 GALI#4 BLOCK-D, AREA 1
H#36/17 SECTOR 5D AREA 2
AREA 3 BLOCK-B NORTH NAZIMABAD
GERMANY AL JANNAT BENQUET SECTOR 16 Area 2 with short name
So, as a first step, I am trying to remove all the numbers from my strings.

If you want to remove all the numbers, the most direct approach is probably:
value.replace(/\d+/, "")
If for any reason you want to find only the alphabetic characters, as indicated by the title of your question, this will be more effective than a value.match() :
value.find(/\p{L}\s?/).join("")
(\p{L} is a Java regular expression character class (OpenRefine is written in Java). It is similar to [a-zA-Z], but it also matches Unicode letters such as accented letters.)
In general, you should avoid using the .match() method unless you know exactly what you are doing. In 90% of cases, it is actually .find() that is desired.
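For readers who want to try the same two ideas outside OpenRefine, here is a minimal Python sketch. It is not GREL: the function names are made up for illustration, and because Python's standard re module does not support \p{L}, the sketch assumes the common idiom [^\W\d_] as a stand-in for "any Unicode letter".

```python
import re


def remove_digits(text: str) -> str:
    # Delete every run of digits, similar in spirit to the GREL replace above.
    return re.sub(r"\d+", "", text)


def letters_only(text: str) -> str:
    # [^\W\d_] = a word character that is neither a digit nor an underscore,
    # i.e. a letter. Each letter may carry one trailing whitespace character,
    # mirroring the \p{L}\s? pattern used with find().
    return "".join(re.findall(r"[^\W\d_]\s?", text))


if __name__ == "__main__":
    for address in ["H3656 GALI#4 BLOCK-D, AREA 1", "H#36/17 SECTOR 5D AREA 2"]:
        print(repr(remove_digits(address)), repr(letters_only(address)))
```

Note that removing digits leaves punctuation such as # and / in place, while the letters-only version drops it, so the two approaches are not interchangeable.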

Related

Is there a method for dividing an Address string into 3 separate strings using regex

I am currently working on a project that requires me to divide an address into its street number, its street name, and if it has a suite, into its suite name.
EX: 1360 WHITE OAK RD STE F -----> 1360 | White Oak RD | STE F
I am currently using Google Sheets and its =REGEXEXTRACT() function, which uses regex to parse the string into different columns. This is how I am currently separating the number and the street (given that the full address is in column B):
=ArrayFormula(REGEXEXTRACT(B1:B,"[0-9]*")) ---->gets the number EX:(1360)
=ArrayFormula(REGEXEXTRACT(B1:B," [a-zA-Z0-9 ]+")) ---->gets the street address including the suite number with a white space at the beginning EX:( WHITE OAK RD STE F)
What I am struggling with is how to remove the white space from the result of the 2nd formula and also prevent it from capturing the suite text (which always starts with STE). Lastly, what would be a formula for grabbing the suite text and number?
Thanks and I appreciate any help you can give!

The formulas provided by MonkeyZeus work perfectly, with no issues whatsoever.
If you want your results in adjacent columns, though, you can use a single formula on every row, like
=SPLIT(REGEXREPLACE(B1,"([0-9]+) (.+) (STE.*)","$1♣︎$2♣︎$3"),"♣︎")
Or even use an Arrayformula to get your results for an entire column
=ArrayFormula(IFERROR(SPLIT(REGEXREPLACE(B1:B,"([0-9]+) (.+) (STE.*)","$1♣︎$2♣︎$3"),"♣︎")))
What the formula does:
Using parentheses (), we divide the text into 3 groups: $1, $2, $3.
With $1♣︎$2♣︎$3 we insert the character ♣︎ between the groups (it could be any character that does not interfere with the formula) to prepare our text for the SPLIT function.
We then split the text, now separated into groups, into adjacent columns wherever ♣︎ is found.
The ArrayFormula applies all of the above to every single row in column B, while IFERROR makes sure we don't get any errors (for example, when empty cells are found).
Functions used:
ArrayFormula
IFERROR
SPLIT
REGEXREPLACE
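The same capture-group idea can be sketched in Python for anyone who wants to test it outside Google Sheets. This is only an illustration under stated assumptions: split_address and ADDRESS_RE are hypothetical names, and the pattern is a small variation on the one above that makes the suite optional, so addresses without STE still split.

```python
import re

# Group 1: street number, group 2: street name, group 3: optional suite.
# The lazy (.+?) lets the optional " STE..." tail be captured separately.
ADDRESS_RE = re.compile(r"^([0-9]+) (.+?)(?: (STE.*))?$")


def split_address(address: str):
    """Return (number, street, suite) or None if the address does not fit."""
    match = ADDRESS_RE.match(address.strip())
    if match is None:
        return None
    number, street, suite = match.groups()
    return number, street, suite or ""


if __name__ == "__main__":
    print(split_address("1360 WHITE OAK RD STE F"))
```

Like the spreadsheet formula, this treats any word beginning with STE after a space as the start of the suite, so a street name such as "STEVENS" would be misread; real data may need a stricter pattern.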

For Google Sheets you could use the following 3 formulas:
=REGEXEXTRACT(B1,"^[0-9]*")
=REGEXREPLACE(B1,"^[0-9\s]*|\s*STE.*$", "")
=REGEXEXTRACT(B1,"STE.*$")
I would have used lookbehinds but they are not universally supported in all browsers (yet).
I'm not a Google Sheets expert so I've opted to remove ArrayFormula and replace the B1:B with just B1 since they seemed superfluous.

Regex for UK registration number

I've been playing with creating a regular expression for UK registration numbers but have hit a wall when it comes to restricting overall length of the string in question. I currently have the following:
^(([a-zA-Z]?){1,3}(\d){1,3}([a-zA-Z]?){1,3})
This allows for an optional string (lower or upper case) of between 1 and 3 characters, followed by a mandatory numeric of between 1 and 3 characters and finally, a mandatory string (lower or upper case) of between 1 and 3 characters.
This works fine but I then want to apply a max length of 7 characters to the entire string but this is where I'm failing. I tried adding a 1,7 restriction to the end of the regex but the three 1,3 checks are superseding it and therefore allowing a max length of 9 characters.
Examples of registration numbers that need to pass are as follows:
A1
AAA111
AA11AAA
A1AAA
A11AAA
A111AAA
In the examples above, the A's represent any letter, upper or lower case, and the 1's represent any digit. The max length is the only restriction that appears not to be working. I disable the entry of spaces, so they can be assumed never to be present in the string.

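If you know what lengths you are after, I'd recommend you use the .length property (or its equivalent) that most languages expose for string length. If this is not an option, you could add a lookahead that checks the whole string is between 1 and 7 characters long: ^(?=.{1,7}$)(([a-zA-Z]?){1,3}(\d){1,3}([a-zA-Z]?){1,3})$. Note the $ inside the lookahead; without it the lookahead only checks that at least one character is present and does not cap the length.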
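As a rough check of the lookahead approach, here is a small Python sketch. The names PLATE_RE and is_valid_plate are invented for the example, and the groups are written as [a-zA-Z]{0,3} rather than ([a-zA-Z]?){1,3}, which should accept the same strings with less backtracking.

```python
import re

# (?=.{1,7}$) caps the total length at 7; the rest is letters, digits, letters.
PLATE_RE = re.compile(r"^(?=.{1,7}$)[a-zA-Z]{0,3}\d{1,3}[a-zA-Z]{0,3}$")


def is_valid_plate(text: str) -> bool:
    """True if text fits the letter/digit/letter shape and is at most 7 long."""
    return PLATE_RE.match(text) is not None


if __name__ == "__main__":
    for plate in ["A1", "AAA111", "AA11AAA", "A1AAA", "A11AAA", "A111AAA", "AAA111AAA"]:
        print(plate, is_valid_plate(plate))
```

The plain length check suggested above works just as well: test len(text) <= 7 first and keep the regex free of the lookahead.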

How to programmatically learn regexes?

My question is a continuation of this one. Basically, I have a table of words like so:
HAT18178_890909.098070313.1
HAT18178_890909.098070313.2
HAT18178_890909.143412462.1
HAT18178_890909.143412462.2
For my purposes, I do not need the terminal .1 or .2 for this set of names. I can manually write the following regex (using Python syntax):
r = re.compile('(.*\.\d+)\.\d+')
However, I cannot guarantee that my next set of names will have a similar structure where the final 2 characters will be discardable - it could be 3 characters (i.e. .12) and the separator could change as well (i.e. . to _).
What is the appropriate way to either explicitly learn a regex or to determine which characters are unnecessary?

It's an interesting problem.
X                              y
HAT18178_890909.098070313.1    HAT18178_890909.098070313
HAT18178_890909.098070313.2    HAT18178_890909.098070313
HAT18178_890909.143412462.1    HAT18178_890909.143412462
HAT18178_890909.143412462.2    HAT18178_890909.143412462
The problem is that there is not a single solution but many.
Even for a human it is not clear what the regex should be that you want.
Based on this data, I would think the possibilities to learn are:
Just match a fixed width of 25: .{25}
Fixed first part: HAT18178_890909.
Then:
There's only 2 varying numbers on each single spot (as you show 2 cases).
So e.g. [01] (either 0 or 1), [94] the next spot and so on would be a good solution.
The obvious one would be \d+
But it could also be \d{9}
You see, there are multiple correct answers.
These regexes would still work if the second dot were an underscore instead.
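To make the "multiple correct answers" point concrete, the following Python sketch runs several of the candidate patterns listed above against the four sample names. The variable and function names are made up for the example, and the candidates are written out in full as one possible reading of the list.

```python
import re

names = [
    "HAT18178_890909.098070313.1",
    "HAT18178_890909.098070313.2",
    "HAT18178_890909.143412462.1",
    "HAT18178_890909.143412462.2",
]
expected = [name[:25] for name in names]  # the "y" column above

candidates = [
    r".{25}",                              # fixed width
    r"HAT18178_890909\.\d+",               # fixed prefix, any digits
    r"HAT18178_890909\.\d{9}",             # fixed prefix, exactly 9 digits
    r"HAT18178_890909\.[01][94]\d{7}",     # per-position character classes
]


def agrees(pattern: str) -> bool:
    """True if the pattern reproduces the expected output for every name."""
    for name, want in zip(names, expected):
        match = re.match(pattern, name)
        if match is None or match.group(0) != want:
            return False
    return True


if __name__ == "__main__":
    for pattern in candidates:
        print(pattern, agrees(pattern))
```

All four candidates agree on this data, which is exactly why the data alone cannot tell a learner which one was intended.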
My conclusion:
The problem is that it is much more work to prepare the data for machine learning than it is to create a regex. If you want to be sure you cover everything, you need to have complete data, so then a regex is probably less effort.

You could split on non-alphanumeric characters;
[^a-zA-Z0-9']+
In this case, that would give you a few strings like this:
HAT18178
890909
098070313
1
From there on, you can simply discard the last one if it is never needed, and continue processing the earlier sequences.
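A short Python sketch of this splitting approach follows. The function names are hypothetical, and the second helper is an added variation (not part of the answer) that removes the last field while keeping the original separators in the rest of the name.

```python
import re


def split_fields(name: str) -> list:
    # Split on runs of characters that are not letters, digits or apostrophes.
    return re.split(r"[^a-zA-Z0-9']+", name)


def drop_last_field(name: str) -> str:
    # Remove the final separator run plus the field after it, whatever the
    # separator is and however long the field is.
    return re.sub(r"[^a-zA-Z0-9']+[a-zA-Z0-9']+$", "", name)


if __name__ == "__main__":
    print(split_fields("HAT18178_890909.098070313.1"))
    print(drop_last_field("HAT18178_890909.098070313_12"))
```

This only helps when the discardable part is always the last field; it does not decide for you whether that field is actually unnecessary.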

Formatting UK postal codes for storage

I want to store UK postal codes in the database. Is it OK to store those postal codes without the spaces?

It is possible to store postcodes without spaces, but I would definitely recommend formatting them correctly when they are displayed/output.
You can check out the allowed formats for postcodes here. There are always 3 characters after the space, so it's easy to reinsert it.

Last 3 are always xyy
x Digit 0-9
yy Alpha A-Z
Anything before is the first part of the grid reference and has various formats.

We store postcodes and we accept inputs in any format, space or no space, but then strip or correct the entry for data storage.
We find it works better this way when using the data for other things.
Why would you want to store with no spaces?

UK postcodes have a variety of formats:
list of formats
Why are you unable to store white spaces?

As others have said, there is no problem with removing all spaces and storing them, if that is what you want to do. As has been said, you can always format them with a space before the last three characters.
However, I would normally take them in any reasonable format, strip all spaces out, and then store them with this one extra space. The storage requirements are not an issue, and it makes it easier to simply display the value as it is. You would need to resolve the format before saving in some way, so you may as well save it in the form in which it is needed.

It's usually safe to remove the space. As others have said, you can re-insert the space later if required. The existence of a space between Outcode and Incode will not normally affect postal delivery. You should not have any non-alphanumeric characters in a UK postcode, so if you see a dash you can safely remove it.
I work for Experian Data Quality and if your aim is clean data you may want to consider an address verification web service, like our Pro On Demand product. This will ensure you capture the correct postcode, as they can change over time, and that it is formatted correctly for your database.

It is okay to store without a space because you can always add an empty space back in to each postcode string - the heuristic is pretty simple.
As some other users have very helpfully explained, all UK postcodes have two groups of numbers and letters, separated by a space. The group following the space always contains a number and then two letters (thus, there are always three characters after the space). The group before the space will have either two, three, or four characters (see this Wikipedia page).
So, you can recreate the correct spacing by adding a space before the third-to-last character.
In R, it looks like this (but the same logic would work in other languages, such as Python):
library(stringr)  # provides str_sub()

# list of example postcodes
postcodes = c("LS176JA", "OX41EZ", "A99AA")
# add a space before the last three characters of each postcode
for (postcode in postcodes){
  last_three = str_sub(postcode, start = -3)
  # everything except the last three characters
  first_x = str_sub(postcode, end = -4)
  final_postcode = paste0(first_x, " ", last_three)
  print(final_postcode)
}
Which returns:
[1] "LS17 6JA"
[1] "OX4 1EZ"
[1] "A9 9AA"
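Since the answer mentions that the same logic works in Python, here is a minimal sketch of it. The function name format_postcode is hypothetical, and the checks are assumptions drawn from the answers in this thread (5 to 7 characters once separators are removed, and a final group of one digit followed by two letters); it does not verify that the postcode actually exists.

```python
import re


def format_postcode(raw: str) -> str:
    """Normalise a UK postcode to 'OUTWARD INWARD' with a single space."""
    # Drop spaces, dashes and any other non-alphanumeric characters.
    compact = re.sub(r"[^A-Za-z0-9]", "", raw).upper()
    if not 5 <= len(compact) <= 7:
        raise ValueError(f"unexpected postcode length: {raw!r}")
    outward, inward = compact[:-3], compact[-3:]
    # The last three characters should be a digit followed by two letters.
    if not re.fullmatch(r"[0-9][A-Z]{2}", inward):
        raise ValueError(f"unexpected inward code: {raw!r}")
    return f"{outward} {inward}"


if __name__ == "__main__":
    for postcode in ["LS176JA", "ox4 1ez", "A99AA", "LS17-6JA"]:
        print(format_postcode(postcode))
```

Accepting any input format and normalising once before storage, as several answers suggest, keeps the stored values consistent whichever convention you pick.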

telephone number regex

I am currently trying to validate UK telephone numbers:
The format I'm looking for is: 01234 567891 or 01234567891. So I need the number to have 5 digits, then a space, then 6 digits, or simply 11 digits with no space.
The number must start with a 0.
I've had a look at a couple of examples:
/^[0-9]{10,11} - to check that the chars are all numbers
/^0[0-9]{9,10}$/ - to check that the first number is a 0
I'm just unsure how to put all these together and check if there is a space or not.
Could someone help me with this regex?
Thanks

Try this regex:
/^0\d{4}\s?\d{6}$/
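A quick way to convince yourself that this pattern does what the question asks is to run it against a few inputs. The sketch below uses Python rather than JavaScript, so the surrounding slashes are dropped; SIMPLE_UK_RE and is_simple_uk_number are names invented for the example.

```python
import re

# Leading 0, four more digits, an optional single whitespace, six digits.
SIMPLE_UK_RE = re.compile(r"^0\d{4}\s?\d{6}$")


def is_simple_uk_number(text: str) -> bool:
    return SIMPLE_UK_RE.match(text) is not None


if __name__ == "__main__":
    for number in ["01234 567891", "01234567891", "1234567891", "0123 4567891"]:
        print(number, is_simple_uk_number(number))
```

Note that \s accepts any single whitespace character (a tab, for instance), not only a space; use a literal space in the pattern if that matters.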

Many people try to do input validation and formatting in a single step.
It is better to separate these processes.
Match UK telephone number in any format
^(?:(?:\(?(?:0(?:0|11)\)?[\s-]?\(?|\+)44\)?[\s-]?(?:\(?0\)?[\s-]?)?)|(?:\(?0))(?:(?:\d{5}\)?[\s-]?\d{4,5})|(?:\d{4}\)?[\s-]?(?:\d{5}|\d{3}[\s-]?\d{3}))|(?:\d{3}\)?[\s-]?\d{3}[\s-]?\d{3,4})|(?:\d{2}\)?[\s-]?\d{4}[\s-]?\d{4}))(?:[\s-]?(?:x|ext\.?|\#)\d{3,4})?$
The above pattern allows the user to enter the number in any format they are comfortable with. Don't constrain the user into entering specific formats.
Extract NSN, prefix and extension
^(\(?(?:0(?:0|11)\)?[\s-]?\(?|\+)(44)\)?[\s-]?)?\(?0?(?:\)[\s-]?)?([1-9]\d{1,4}\)?[\d\s-]+)((?:x|ext\.?|\#)\d{3,4})?$
Next, extract the various elements.
$2 will be '44' if international format was used, otherwise assume national format with leading '0'.
$4 contains the extension number if present.
$3 contains the NSN part.
Validation and formatting
Use further RegEx patterns to check the NSN has the right number of digits for this number range. Finally, store the number in E.164 format or display it in E.123 format.
There's a very detailed list of validation and display formatting RegEx patterns for UK numbers at:
http://www.aa-asterisk.org.uk/index.php/Regular_Expressions_for_Validating_and_Formatting_UK_Telephone_Numbers
It's too long to reproduce here and it would be difficult to maintain multiple copies of this document.
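The idea of separating validation from formatting can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the patterns from the linked page: to_e164 is a hypothetical helper, it ignores extensions, and it assumes a national significant number (NSN) of 9 or 10 digits without checking that the number range actually exists.

```python
import re


def to_e164(raw: str) -> str:
    """Reduce a loosely formatted UK number to +44 followed by the NSN."""
    digits = re.sub(r"\D", "", raw)  # drop spaces, brackets, dashes and '+'
    if digits.startswith("0044"):
        nsn = digits[4:]
    elif digits.startswith("44"):
        nsn = digits[2:]
    elif digits.startswith("0"):
        nsn = digits[1:]
    else:
        raise ValueError(f"not a recognisable UK number: {raw!r}")
    # Handle the '+44 (0)20 ...' style, where a trunk 0 follows the country code.
    if nsn.startswith("0"):
        nsn = nsn[1:]
    if len(nsn) not in (9, 10):
        raise ValueError(f"unexpected number of digits: {raw!r}")
    return "+44" + nsn


if __name__ == "__main__":
    for number in ["01234 567891", "+44 (0)20 7123 4567", "(016977) 1234"]:
        print(to_e164(number))
```

The user can type the number however they like; the stored value is always in one canonical form, and display formatting becomes a separate step.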

If you are looking for all UK numbers, I'd look for a bit more than just that number, some are in the format 020 7123 4567 etc.
^\s*\(?(020[78]{1}\)?[ ]?[1-9]{1}[0-9]{2}[ ]?[0-9]{4})|(0[1-8]{1}[0-9]{3}\)?[ ]?[1-9]{1}[0-9]{2}[ ]?[0-9]{3})\s*$

/^[\d()+-]+$/
A simple telephone regex that allows +, ( ), and - anywhere, as well as digits.

I think ^0[\d]{4}\s?[\d]{5,6}$ will work for you. I have used [\d] instead of [0-9].
I find that RegExr is a useful online tool to check and try your regular expressions. It also has a nice library of examples to help point you in the right direction

"you should just count the number of digits and check that it's 10"
Some UK numbers have only 9 digits, not 10 (not including the leading 0).
These include 40 of the 01 area codes (using "4+5" format), the 016977 area code (using "5+4" format), all 0500 numbers and some 0800 numbers.
There's a list at: http://www.aa-asterisk.org.uk/index.php/01_numbers

This pattern for US numbers also accepts the following phone numbers:
800-432-4500, Opt: 9, Ext: 100316
800-432-4500, Opt: 9, Ext: X100316
800-432-4500, Option #3
(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4}),?(?:\s*(?:#|x\.?|opt(\.|:|\.:)?|option)\s*#?(\d+))?,?(?:\s*(?:#|x\.?|ext(\.|:|\.:)?|extension)\s*(\d+))?
(used this answer in other topic as start point)
