Formatting UK postal codes for storage - postal-code

I want to store UK postal codes in the database. Is it OK to store those postal codes without the spaces?

It is possible to store postcodes without spaces, but would definitely recommend formatting them correctly when they are displayed/output.
You can check out the allowed formats for postcodes here . There are always 3 characters after the space so it's easy to reinsert it.

Last 3 are always xyy
x Digit 0-9
yy Alpha A-Z
Anything before is the first part of the grid reference and has various formats.

we store postcodes and we accept inouts in any format, space or no space, but then strip or correct the entry for data storage
we find it works better this way when using the data for other things
Why would you want to store with no spaces?

Uk postcodes have a variety of formats:
list of formats
Why are you unable to store white spaces?

As others have said, there is no problem with removing all spaces and storing them, if that is what you want to do. As has been said, you can always format them with a space before the last three characters.
However, I would normally take them in any reasonable format, strip all spaces out, and them store them with this one extra space. The storage requirements are not an issue, and it makes it easier to simply display as it is. You would need to resolve the format before saving in some way, so you may as well save it as it is needed.

It's usually safe to remove the space. As others have said, you can re-insert the space later if required. The existence of a space between Outcode and Incode will not normally affect postal delivery. You should not have any non alpha numeric characters in a UK postcode, so if you see a dash you can safely remove it.
I work for Experian Data Quality and if your aim is clean data you may want to consider an address verification web service, like our Pro On Demand product. This will ensure you capture the correct postcode, as they can change over time, and that it is formatted correctly for your database.

It is okay to store without a space because you can always add an empty space back in to each postcode string - the heuristic is pretty simple.
As some other users have very helpfully explained, all UK postcodes have two groups of numbers and letters, separated by a space. The group following the space always contains a number and then two letters (thus, there are always three characters after the space). The group before the space will have either two, three, or four characters (see this Wikipedia page) and the screenshot below.
So, you can recreate the correct spacing by adding a space before the third-to-last character.
In R, it looks like this (but the same logic would work in other languages, such as Python):
#list of example postcodes
postcodes = c("LS176JA", "OX41EZ", "A99AA")
#add space to each postcode in the list of example postcodes
for (postcode in postcodes){
last_three = str_sub(postcode, start = -3)
first_x = str_replace(postcode, last_three, "")
final_postcode = paste0(first_x, " ", last_three)
print(final_postcode)
}
Which returns:
[1] "LS17 6JA"
[1] "OX4 1EZ"
[1] "A9 9AA"

Related

Regex to remove first two characters of a string if they are alphabets

My client uses SKUs from which they change the first two digit suffix to represent changes/updates in models. As an analyst, I need to make a unique list of SKUs to use in my data studio dashboard. A sample of the SKUs would look like:
NP9151BM01
NL9151BM01
NL6004SL01
NN6004SL01
NP1927YM05
NN1927YM05
NQ1296BM01
NG1296BM01
NQ1044YL04
NN1044YL04
NP9151YM05
9151YM05
1044YL04
I need to use regex to check if the first two characters are alphabets and remove them if they are. For example, if I have NP9151BM01 and NL9151BM01 as SKUs, I need to remove NP and NL from them to end up with the exact same SKU. However, if I have 9151YM05 or 1044YL04 as SKUs, I need to keep it as it is.
For my solution, I have researched on google and stack overflow and I've found this regex (?<=^..).*$ which will remove the first two characters in all SKUs but I'm not sure how to customise it to only remove the first two characters if they are alphabets.
I would appreciate any help that I can get with this!
To remove the first two alphabets:
=REGEXREPLACE(A2,"^[A-Z]{2}",)

Open Refine regex for alphabets

i want to edit only alphabetic charcter from my cell
.
what i have done
value.match(/.*?(\^[a-zA-Z]*$).*?/)
but it returns null
i am try to clean address column in my data set following are the sample address
H3656 GALI#4 BLOCK-D, AREA 1
H#36/17 SECTOR 5D AREA 2
AREA 3 BLOCK-B NORTH NAZIMABAD
GERMANY AL JANNAT BENQUET SECTOR 16 Area 2 with short name
so that i first try to remove all numbers from my string
If you want to remove all the numbers, the most direct approach is probably:
value.replace(/\d+/, "")
If for any reason you want to find only the alphabetic characters, as indicated by the title of your question, this will be more effective than a value.match() :
value.find(/\p{L}\s?/).join("")
(\p{L} is a Java regular expression - Openrefine is written in Java - equivalent to [a-zA-Z], but which also takes into account Unicode characters like accented letters.)
In general, you should avoid using the .match() method unless you know exactly what you are doing. In 90% of cases, it is actually .find() that is desired.

Validate Street Address Format

I'm trying to validate the format of a street address in Google Forms using regex. I won't be able to confirm it's a real address, but I would like to at least validate that the string is:
[numbers(max 6 digits)] [word(minimum one to max 8 words with
spaces in between and numbers and # allowed)], [words(minimum one to max four words, only letters)], [2
capital letters] [5 digit number]
I want the spaces and commas I left in between the brackets to be required, exactly where I put them in the above example. This would validate
123 test st, test city, TT 12345
That's obviously not a real address, but at least it requires the entry of the correct format. The data is coming from people answering a question on a form, so it will always be just an address, no names. Plus they're all address is one area South Florida, where pretty much all addresses will match this format. The problem I'm having is people not entering a city, or commas, so I want to give them an error if they don't. So far, I've found this
^([0-9a-zA-Z]+)(,\s*[0-9a-zA-Z]+)*$
But that doesn't allow for multiple words between the commas, or the capital letters and numbers for zip. Any help would save me a lot of headaches, and I would greatly appreciate it.
There really is a lot to consider when dealing with a street address--more than you can meaningfully deal with using a regular expression. Besides, if a human being is at a keyboard, there's always a high likelihood of typing mistakes, and there just isn't a regex that can account for all possible human errors.
Also, depending on what you intend to do with the address once you receive it, there's all sorts of helpful information you might need that you wouldn't get just from splitting the rough address components with a regex.
As a software developer at SmartyStreets (disclosure), I've learned that regular expressions really are the wrong tool for this job because addresses aren't as 'regular' (standardized) as you might think. There are more rigorous validation tools available, even plugins you can install on your web form to validate the address as it is typed, and which return a wealth of of useful metadata and information.
Try Regex:
\d{1,6}\s(?:[A-Za-z0-9#]+\s){0,7}(?:[A-Za-z0-9#]+,)\s*(?:[A-Za-z]+\s){0,3}(?:[A-Za-z]+,)\s*[A-Z]{2}\s*\d{5}
See Demo
Accepts Apt# also:
(^[0-9]{1,5}\s)([A-Za-z]{1,}(\#\s|\s\#|\s\#\s|\s)){1,5}([A-Za-z]{1,}\,|[0-9]{1,}\,)(\s[a-zA-Z]{1,}\,|[a-zA-Z]{1,}\,)(\s[a-zA-Z]{2}\s|[a-zA-Z]{2}\s)([0-9]{5})

RegEx to clean VISA merchant names (remove random strings)

I am trying to develop a ReGex (.Net flavor), which I can use to clean VISA merchant names.
Examples:
Norton *AP1223506209 --> Norton *AP
Norton *AP1223511428
EUROWINGS VYJD6J_123001 --> EUROWINGS
EUROWINGS W6PDFI_125626
AER LINGUCB22QKM2 --> AER LINGUCB
AER LINGUCB248L2W
AIR FRANCE JWNCSC --> AIR FRANCE
AIR FRANCE K8L7TT
PAYPAL *AIRBNB HMQXBW --> PAYPAL *AIRBNB
PAYPAL *AIRBNB HMQXNZ
SAS 1174565172360 --> SAS
SAS 1174565172368
I would like to keep the first "name" part, but remove the second "gibberish" part.
The following Regex works for Norton and Air Lingu as well as for Eurowings and Air France, if they contain numbers in the gibberish part. It totally fails for PAYPAL *AIRBNB and other strings, that don't contain any numbers in the gibberish part, and also for SAS, probably because the name is too short / there are too many spaces:
Search:
([A-z *-]{2,50}[A-z]{2,50})(.{0,3}([0-9-]{0,3}[A-z *+.#-/]{0,3}){1,10})
Replace:
$1
Is there any way to make this work for gibberish parts that don't contain numbers? I have something like this in mind, but don't manage to create an according RegEx:
Group 1 (to keep)
Must contain consonants and vowels
Can contain few numbers, spaces or punctuation signs (e.g.: "7x7: Taxi Service")
Group 2 (to be removed)
Consists of sequences of numbers, letters and optional punctuation signs
OR: consists of consonants, only
OR: consists of numbers, only
Thanks for any help and best regards
Pesche
Edit:
If I add more examples, Lindens solution still works quite well, but does not recognize all of the examples or in some cases too much of the string. I tried to adjust it, but with my lacking skills didn't quite succeed:
https://regex101.com/r/7y9zGl/4
The following problems remain:
with a length of 6 for the last \w, longer patterns would not be matched in full length (e.g. after easyjet and after EMP Merchan). Increasing it, however, causes other strings to be truncated (e.g. AER LINGU, potentially also HOTELS.COM if > 12 was used).
The merchant names after PAYPAL * and GOOGLE * should not be deleted, as they are true merchant names. I tried to exclude strings containing GOOGLE * with a negative lookbehind, but it does not seem to work like that.
Whereas the merchant name after PAYPAL * should generally remain, in some cases it is followed by gibberish, e.g. PAYPAL *AIRBNB HMQXBW. If the negative lookbehind worked, those cases would no longer be cleaned.
if the merchant name is not followed by gibberish, part of the name itself may be deleted (e.g. EMP Merchan)
As the full list of merchant names is long and versatile, the approach to detect "gibberish" should be as generic as possible (i.e. not rely on a certain length of the gibberish part). Hence my original, now slightly modified "pattern":
Consists of sequences of numbers, letters and optional punctuation signs
OR: consists non or very few vowels (EASYJET 000ESJ5TWN -> the gibberish contains only one vowel, EASYJET 3 of them; PAYPAL *NITSCHKE -> NITSCHKE should not be matched, it contains 2 vowels)
OR: consists of numbers, only
Is such a thing even possible? The goal is to use SQL to clean the merchant names. If necessary, this can be done in several run throughs (for different kind of patterns).
Thx again!
Updated regex based on extended sample and desired results:
[\s*<]+\d+$|[\s*<]+(?![A-Z]{6}.*)\w*\d[\w>]*$|\d{6,}$|[\s*<]+[A-Z]{6}$|(?![A-Z]+$)(?<=[A-Z])\w{6}$
Demo
I cannot validate as I'm only on my phone, but can you try something like this?
^([0-9A-Za-z\*][ ]{0-2})
Take all the numbers, the letters (capital and minor) the star and max 2 spaces from the beginning of the line.
Please check the () but I guess the idea is here.
Sorry, it seems wrong when there is no double space.
You want to take all the char until 2 spaces or 2 numbers according to your examples.
.* {2}|.*[0-9]{2}
Is it better?
Regards,
Thomas

How to programmatically learn regexes?

My question is a continuation of this one. Basically, I have a table of words like so:
HAT18178_890909.098070313.1
HAT18178_890909.098070313.2
HAT18178_890909.143412462.1
HAT18178_890909.143412462.2
For my purposes, I do not need the terminal .1 or .2 for this set of names. I can manually write the following regex (using Python syntax):
r = re.compile('(.*\.\d+)\.\d+')
However, I cannot guarantee that my next set of names will have a similar structure where the final 2 characters will be discardable - it could be 3 characters (i.e. .12) and the separator could change as well (i.e. . to _).
What is the appropriate way to either explicitly learn a regex or to determine which characters are unnecessary?
It's an interesting problem.
X y
HAT18178_890909.098070313.1 HAT18178_890909.098070313
HAT18178_890909.098070313.2 HAT18178_890909.098070313
HAT18178_890909.143412462.1 HAT18178_890909.143412462
HAT18178_890909.143412462.2 HAT18178_890909.143412462
The problem is that there is not a single solution but many.
Even for a human it is not clear what the regex should be that you want.
Based on this data, I would think the possibilities to learn are:
Just match a fixed width of 25: .{25}
Fixed first part: HAT18178_890909.
Then:
There's only 2 varying numbers on each single spot (as you show 2 cases).
So e.g. [01] (either 0 or 1), [94] the next spot and so on would be a good solution.
The obvious one would be \d+
But it could also be \d{9}
You see, there are multiple correct answers.
These regexes would still work if the second point would be an underscore instead.
My conclusion:
The problem is that it is much more work to prepare the data for machine learning than it is to create a regex. If you want to be sure you cover everything, you need to have complete data, so then a regex is probably less effort.
You could split on non-alphanumeric characters;
[^a-zA-Z0-9']+
That would get you, in this case, few strings like this:
HAT18178
890909
098070313
1
From there on you can simply discard the last one if that's never necessary, and continue on processing the first sequences