Regex for putting comma before city name in address - regex

Generally address comes with comma seperationa and can be splitted using simple regex. e.g
123 Main St, Los Angeles, CA, 90210
We can apply regex here and split using comma. But in my database addresses are stored without comma. e.g
A Better Property Management<br/> 6621 E PACIFIC COAST HWY<br/> STE 255<br/> LONG BEACH CA 90803-4241
And I want to put comma before the city. Something like this:
A Better Property Management<br/> 6621 E PACIFIC COAST HWY<br/> STE 255<br/> LONG BEACH ,CA 90803-4241
I was thing about finding the last two letter word from the end and put comma using regex . But I also need to account for the situations where we don't have complete address or missing city and pincodes. Is there a way this can be done. I only found solutions where we can split using comma but not the reverse.
I was thinking if we could select the last 2 words before numbers with something like [A-Za-z]{2} (don't know if this is correct). And at the same time if we can check to do this only if the string ends with numbers.
I tried
(\b(AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY|Alabama|Alaska|Arizona|Arkansas|California|Colorado|Connecticut|Delaware|District of Columbia|Florida|Georgia|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland|Massachusetts|Michigan|Minnesota|Mississippi|Missouri|Montana|Nebraska|Nevada|New Hampshire|New Jersey|New Mexico|New York|North Carolina|North Dakota|Ohio|Oklahoma|Oregon|Pennsylvania|Rhode Island|South Carolina|South Dakota|Tennessee|Texas|Utah|Vermont|Virginia|Washington|West Virginia|Wisconsin|Wyoming)\b)
https://regex101.com/r/75fqO6/1

You can use
[a-zA-Z]+\s+\d(?:[\d-]*\d)?$
Replace with ,$0.
See the regex demo. Details:
[a-zA-Z]+ - one or more letters
\s+ - one or more whitespaces
\d - a digit
(?:[\d-]*\d)? - an optional substring of zero or more digits/hyphens and then a digit
$ - end of string.
The $0 in the replacement is a backreference to the whole match value, all text matched by the regex is put back where it was found with a prepended comma.

Related

RegexReplace the nth occurrence of a string of underscores

I'm having trouble getting a REGEXREPLACE working in a Google Sheets formula. I'm aiming to replicate a certain card game which is opposed to humankind. I have a cell containing a string which contains one, two or three occurrences of a series of underscores, e.g.
"_____ is the new _____"
And let's say I want to substitute in the strings "Orange" for the first occurrence, and "Black" for the second occurrence.
I don't know how many underscores will be in each string, it could be one or more, so it seems like a job for regex. I tried SUBSTITUTE and it didn't seem to recognise asterisks. Based on this link, I tried using {1} {2} and {3} to match the first/second/third occurrence, but I'm not doing something right:
=REGEXREPLACE(G16,".*(_*){1}.*",G17)
G16 is: _____ is the new _____.
G17 is: Orange
The output of the formula is: OrangeOrange.
Can anyone help me figure out the correct way to do this?
You may use
=REGEXREPLACE(REGEXREPLACE(G16,"^([^_]*)_+","$1Orange"), "^([^_]*)_+", "$1Black")
|----- First occurrence -----------------|
|----------------- Second occurrence ------------------------------------------|
Details
^ - start of string
([^_]*) - Capturing group 1 ($1 will refer to this group value): 0 or more chars other than an underscore
_+ - 1 or more underscores.

How to remove specific characters in notepad++ with regex?

This is data present in my .txt file
+919000009998 SMS +919888888888
+919000009998 MMS +91988 88888 88
+919000009998 MMS abcd google
+919000009998 MMS amazon
I want to convert my .txt like this
919000009998 SMS 919888888888
919000009998 MMS 919888888888
919000009998 MMS abcd google
919000009998 MMS amazon
removing the + symbol, and also the spaces if present in third column only if it is a number, if it is string no operation to be performed
is there any regex to do this which can I write in search and replace in notepad++?
Ctrl+H
Find what: \+|(?<=\d)\h+(?=\d)
Replace with: LEAVE EMPTY
check Wrap around
check Regular expression
Replace all
Explanation:
\+ # + sign
| # OR
(?<=\d) # positive lookbehind, make sure we have a digit before
\h+ # 1 or more horizontal spaces
(?=\d) # positive lookahead, make sure we have a digit after
Screen capture:
All previous answer will perfectly work.
However, I'm just adding this just in case you need it:
If for some reason you had non-phone numbers on the third column separated by spaces (a street comes to mind for me +919000009998 MMS street foo nº 123 4º-B) you may use this regex instead (It will join number as long as the third column starts by +):
Search: ^[+](\S+\s+\S+\s++)(?:([^+][^\n]*)|[+])|\G\s*(\d+)
Replace by: \1\2\3
That will avoid joining the 3 and 4 on my previous example.
You have a demo here.

Capture the latest in backreference

I have this regex
(\b(\S+\s+){1,10})\1.*MY
and I want to group 1 to capture "The name" from
The name is is The name MY
I get "is" for now.
The name can be any random words of any length.
It need not be at the beginning.
It need on be only 2 or 3 words. It can be less than 10 words.
Only thing sure is that it will be the last set of repeating words.
Examples:
The name is Anthony is is The name is Anthony - "The name is Anthony".
India is my country All Indians are India is my country - "India is my country "
Times of India Alphabet Google is the company Alphabet Google canteen - "Alphabet Google"
You could try:
(\b\w+[\w\s]+\b)(?:.*?\b\1)
As demonstrated here
Explanation -
(\b\w+[\w\s]+\b) is the capture group 1 - which is the text that is repeated - separated by word boundaries.
(?:.*?\b\1) is a non-capturing group which tells the regex system to match the text in group 1, only if it is followed by zero-or-more characters, a word-boundary, and the repeated text.
Regex generally captures thelongest le|tmost match. There are no examples in your question where this would not actualny be the string you want, but that could just mean you have not found good examples to show us.
With that out of the way,
((\S+\s)+)(\S+\s){0,9}\1
would appear to match your requirements as currently stated. The "longest leftmost" behavior could still get in the way if there are e.g. straddling repetitions, like
this that more words this that more words
where in the general case regex alone cannot easily be made to always prefer the last possible match and tolerate arbitrary amounts of text after it.

Regex parse with alteryx

One of the columns has the data as below and I only need the suburb name, not the state or postcode.
I'm using Alteryx and tried regex (\<\w+\>)\s\<\w+\> but only get a few records to the new column.
Input:
CABRAMATTA
CANLEY HEIGHTS
ST JOHNS PARK
Parramatta NSW 2150
Claymore 2559
CASULA
Output
CABRAMATTA
CANLEY HEIGHTS
ST JOHNS PARK
Parramatta
Claymore
CASULA
This regex matches all letter-words up to but not including an Australian state abbreviation (since the addresses are clearly Australian):
( ?(?!(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)\b)\b[a-zA-Z]+)+
See demo
The negative look ahead includes a word boundary to allow suburbs that start with a state abbreviation (see demo).
Expanding on Bohemian's answer, you can use groupings to do a REGEXP REPLACE in alteryx. So:
REGEX_Replace([Field1], "(.*)(\VIC|NSW|QLD|TAS|SA|WA|ACT|NT)+(\s*\d+)" , "\1")
This will grab anything that matches in the first group (so just the suburb). The second and third groups match the state and the zip. Not a perfect regex, but should get you most of the way there.
I think this workflow will help you :

Regex for single space

I'm trying to match a file which is delimited by multiple spaces. The problem I have is that the first field can contain a single space. How can I match this with a regex?
Eg:
Name Other Data Other Data 2
Bob Smith XX1 0101010101
John Doe XX2 0101010101
Bob Doe XX3 0101010101
John Smith XX4 0101010101
Can I split these lines into three fields with a regex, splitting by a space but allowing for the single space in the first field?
Hi the following regex should work
(\w*\s\w*)\s+\w{2}\d\s+\d*
This would work:
Pattern:
(.*?)[ ]{2,}(.*?)[ ]{2,}(.*)
Replacement:
+$1+ -$2- *$3*
$1 contains the first column, $2 the second and $3 the third one.
Example:
http://regexr.com?32tbt
You could split at two or more spaces:
[ ]{2,}
But you are probably better off, determining the lengths of the captures of this regular expression:
(Name[ ]+)(Other Data[ ]+)
And then to use a simple substring method that slices your lines into portions of the same length.
So in your case the first capture would be 15 characters long, the second 14 and the column would have 13 (but the last one doesn't really matter, which is why it isn't actually captured). Then you take the first 15, the next 14 and the remaining characters of every line and trim each one (remove trailing whitespace).
I think the simplest is to use a regex that matches two or more spaces.
/ +/
Which breaks down as... delimiter (/) followed by a space () followed by another space one or more times (+) followed by the end delimiter (/ in my example, but is language specific).
So simply put, use regex to match space, then one or more spaces as a means to split your string.
Usually, with this kind of files, the best approach is to get a substring based on where your required information is and then trim it. I see your file contains 16 chars before the second field, you can get a substring of length 16 from the beginning which will contain your desired text. You should trim it to get only the text you need without the spaces.
If the spacing pattern you posted is consistent (if it won't change among different files of this kind) you have also another problem: what happens to longer names?
Name Other Data
Johnny AppleseeXX1
TutankamonfirstXX2
if you really want to use a regex, be sure to avoid those corner cases.