R - Remove dashes from a column with phone numbers

I'd like to create a new column of phone numbers with no dashes. I have data that is a mix of just numbers and some numbers with dashes. The data looks as follows:
Phone
555-555-5555
1234567890
555-3456789
222-222-2222
51318312491

Since you are dealing with a very straightforward substitution, you can easily use gsub to find the character you want to remove and replace it with nothing.
Assuming your dataset is called "mydf" and the column of interest is "Phone", try this:
gsub("-", "", mydf$Phone)

Building on the answer from @Ananda Mahto, it seemed useful to show how to break the numbers up again and put parentheses around the area code.
phone <- c("1234567890", "555-3456789", "222-222-2222", "5131831249")
phone <- gsub("-", "", phone)
gsub("(^\\d{3})(\\d{3})(\\d{4}$)", "(\\1) \\2 \\3", phone)
[1] "(123) 456 7890" "(555) 345 6789" "(222) 222 2222" "(513) 183 1249"
The second regex creates three capture groups, two with three digits and the final one with four. R then substitutes them back in with a space between each and parentheses around the first one. You could also put a hyphen between capture groups 2 and 3.
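For readers working outside R, the same two-step idea (strip the dashes, then regroup with three capture groups) can be sketched in Python. This is only an illustration: the function name format_phone is made up here, and it assumes 10-digit numbers; anything else comes back as the dash-free string, unchanged.

```python
import re

def format_phone(raw):
    """Strip dashes, then regroup a 10-digit number as (AAA) BBB CCCC."""
    digits = raw.replace("-", "")
    # Three capture groups: area code, exchange, line number.
    # Inputs that are not exactly 10 digits do not match and are returned as-is.
    return re.sub(r"^(\d{3})(\d{3})(\d{4})$", r"(\1) \2 \3", digits)

phones = ["1234567890", "555-3456789", "222-222-2222", "5131831249"]
print([format_phone(p) for p in phones])
# ['(123) 456 7890', '(555) 345 6789', '(222) 222 2222', '(513) 183 1249']
```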

Related

Regex Select & Replace to Clean Up US phone numbers

We pull in a list of phone numbers as part of a datafeed. They are all for North America based companies. I would like to remove any leading "1" or "+1" and any trailing information like "x100", " EXT400", etc. They are stored in MariaDB so I would like to do
UPDATE `CompanyPhone` SET `number`= REGEXP_SUBSTR(`number`,pattern)
to remove the unwanted stuff, I just need the REGEX to select the correct part of the phone number.
"1 (555) 555-5555 x100" -> "(555) 555-5555"
"+15555555555 EXT400" -> "5555555555"
" 555-555-5555" -> "555-555-5555" (remove leading space)
Basically, I need just the first 10 digits, ignoring the first digit if it is a 1, and the formatting currently in the first 10 digits ("()" or " " or "-") if it is possible to keep it.
If everything could be reformatted to (555) 555-5555 that would be a bonus but is not required. I could do this in a 2nd query if needed.

You could use REGEXP_REPLACE for this. Assuming you are using MariaDB 10.0.5 or later, you can use PCRE regular expressions. For your sample expressions, this regexp will give you the desired results (demo on Regex101). It looks for 3 groups of numbers (3 digits, 3 digits and then 4 digits) possibly preceded by a 1, and with other non-digit characters (e.g. +, -) around them.
^(?:\D*)1?(?:\D*)(\d{3})(?:\D*)(\d{3})(?:\D*)(\d{4}).*$
So your UPDATE statement will become
UPDATE `CompanyPhone` SET `number`= REGEXP_REPLACE(`number`, '^(?:\\D*)1?(?:\\D*)(\\d{3})(?:\\D*)(\\d{3})(?:\\D*)(\\d{4}).*$', '(\\1) \\2-\\3')
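If you want to try the pattern on a few values before running the UPDATE, a small Python mirror of it might look like the sketch below. The function name normalize_us_phone is hypothetical, and Python's re module is standing in for MariaDB's PCRE engine here, so treat it as a sanity check rather than a guarantee of identical behaviour.

```python
import re

# Same structure as the SQL pattern: optional junk, optional leading 1,
# then 3 + 3 + 4 digits with any non-digits between them, then the rest.
PATTERN = re.compile(r"^\D*1?\D*(\d{3})\D*(\d{3})\D*(\d{4}).*$")

def normalize_us_phone(raw):
    """Rewrite a matching value as (AAA) BBB-CCCC; leave others untouched."""
    return PATTERN.sub(r"(\1) \2-\3", raw)

for value in ["1 (555) 555-5555 x100", "+15555555555 EXT400", " 555-555-5555"]:
    print(repr(value), "->", repr(normalize_us_phone(value)))
```

Values that do not contain ten digits in that shape are returned unchanged, which matches what REGEXP_REPLACE does when nothing matches.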

Strip out variable spaces and replace with comma

I have some data in text format which reads like this:
11-Jun-97 Jason Smith Pizza 175 Cafe Australia Aaron & James
12-Jul-97 Alan Davidson Fried Chicken 183 Outdoors New Zealand Anthony
In short there is a date, person's name, food, ID, location, country and source. This is consistent throughout but the spacing between the items is variable.
I would like to replace the spaces with a comma so I can view in a spreadsheet. Thanks

The major problem you face is that we don't know what a single space means here. In the case of the country New Zealand, the space is part of the data itself which you want to keep. In the case of the many spaces separating most of the columns, they have no real meaning, and you want to replace them with something else.
That being said, you should be able to target groups of two or more spaces.
First trim away whitespace occurring at the beginning/end of each line (this version also handles lines that have spaces on only one side):
Find: ^[ ]*(.*\S)[ ]*$
Replace: $1
Then replace each internal run of two or more spaces with a comma followed by a single space:
Find: [ ]{2,}
Replace: ,[ ] <-- meaning a comma and one literal space; do not type the brackets
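If a script is an option, the two steps above can be sketched in Python. The function name spaces_to_csv and the exact spacing of the sample line are assumptions for illustration (the spacing in the question was not preserved), so adjust to your real data.

```python
import re

def spaces_to_csv(line):
    """Trim the line, then turn every run of 2+ spaces into ', '."""
    trimmed = line.strip(" ")
    # Single spaces (e.g. inside "New Zealand") are left alone.
    return re.sub(r"[ ]{2,}", ", ", trimmed)

fields = ["12-Jul-97", "Alan Davidson", "Fried Chicken", "183",
          "Outdoors", "New Zealand", "Anthony"]
sample = "  " + "    ".join(fields) + "   "  # assumed spacing
print(spaces_to_csv(sample))
# 12-Jul-97, Alan Davidson, Fried Chicken, 183, Outdoors, New Zealand, Anthony
```

This still relies on the same assumption as the find/replace version: real column gaps are always at least two spaces wide.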

Extract contents within brackets using R and Regex

I have a data-frame that contains user names in the format
"John Smith (Company Department)"
I want to extract the department from the username to add it to its own separate column.
I have tried the below code but it fails if the user name is something like
"John Smith (Company Department) John Doe)"
Can anyone help? Regex isn't my strong suit, and the code below only works when the username is non-standard, like my example above with multiple brackets:
strcol <- "John Smith (FPO Sales) John Doe)"
start_loc <- str_locate_all(pattern ='\\(FPO ',strcol)[[1]][2]
end_loc <- str_locate_all(pattern ='\\)',strcol)[[1]][2]
substr(strcol, start_loc + 1, end_loc - 1)
Expected Output:
Sales
I have also tried the post here using non greedy, but got the following error:
Error: '[' is an unrecognized escape in character string starting ""/["
Note: the company will always be the same

You may use sub
> strcol <- "John Smith (FPO Sales) John Doe)"
> sub(".*\\(FPO[^)]*?(\\w+)\\).*", "\\1", strcol)
[1] "Sales"
.*\\(FPO matches all the characters up to and including (FPO.
[^)]*? matches any character other than ), zero or more times, as few as possible.
(\\w+)\\) captures the last run of word characters before the closing bracket.
.* matches all the remaining characters.
Replacing everything that was matched with the contents of group 1 gives you the desired output.
OR
> library(stringr)
> str_extract(strcol, perl("FPO[^)]*?\\K\\w+(?=\\))"))
[1] "Sales"
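The same pattern ports almost directly to other regex engines. As a rough sketch in Python (the function name extract_department and the company parameter are made up for illustration; the question only says the company is always the same):

```python
import re

def extract_department(username, company="FPO"):
    """Return the last word inside '(<company> ...)' or None if absent."""
    pattern = r"\(" + re.escape(company) + r"[^)]*?(\w+)\)"
    match = re.search(pattern, username)
    return match.group(1) if match else None

print(extract_department("John Smith (FPO Sales) John Doe)"))  # Sales
```

Using search instead of a whole-string substitution means there is no need for the leading and trailing .* parts.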

gsub('.*\\s(.*)\\).*\\)$','\\1',strcol)
[1] "Sales"

Regex for telephone number with or without spaces

How do I create a regex that matches telephones with or without spaces in the number?
I have found:
^\+?\d+$
From another post but how do I modify that to allow 0 or more spaces in the number?

The first thing you need to decide is the exact format you want for phone numbers containing spaces. E.g.:
+535 233 4444
Is that one OK? It is divided into groups of 3, 3 and 4 digits. You can adapt the following regex to your needs:
^\+?\d{3}\s?\d{3}\s?\d{4}$
Just change the quantifiers ({3}, {4}, etc.) to change the group lengths.
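To make the "change the quantifiers" idea concrete, here is a small Python sketch that builds that kind of pattern from a list of group lengths. The helper name phone_pattern is hypothetical; it assumes an optional leading + and at most one whitespace character between groups, as in the regex above.

```python
import re

def phone_pattern(group_lengths):
    """Build ^\\+?\\d{a}\\s?\\d{b}...$ from a list of digit-group lengths."""
    groups = [r"\d{%d}" % n for n in group_lengths]
    return re.compile(r"^\+?" + r"\s?".join(groups) + r"$")

three_three_four = phone_pattern([3, 3, 4])
print(bool(three_three_four.match("+535 233 4444")))  # True
print(bool(three_three_four.match("5352334444")))     # True
print(bool(three_three_four.match("535 23 34444")))   # False
```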

This is one example:
/^(?:\s*\d{3})?\s*\d{3}\s*\d{4}\s*$/

There are a lot of ways to match telephone numbers (and a lot of valid telephone formats). Here's a simple regex to match "5555555555", "555 555 5555", "(555) 555-5555", "555-555-5555", or "555.555.5555". Note that the parentheses and the dot must be escaped to be matched literally:
^\(?\d{3}\)?( |-|\.)?\d{3}( |-|\.)?\d{4}$
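A quick way to check a pattern like this against the listed formats is a few lines of Python. In this sketch the separators are written as a character class ([ .-]) and the parentheses are escaped so they are matched literally; the name looks_like_us_phone is made up for the example.

```python
import re

# Optional "(", 3 digits, optional ")", optional separator, 3 digits,
# optional separator, 4 digits.
US_PHONE = re.compile(r"^\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}$")

def looks_like_us_phone(text):
    return US_PHONE.match(text) is not None

samples = ["5555555555", "555 555 5555", "(555) 555-5555",
           "555-555-5555", "555.555.5555", "555_555_5555"]
for s in samples:
    print(s, looks_like_us_phone(s))
```

Only the last sample should be rejected. Being this simple, the pattern also accepts unbalanced input such as "(555 555-5555", so tighten it if that matters.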

Regex for single space

I'm trying to match a file which is delimited by multiple spaces. The problem I have is that the first field can contain a single space. How can I match this with a regex?
Eg:
Name           Other Data    Other Data 2
Bob Smith      XX1           0101010101
John Doe       XX2           0101010101
Bob Doe        XX3           0101010101
John Smith     XX4           0101010101
Can I split these lines into three fields with a regex, splitting by a space but allowing for the single space in the first field?

Hi, the following regex should work:
(\w*\s\w*)\s+\w{2}\d\s+\d*

This would work:
Pattern:
(.*?)[ ]{2,}(.*?)[ ]{2,}(.*)
Replacement:
+$1+ -$2- *$3*
$1 contains the first column, $2 the second and $3 the third one.
Example:
http://regexr.com?32tbt

You could split at two or more spaces:
[ ]{2,}
But you are probably better off determining the lengths of the captures of this regular expression, applied to the header line:
(Name[ ]+)(Other Data[ ]+)
And then using a simple substring method that slices your lines into portions of the same length.
So in your case the first capture would be 15 characters long, the second 14, and the last column would have 13 (but the last one doesn't really matter, which is why it isn't actually captured). Then you take the first 15, the next 14 and the remaining characters of every line and trim each one (remove trailing whitespace).
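As a rough sketch of that header-driven slicing in Python: the helper names column_widths and slice_line are invented for the example, and the sample lines are rebuilt with ljust using the 15 and 14 character widths mentioned above, since the original spacing is an assumption.

```python
import re

def column_widths(header):
    """Measure the first two columns (text plus padding) from the header."""
    match = re.match(r"(Name[ ]+)(Other Data[ ]+)", header)
    if match is None:
        raise ValueError("header does not match the expected layout")
    return len(match.group(1)), len(match.group(2))

def slice_line(line, widths):
    """Cut a line at the measured offsets and trim each piece."""
    first, second = widths
    return (line[:first].strip(),
            line[first:first + second].strip(),
            line[first + second:].strip())

header = "Name".ljust(15) + "Other Data".ljust(14) + "Other Data 2"
row = "Bob Smith".ljust(15) + "XX1".ljust(14) + "0101010101"
widths = column_widths(header)
print(widths)                   # (15, 14)
print(slice_line(row, widths))  # ('Bob Smith', 'XX1', '0101010101')
```

Because the cut points come from positions rather than from the spaces in each row, a name that fills the whole first column still splits correctly.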

I think the simplest is to use a regex that matches two or more spaces.
/  +/
Which breaks down as: delimiter (/), followed by a space ( ), followed by another space one or more times ( +), followed by the end delimiter (/ in my example, but this is language specific). Note that there are two literal spaces before the +.
So simply put, use a regex that matches a space followed by one or more further spaces as the delimiter to split your string.
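In Python, for instance, that split could be sketched as follows. The pattern is written as [ ]{2,}, which is equivalent to "a space followed by one or more spaces"; the function name split_fields and the sample spacing are assumptions.

```python
import re

def split_fields(line):
    """Split on runs of two or more spaces; single spaces stay in the field."""
    return re.split(r"[ ]{2,}", line.strip())

row = "      ".join(["Bob Smith", "XX1", "0101010101"])  # assumed spacing
print(split_fields(row))  # ['Bob Smith', 'XX1', '0101010101']
```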

Usually, with this kind of file, the best approach is to take a substring based on where the required information sits and then trim it. I see your file contains 16 chars before the second field, so you can take a substring of length 16 from the beginning, which will contain your desired text. Trim it to get only the text you need without the spaces.
If the spacing pattern you posted is consistent (that is, it won't change between different files of this kind), you also have another problem: what happens to longer names?
Name           Other Data
Johnny AppleseeXX1
TutankamonfirstXX2
If you really want to use a regex, be sure to handle those corner cases.
