I have some data in text format which reads like this:
11-Jun-97 Jason Smith Pizza 175 Cafe Australia Aaron & James
12-Jul-97 Alan Davidson Fried Chicken 183 Outdoors New Zealand Anthony
In short there is a date, person's name, food, ID, location, country and source. This is consistent throughout but the spacing between the items is variable.
I would like to replace the spaces with a comma so I can view in a spreadsheet. Thanks
The major problem you face is that we don't know what a single space means here. In the case of the country New Zealand, the space is part of the data itself which you want to keep. In the case of the many spaces separating most of the columns, they have no real meaning, and you want to replace them with something else.
That being said, you should be able to target groups of two or more spaces.
First trim away whitespace occurring at the beginning/end of each line:
Find: ^[ ]+(.*\S+)[ ]+$
Replace: $1
Then replace the internal multi spaces with comma, followed by a single space:
Find: [ ]{2,}
Replace: ,[ ] <-- one space after the comma
Related
Generally address comes with comma seperationa and can be splitted using simple regex. e.g
123 Main St, Los Angeles, CA, 90210
We can apply regex here and split using comma. But in my database addresses are stored without comma. e.g
A Better Property Management<br/> 6621 E PACIFIC COAST HWY<br/> STE 255<br/> LONG BEACH CA 90803-4241
And I want to put comma before the city. Something like this:
A Better Property Management<br/> 6621 E PACIFIC COAST HWY<br/> STE 255<br/> LONG BEACH ,CA 90803-4241
I was thing about finding the last two letter word from the end and put comma using regex . But I also need to account for the situations where we don't have complete address or missing city and pincodes. Is there a way this can be done. I only found solutions where we can split using comma but not the reverse.
I was thinking if we could select the last 2 words before numbers with something like [A-Za-z]{2} (don't know if this is correct). And at the same time if we can check to do this only if the string ends with numbers.
I tried
(\b(AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY|Alabama|Alaska|Arizona|Arkansas|California|Colorado|Connecticut|Delaware|District of Columbia|Florida|Georgia|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland|Massachusetts|Michigan|Minnesota|Mississippi|Missouri|Montana|Nebraska|Nevada|New Hampshire|New Jersey|New Mexico|New York|North Carolina|North Dakota|Ohio|Oklahoma|Oregon|Pennsylvania|Rhode Island|South Carolina|South Dakota|Tennessee|Texas|Utah|Vermont|Virginia|Washington|West Virginia|Wisconsin|Wyoming)\b)
https://regex101.com/r/75fqO6/1
You can use
[a-zA-Z]+\s+\d(?:[\d-]*\d)?$
Replace with ,$0.
See the regex demo. Details:
[a-zA-Z]+ - one or more letters
\s+ - one or more whitespaces
\d - a digit
(?:[\d-]*\d)? - an optional substring of zero or more digits/hyphens and then a digit
$ - end of string.
The $0 in the replacement is a backreference to the whole match value, all text matched by the regex is put back where it was found with a prepended comma.
^(.*)(\r?\n\1)+$
replace with \1
The above is a great way to remove duplicate lines using REGEX
but it requires the entire line to be a duplicate
However – what would I use if I want to detect and remove dups – when the entire line s a whole is not a dup – but just the first X characters
Example:
Original File
12345 Dennis Yancey University of Miami
12345 Dennis Yancey University of Milan
12345 Dennis Yancey University of Rome
12344 Ryan Gardner University of Spain
12347 Smith John University of Canada
Dups Removed
12345 Dennis Yancey University of Miami
12344 Ryan Gardner University of Spain
12347 Smith John University of Canada
How about using a second group for checking eg the first 10 characters:
^((.{10}).*)(?:\r?\n\2.*)+
Where {n} specifies the amount of the characters from linestart that should be dupe checked.
the whole line is captured to $1 which is also used as replacement
the second group is used to check for duplicate line starts with
See this demo at regex101
Another idea would be the use of a lookahead and replace with empty string:
^(.{10}).*\r?\n(?=\1)
This one will just drop the current line, if captured $1 is ahead in the next line.
Here is the demo at regex101
For also removing duplicate lines, that contain up to 10 characters, a PCRE idea using conditionals: ^(?:(.{10})|(.{0,9}$)).*+\r?\n(?(1)(?=\1)|(?=\2$)) and replace with empty string.
If your regex flavor supports possessive quantifiers, use of .*+ will improve performance.
Be aware, that all these patterns (and your current regex) just target consecutive duplicate lines.
I'd like to create a new column of phone numbers with no dashes. I have data that is a mix of just numbers and some numbers with dashes. The data looks as follows:
Phone
555-555-5555
1234567890
555-3456789
222-222-2222
51318312491
Since you are dealing with a very straightforward substitution, you can easily use gsub to find the character you want to remove and replace it with nothing.
Assuming your dataset is called "mydf" and the column of interest is "Phone", try this:
gsub("-", "", mydf$Phone)
Building on the answer of #Ananda Mahto, it seemed useful to show how to break the numbers up again and put a parenthetical around the area code.
phone <- c("1234567890", "555-3456789", "222-222-2222", "5131831249")
phone <- gsub("-", "", phone)
gsub("(^\\d{3})(\\d{3})(\\d{4}$)", "(\\1) \\2 \\3", phone)
[1] "(123) 456 7890" "(555) 345 6789" "(222) 222 2222" "(513) 183 1249"
The second regex creates three capture groups, two with three digits and the final one with four. Then R substitutes them back in with a space between each and ( ) around the first one. You could also put hyphens between capture group 2 and capture group 3. [Not sure at all why Skype appeared out of nowhere!]
I'm considering a regex to restrict punctuation in city names (worldwide). What would be a fairly inclusive whitelist of these?
I'm thinking:
(space)
. period
- hyphen
' apostrophe
Also thinking maybe comma or slash but I don't have any examples. Are there others?
This is the most inclusive whitelist of punctuation to be found in city names. The ASCII apostrophe codepoint may not be the one used when someone is entering an apostrophe on their keyboard.
If you've discerned the encoding of the submitted text, you should be able to see if it falls under the Punctuation block:
/\p{InGeneral_Punctuation}/
If you are limiting yourself to Latin-Extended, just use those:
/\p{InLatin_Extended-A}/
Also, ask yourself: What are the consequences of someone putting a funny character into my city name? Is that worse than the consequences of someone not being able to enter their correct address, if I exclude too much?
USPS standard address formatting calls for stripping all special characters except 'necessary' hyphens and dashes used in the primary and/or secondary street address lines and hyphens in the ZIP.
So if an address is:
John O'Toole
456 N 4-1/2 St
San José, CA 99999-4545
The post office prefers envelopes be labeled:
John O Toole
456 N 4 1/2 St
San Jose CA 9999-4545
I'm trying to match a file which is delimited by multiple spaces. The problem I have is that the first field can contain a single space. How can I match this with a regex?
Eg:
Name Other Data Other Data 2
Bob Smith XX1 0101010101
John Doe XX2 0101010101
Bob Doe XX3 0101010101
John Smith XX4 0101010101
Can I split these lines into three fields with a regex, splitting by a space but allowing for the single space in the first field?
Hi the following regex should work
(\w*\s\w*)\s+\w{2}\d\s+\d*
This would work:
Pattern:
(.*?)[ ]{2,}(.*?)[ ]{2,}(.*)
Replacement:
+$1+ -$2- *$3*
$1 contains the first column, $2 the second and $3 the third one.
Example:
http://regexr.com?32tbt
You could split at two or more spaces:
[ ]{2,}
But you are probably better off, determining the lengths of the captures of this regular expression:
(Name[ ]+)(Other Data[ ]+)
And then to use a simple substring method that slices your lines into portions of the same length.
So in your case the first capture would be 15 characters long, the second 14 and the column would have 13 (but the last one doesn't really matter, which is why it isn't actually captured). Then you take the first 15, the next 14 and the remaining characters of every line and trim each one (remove trailing whitespace).
I think the simplest is to use a regex that matches two or more spaces.
/ +/
Which breaks down as... delimiter (/) followed by a space () followed by another space one or more times (+) followed by the end delimiter (/ in my example, but is language specific).
So simply put, use regex to match space, then one or more spaces as a means to split your string.
Usually, with this kind of files, the best approach is to get a substring based on where your required information is and then trim it. I see your file contains 16 chars before the second field, you can get a substring of length 16 from the beginning which will contain your desired text. You should trim it to get only the text you need without the spaces.
If the spacing pattern you posted is consistent (if it won't change among different files of this kind) you have also another problem: what happens to longer names?
Name Other Data
Johnny AppleseeXX1
TutankamonfirstXX2
if you really want to use a regex, be sure to avoid those corner cases.