A plain comma-separated list of names would be easy to handle with regular expressions, but my problem is: how would a regular expression distinguish between a list of names and a single (last name, first name) pair?
This is the example I have:
Lawrence, Billy
Alex Newell, Jess Glynne, DJ Cassidy, Nile Rodgers
These are a few of the many examples that show up in a text file I have, and I need to distinguish between them. Does anyone have a solution?
I thought about just counting the commas and distinguishing that way, but I also have examples like this:
Tisto, Sean Kingston & Flo Rida
This is the format (a list of artists), just to give you an idea of what I need in the end:
Lawrence, Billy
Alex Newell
Jess Glynne
DJ Cassidy
Nile Rodgers
Tisto
Sean Kingston
Flo Rida
To make it easier to parse, you could add some constraints. For example, you could require every name to consist of exactly two words, and when a name is missing one of them, add a filler word. Then, when you parse the file, every two words form a name, and your delimiters are ' ', ',' and '&'.
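As an alternative to padding names, here is a minimal Python sketch of a heuristic for the question above. It assumes that a line holding exactly one comma, no '&', and a single word on each side is a "Last, First" pair; the function name split_artists is hypothetical, and the rule will misclassify two one-word artists listed together.

```python
import re

def split_artists(line):
    """Split one line into artist names using a simple heuristic."""
    # Split on commas and ampersands, trimming surrounding whitespace.
    parts = [p.strip() for p in re.split(r"[,&]", line) if p.strip()]
    # Assumption: two single-word parts joined by one comma mean
    # "Last, First", so the whole line is kept as one name.
    if len(parts) == 2 and "&" not in line and all(" " not in p for p in parts):
        return [line.strip()]
    return parts

lines = [
    "Lawrence, Billy",
    "Alex Newell, Jess Glynne, DJ Cassidy, Nile Rodgers",
    "Tisto, Sean Kingston & Flo Rida",
]
artists = [name for line in lines for name in split_artists(line)]
print("\n".join(artists))
```

Any rule of this kind is a guess about the data, so it is worth spot-checking the output against the file.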
I am trying to extract contact names of a data set, however, the names are compiled in one cell and not split up by first name, middle name, last name, email, etc.
I only need to get their names because I already have a data set only with their emails, NOT their names.
How do I extract multiple case-sensitive words and split into cells?
Here's what it looks like in one cell:
I've tried several formulas I found online, and this is the only one that comes close. However, it still extracts lowercase letters that I don't need. Please help, I'm no expert with these kinds of things.
=TRANSPOSE(SPLIT(TRIM(SUBSTITUTE(REGEXREPLACE(REGEXREPLACE(REGEXREPLACE(
A2,"\b\w[^A-z]*\b"," "),"\W+"," "),"[0-9]+","")," m "," "))," "))
I expect them to have the first, middle, last names to be split into new columns like this:
Tom Billy Claudia Downey Karen Dicky Steve Harvey
OR
Tom Billy Claudia Downey Karen Dicky Steve Harvey
=ARRAYFORMULA(TRIM(IFERROR(REGEXREPLACE(IFERROR(REGEXEXTRACT(IFERROR(SPLIT(A2:A,
CHAR(10))), "(.*) .*@")), "Mr. |Mrs. ", ""))))
This formula might help. I have added conditions to strip the email address and the Mr./Ms. titles.
=TRANSPOSE(SPLIT(TRIM(SUBSTITUTE(REGEXREPLACE(REGEXREPLACE(REGEXREPLACE(
REGEXREPLACE(REGEXREPLACE(A2,
"([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})",""),
"\w+[\\.]+(?)",""),"\b\w[^A-z]*\b"," "),"\W+"," "),"[0-9]+","")," m "," "))," "))
I'm trying to use Notepad++'s regex find and replace to extract names from MS Outlook formatted meeting attendee details.
I copied and pasted the attendee details and got names like this:
Fred Jones <Fred.Jones@example.org.au>; Bob Smith <Bob.Smith@example.org.au>; Jill Hartmann <Jill.Hartmann@example.org.au>;
I'm trying to wind up with
Fred Jones; Bob Smith; Jill Hartmann;
I've tried a number of permutations of
\B<.*>; \B
on Regex 101.
Regex quantifiers are greedy: <.*> matches from the first < to the last > in one fell swoop. You want to say "any character which is neither of these" instead of just "any character".
 *<[^<>]*>
The single space and asterisk before the main expression consume any spaces before the match. Replace these matches with nothing and you will be left with just the names, like in your example.
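The difference between the greedy pattern and the negated character class can be checked outside Notepad++ as well. The following is a small Python sketch using the attendee string from the question; the variable names are only illustrative.

```python
import re

attendees = (
    "Fred Jones <Fred.Jones@example.org.au>; "
    "Bob Smith <Bob.Smith@example.org.au>; "
    "Jill Hartmann <Jill.Hartmann@example.org.au>;"
)

# Greedy: .* runs from the first "<" to the last ">" on the line.
greedy = re.sub(r" *<.*>", "", attendees)

# Negated class: each match stops at the nearest closing ">".
negated = re.sub(r" *<[^<>]*>", "", attendees)

print(greedy)   # Fred Jones;
print(negated)  # Fred Jones; Bob Smith; Jill Hartmann;
```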
This is a very common FAQ.
I need to make this exercise about regexes and text manipulation in vim.
So I have this file about the top-scoring soccer players in history, with 50 entries looking like this:
1 Cristiano Ronaldo Portugal 88 121 0.73 03 Manchester United Real Madrid
The whitespaces between the fields are tabs (\t)
Each field corresponds to a different category, etc.
The last field contains one or more clubs the player has played for (so not a fixed number of clubs).
The question: replace all tabs with a ';', except in the last field, where the clubs need to be separated by a ','.
So I thought: I just replace all of them with a comma, and then I replace the first 7 commas with a semicolon. But how do you do that? Everything - from regex to vim commands - is allowed.
The first part is easy: :2,$s/\t/,/g
But the second part, I can't seem to figure out.
Any help would be greatly appreciated.
Thanks, Zeno
This answer is similar to @Amadan's, but it makes use of the ability to provide an expression as the replace string to actually do the difficult bit of changing the first set of tabs to semicolons:
%s/\v(.{-}\t){7}/\=substitute(submatch('0'), '\t', ';', 'g')/|%s/\t/,/g
Broken down, this is a set of three substitutions. The first two are combined through a sub-replace-expression:
%s/\v(.{-}\t){7}/\=substitute(submatch('0'), '\t', ';', 'g')/
What this does is find exactly seven occurrences ({7}) of any characters followed by a tab, matched non-greedily ((.{-}\t)). Then we replace this entire match (submatch(0)) with the result of the substitute expression (\=substitute(...)). The substitute expression is simple by comparison, as it just converts all tabs to semicolons.
The last substitute just changes any other tabs on the line to commas.
See :help sub-replace-expression
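For readers who want to verify the intended result outside Vim, the same two-step idea can be sketched in Python. This is not part of the Vim answer; the function name convert and the field boundaries in the sample row are assumptions, since the tabs in the posted example are not visible.

```python
import re

def convert(line, n_fixed=7):
    """Turn the first n_fixed tabs into ';' and any later tabs into ','."""
    # Step 1: replace only the first n_fixed tabs with semicolons.
    head = re.sub(r"\t", ";", line, count=n_fixed)
    # Step 2: every tab still left belongs to the club list.
    return head.replace("\t", ",")

# Assumed field layout for the sample row from the question.
row = "1\tCristiano Ronaldo\tPortugal\t88\t121\t0.73\t03\tManchester United\tReal Madrid"
converted = convert(row)
print(converted)
```

With seven fixed fields before the club list, the sample row becomes 1;Cristiano Ronaldo;Portugal;88;121;0.73;03;Manchester United,Real Madrid.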
Here's one way you could do it:
:let @q=":s/\t/;\<cr>"
:2,$norm 7@q
:2,$s/\t/,/g
Explanation:
First, we define a macro 'q' that will replace one tab with a semicolon. Now, on any line we can simply run this macro n times to replace the first n tabs. To automatically do this to every line, we use the norm command:
:2,$norm 7@q
This is essentially the same thing as literally typing 7@q (i.e. "run macro 'q' seven times") on every line in the specified range. From there, we can simply replace every remaining tab with a comma.
:2,$s/\t/,/g
:2,$s/\t\(.*\t\)\@=/;/g
:2,$s/\t/,
Change any tabs where there is a tab later to ;
Change any remaining tabs to ,
EDIT: Misunderstood. Here is a fixed version:
:2,$s/\(\(\t.*\)\{7}\)\@<=\t/,/g
:2,$s/\t/;/g
Change any tabs where there's seven tabs before it to ,
Change any remaining tabs to ;
My PatternsOnText plugin has (among others) a :SubstituteSelected command that lets you specify the match positions. With this, you can easily replace the first 8 tabs with semicolons, and then use a regular substitute to change the remaining tabs into commas:
:2,$SubstituteSelected/\t/;/g 1-8
:2,$s/\t/,/g
We solved the issue by capturing the first 8 groups manually, ([^\t]*\t)(...)(...), separating them with semicolons (\1;\2;...;), and then replacing the remaining tabs with commas: 2,$s/\t/,/g
Thanks to everyone trying to help!
I have a dataset (PostgreSQL) with a field containing comma-separated company names. Most company names are composed of regular characters (alphanumeric + space), but some have a suffix such as ", inc." or ", ltd.". In order to split the company names into separate strings, I first need to remove the comma that introduces the company name suffix (that is an external requirement). So, for instance in
Burn To Ground, Groupwise, Ltd., People, Inc., SepiaShot
my regex should be able to remove the 2nd and the 4th commas, but not the other ones. I would like to know if this can be done using regex. I have tried several solutions using balanced groups and look-arounds, but I couldn't make it work.
Aelor was close, but used a positive rather than a negative assertion and didn't handle the space. (Actually, per comment, Aelor answered the specific question posed; I'm showing how to avoid removing the commas entirely by ignoring them when splitting).
Also added a comprehensive list of company name suffixes from corporateinformation.com.
regress=> SELECT regexp_split_to_table(
'Burn To Ground, Groupwise, Ltd., People, Inc., SepiaShot',
'\,(?!\s(?:A\. en P\.|AB|AB|A\.C\.|ACE|AD|AE|AG|AG|AG|AL|AmbA|ANS|Apb|ApS|ApS & Co\. K/S|AS|A/S|A\.S\.|A\.S\.|A\.S\.|A\.S\.|ASA|AVV|Bpk|Bt|B\.V\.|B\.V\.|B\.V\.|BVBA|CA|Corp\.|C\.V\.|CVA|CVoA|DA|d/b/a|d\.d\.|d\.d\.|d\.n\.o\.|d\.o\.o\.|d\.o\.o\.|EE|EEG|EIRL|ELP|EOOD|EPE|EURL|e\.V\.|GbR|GCV|GesmbH|GIE|GmbH & Co\. KG|GmbH|GmbH|GmbH|HB|hf|IBC|Inc\.|Inc|I/S|j\.t\.d\.|KA/S|Kb|Kb|KD|k\.d\.|k\.d\.|KDA|k\.d\.d\.|Kft|KG|KG|KGaA|KK|Kkt|Kol\. SrK|Kom\. SrK|k\.s\.|K/S|KS|Kv|Ky|Lda|LDC|LLC|LLP|Ltd\.|Ltda|Ltée\.|N\.A\.|NT|NV|NV|NV|NV|OE|OHG|OHG|OOD|OÜ|Oy|OYJ|P/L|PC Ltd|PLC|PMA|PMDN|PrC|Prp\. Ltd\.|PT|Pty\.|RAS|Rt|S\. de R\.L\.|S\. en C\.|S\. en N\.C\.|S/A|SA|SA|SA|sa|SA|SA|SA|SA|SA|SA|SA|S\.A\.|SA de CV|SAFI|S\.A\.I\.C\.A\.|SApA|Sarl|Sarl|SAS|SC|SC|S\.C\.|SCA|SCA|SCP|SCS|S\.C\.S\.|SCS|Sdn Bhd|SENC|SGPS|SK|SNC|SNC|SNC|SNC|SOPARFI|sp|SpA|spol s\.r\.o\.|SPRL|Sp\. z\.o\.o\.|Srl|Srl|Srl|Srl|Srl|td|TLS|VEB|VOF|v\.o\.s\.)) ?',
'i'
);
regexp_split_to_table
-----------------------
Burn To Ground
Groupwise, Ltd.
People, Inc.
SepiaShot
(4 rows)
Tested on PostgreSQL 9.3.
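The same negative-lookahead split can be tried in Python. This is a sketch under the assumption that a short, hypothetical suffix list is enough for illustration; a real list would need to be as long as the one above.

```python
import re

# Hypothetical, shortened suffix list; each entry is already regex-escaped.
SUFFIXES = [r"Ltd\.", r"Inc\.", r"Corp\.", r"LLC", r"GmbH"]

# Split on a comma unless it is followed by whitespace plus a suffix;
# the trailing " ?" also consumes the space after a splitting comma.
pattern = re.compile(r",(?!\s(?:" + "|".join(SUFFIXES) + r")) ?", re.IGNORECASE)

names = "Burn To Ground, Groupwise, Ltd., People, Inc., SepiaShot"
companies = pattern.split(names)
print(companies)
# ['Burn To Ground', 'Groupwise, Ltd.', 'People, Inc.', 'SepiaShot']
```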
Consider non-US company suffixes too, e.g. the German "GmbH". I strongly recommend that you treat the results of your substitution as suspect and have a human verify that they are correct.
You can use this regex:
\,(?=\s(?:Ltd|Inc))
I assume you want to remove the commas before these words only. If you have more words, such as Corp. or Reg., you can add them to the regex with a | like this:
\,(?=\s(?:Ltd|Inc|Corp|Reg))
Modify this regex according to your requirements.
Here is a demo for quick reference:
http://regex101.com/r/rT5zB1
Check the substitution result there.
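To illustrate the positive-lookahead removal, here is a short Python sketch. It only assumes the sample string from the question and the two suffixes named in this answer.

```python
import re

names = "Burn To Ground, Groupwise, Ltd., People, Inc., SepiaShot"

# Drop a comma only when it is followed by whitespace and a known suffix.
cleaned = re.sub(r",(?=\s(?:Ltd|Inc))", "", names)
print(cleaned)  # Burn To Ground, Groupwise Ltd., People Inc., SepiaShot

# With the suffix commas gone, a plain split separates the companies.
companies = cleaned.split(", ")
print(companies)
```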
I am trying to come up with a regular expression that I can use to find lines in a txt file that contain names in ALL CAPS, using Notepad++ or a similar tool. Once I find a line that matches, I want to add three line breaks.
I have various conditions since the lines are names. Some of the names are only two characters. Some have hyphens. Some have multiple names. Some don't have spaces after their last name and comma. Here are some examples:
DOE, JOHN L
DOE-SMITH, JOHN L
DO, JO L
DOE, JOHN BOB L
DOE,JOHN L
I can run this in other programs as well. Just trying to figure this out so I can get it finished.
EDIT: I was using [A-Z]+, [A-Z]+ but it didn't select the whole line and it didn't account for spaces and hyphens.
ANSWER: The following regex met my needs:
^(?!.*[a-z])(?!.*[0-9]).+$
Part 2 ANSWER: I also made an adjustment in order to do the second part of my request which was to add three line breaks ahead of the matched item.
^((?!.*[a-z\d]).+)$
I also made sure Match Case was selected and the search mode was set to Regular Expression, then replaced with the following:
\n\n\n\1
Thanks Everyone!
Use a negative look ahead for a lowercase char:
^(?!.*[a-z]).+$
This matches "any line that doesn't contain a lowercase letter".
To also disallow numbers:
^(?!.*[a-z\d]).+$
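The lookahead can be exercised in Python with the MULTILINE flag, which makes ^ and $ work per line as they do in Notepad++. This is a sketch; the sample text mixes names from the question with two made-up non-matching lines.

```python
import re

# Two of these lines are invented to show what the pattern skips.
text = "Intake notes\nDOE, JOHN L\nseen on 3 May\nDOE-SMITH, JOHN L\nDOE,JOHN L"

# A line matches only if no lowercase letter or digit appears anywhere in it.
ALL_CAPS = re.compile(r"^((?!.*[a-z\d]).+)$", re.MULTILINE)

matches = ALL_CAPS.findall(text)
print(matches)  # ['DOE, JOHN L', 'DOE-SMITH, JOHN L', 'DOE,JOHN L']

# Insert three line breaks ahead of every matching line.
spaced = ALL_CAPS.sub(r"\n\n\n\1", text)
print(spaced)
```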
Use Extended Regular Expressions with POSIX Character Classes
This will work for your provided corpus using GNU grep. Adapt to suit any changes to your data.
$ grep \
--extended-regexp \
--only-matching \
--regexp='[[:upper:]-]+, ?[[:upper:]]+' \
/tmp/corpus
DOE, JOHN
DOE-SMITH, JOHN
DO, JO
DOE, JOHN
DOE,JOHN
Adding Newline Characters with GNU Sed
You can perform this operation with the append operation in GNU sed. For example:
$ sed \
--regexp-extended '/[[:upper:]-]+, ?[[:upper:]]+/a\\n\n\n' \
/tmp/corpus
DOE, JOHN L
DOE-SMITH, JOHN L
DO, JO L
DOE, JOHN BOB L
DOE,JOHN L