Can we create a regular expression that matches every founder in this list?

User #adventured posted this on Hacker News:
Paul Graham (31, Viaweb); Jan Koum (33, WhatsApp); Brian Acton (37, WhatsApp); Ev Williams (34, Twitter); Jack Dorsey (33, Square); Elon Musk (32, Tesla | 31, SpaceX | 27, PayPal); Garrett Camp (30, Uber); Travis Kalanick (32, Uber); Brian Chesky (27, Airbnb); Adam Neumann (31, WeWork); Reed Hastings (37, Netflix); Reid Hoffman (36, LinkedIn); Jack Ma (35, Alibaba); Jeff Bezos (30, Amazon); Jerry Sanders (33, AMD); Marc Benioff (35, Salesforce); Ross Perot (32, EDS); Peter Norton (39, Norton); Larry Ellison (33, Oracle); Mitch Kapor (32, Lotus); Leonard Bosack (32, Cisco); Sandy Lerner (29, Cisco); Gordon Moore (39, Intel); Mark Cuban (37, Broadcast.com); Scott Cook (31, Intuit); Nolan Bushnell (29, Atari); Paul Galvin (33, Motorola); Irwin Jacobs (52, Qualcomm); David Duffield (46, PeopleSoft | 64, Workday); Aneel Bhusri (39, Workday); Thomas Siebel (41, Siebel Systems); John McAfee (42, McAfee); Gary Hendrix (32, Symantec); Scott McNealy (28, Sun); Pierre Omidyar (28, eBay); Rich Barton (29, Expedia | 38, Zillow); Jim Clark (38, SGI | 49, Netscape); Charles Wang (32, CA); David Packard (27, HP); Craig Newmark (43, Craigslist); John Warnock (42, Adobe); Robert Noyce (30, Fairchild | 41, Intel); Rod Canion (37, Compaq); Jen-Hsun Huang (30, nVidia); James Goodnight (33, SAS); John Sall (28, SAS); Eli Harari (41, SanDisk); Sanjay Mehrotra (28, SanDisk); Al Shugart (48, Seagate); Finis Conner (34, Seagate); Henry Samueli (37, Broadcom); Henry Nicholas (32, Broadcom); Charles Brewer (36, Mindspring); William Shockley (45, Shockley); Ron Rivest (35, RSA); Adi Shamir (30, RSA); John Walker (32, Autodesk); Halsey Minor (30, CNet); David Filo (28, Yahoo); Jeremy Stoppelman (27, Yelp); Eric Lefkofsky (39, Groupon); Andrew Mason (29, Groupon); Markus Persson (30, Mojang); David Hitz (28, NetApp); Brian Lee (28, Legalzoom); Demis Hassabis (34, DeepMind); Tim Westergren (35, Pandora); Martin Lorentzon (37, Spotify); Ashar Aziz (44, FireEye); Kevin O'Connor (36, DoubleClick); 
Ben Silbermann (28, Pinterest); Evan Sharp (28, Pinterest); Steve Kirsch (38, Infoseek); Stephen Kaufer (36, TripAdvisor); Michael McNeilly (28, Applied Materials); Eugene McDermott (52, Texas Instruments); Richard Egan (43, EMC); Gary Kildall (32, Digital Research); Hasso Plattner (28, SAP); Robert Glaser (32, Real Networks); Patrick Byrne (37, Overstock.com); Marc Lore (33, Diapers.com); Ed Iacobucci (36, Citrix Systems); Ray Noorda (55, Novell); Tom Leighton (42, Akamai); Daniel Lewin (28, Akamai); Diane Greene (43, VMWare); Mendel Rosenblum (36, VMWare); Michael Mauldin (35, Lycos); Tom Anderson (33, MySpace); Chris DeWolfe (37, MySpace); Mark Pincus (41, Zynga); Caterina Fake (34, Flickr); Stewart Butterfield (31, Flickr | 36, Slack); Kevin Systrom (27, Instagram); Adi Tatarko (37, Houzz); Brian Armstrong (29, Coinbase); Pradeep Sindhu (43, Juniper); Peter Thiel (31, PayPal | 37, Palantir); Jay Walker (42, Priceline.com); Bill Coleman (48, BEA Systems); Evan Goldberg (35, NetSuite); Fred Luddy (48, ServiceNow); Michael Baum (41, Splunk); Nir Zuk (33, Palo Alto Networks); David Sacks (36, Yammer); Jack Smith (28, Hotmail); Sabeer Bhatia (28, Hotmail); Chad Hurley (28, YouTube); Andy Rubin (37, Danger | 41, Android); Rodney Brooks (36, iRobot); Jeff Hawkins (35, Palm); Tom Gosner (39, DocuSign); Niklas Zennström (37, Skype); Janus Friis (27, Skype); George Kurtz (40, CrowdStrike); Trip Hawkins (28, EA); Gabe Newell (33, Valve); David Bohnett (38, Geocities); Bill Gross (40, GoTo.com/Overture); Subrah Iyar (38, WebEx); Eric Yuan (41, Zoom); Min Zhu (47, WebEx); Bob Parsons (47, GoDaddy); Wilfred Corrigan (43, LSI); Joe Parkinson (33, Micron); Aart J. 
de Geus (32, Synopsys); Patrick Byrne (37, Overstock); Matthew Prince (34, Cloudflare); Ben Uretsky (28, DigitalOcean); Tom Preston-Werner (28, GitHub); Louis Borders (48, Webvan); John Moores (36, BMC Software); Vivek Ranadivé (40, Tibco); Pony Ma (27, Tencent); Robin Li (32, Baidu); Liu Qiangdong (29, JD.com); Lei Jun (40, Xiaomi); Ren Zhengfei (38, Huawei); Arkady Volozh (36, Yandex); Hiroshi Mikitani (34, Rakuten); Morris Chang (56, Taiwan Semi); Cheng Wei (29, Didi Chuxing); James Liang (29, Ctrip); Zhang Yiming (29, ByteDance);
I tried to write a regex in which each match group corresponds to one of these founders. I was able to match 136 of the 144 entries, but I'm confused about how to capture the founders with pipe-separated entries (Elon Musk, David Duffield, Rich Barton, Robert Noyce, etc.). Here is an example:
Elon Musk (32, Tesla | 31, SpaceX | 27, PayPal);
I know I can escape the pipes with \| but even wrapping the "paren part" with an * doesn't seem to do it.
Here's the regular expression I created:
([A-Za-zé'.\/\s+-]+{2})\s+\(([0-9]+),\s+([A-Za-z0-9\s+.-\|]+\s?)\);
(I removed the last semicolon so that I could perform my matches after just running split(";") on the file contents.)
I created a simple repro which is here: https://github.com/arthurcolle/founders
Here's the code inline, in case you don't want to just go to the very simple repro:
rgx = /([A-Za-zé'.\/\s+-]+{2})\s+\(([0-9]+),\s+([A-Za-z0-9\s+.-\|]+\s?)\)/
FOUNDERS_FILE = "/Users/stochastic-thread/founders/founders.txt"
file = File.read(FOUNDERS_FILE)
items = file.split(";")
items.each {|item|
  matched = rgx.match(item)
  if matched and matched.size == 4
    group = "#{matched[1]},#{matched[2]},#{matched[3]}\n"
    puts group
    File.open("founders.csv", mode: "a") do |f|
      f.write(group)
    end
  end
}
What regular expression would match every "founder-company" group, given that any founder might have founded multiple companies, each with a corresponding age (in the specific format shown above for Elon Musk)? (The ö character is Unicode, so I don't think I can match it: when I put it in the name match group, I was told that multi-byte characters don't work.)
I know that I can just find entries that don't match the regex, and use a different regex that only matches the parenthesis format, or even just split again on the pipes, but I'm trying to find a "perfect regex" for this.

The question only asks for the founders to be matched, so initially I have not included their enterprises. Later, however, I will discuss a possible way to organize all the information.
Use String#scan with the following regular expression, which I've defined in free-spacing mode to make it self-documenting.
r = /
    (?<=\A|;\s)  # match the beginning of the string or a semicolon
                 # followed by a whitespace char in a positive lookbehind
    [\p{L} ]+    # match one or more Unicode letters or spaces
    (?=\s\()     # match a whitespace followed by "(" in a positive lookahead
    /x           # free-spacing regex definition mode
str = "Paul Graham (31, Viaweb); Jan Koum (33, WhatsApp); Brian Acton (37, WhatsApp); " +
"Elon Musk (32, Tesla | 31, SpaceX | 27, PayPal); Garrett Camp (30, Uber); " +
"Travis Kalanick (32, Uber);"
str.scan(r)
#=> ["Paul Graham", "Jan Koum", "Brian Acton", "Elon Musk", "Garrett Camp",
# "Travis Kalanick"]
This regular expression is conventionally written as follows.
/(?<=\A|; )[\p{L} ]+(?= \()/
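Note that the class `[\p{L} ]` admits only letters and spaces, so names in the full list that contain an apostrophe, hyphen or period (Kevin O'Connor, Tom Preston-Werner, Aart J. de Geus) fall outside it. As a rough sketch of one way around that, the Python variant below captures everything up to the opening parenthesis instead of listing the allowed characters. The pattern and the names `NAME_BEFORE_PAREN` and `founder_names` are illustrative assumptions, not part of the answer above. Python's built-in `re` module has no `\p{L}` and requires fixed-width lookbehinds, so the sketch consumes the separator and captures the name in a group.

```python
import re

# Hypothetical pattern: a name is whatever sits between the start of the text
# (or a semicolon) and the "(" that opens the age/company list.
NAME_BEFORE_PAREN = re.compile(r"(?:\A|;\s*)([^;()]+?)\s*\(")


def founder_names(text):
    """Return every founder name found in a semicolon-separated list."""
    return NAME_BEFORE_PAREN.findall(text)


sample = (
    "Paul Graham (31, Viaweb); Kevin O'Connor (36, DoubleClick); "
    "Elon Musk (32, Tesla | 31, SpaceX | 27, PayPal); "
    "Aart J. de Geus (32, Synopsys); Niklas Zennström (37, Skype);"
)

print(founder_names(sample))
```

Because the name is defined by what it may not contain (semicolons and parentheses), accented letters and punctuation inside names need no special handling.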
If additional information is needed it may be desirable to create a hash such as the following.
r = /
    (?<=\A|;\s)  # match the beginning of the string or a semicolon
                 # followed by a whitespace char in a positive lookbehind
    [\p{L} ]+    # match one or more Unicode letters or spaces
    \([^)]+      # match a "(" followed by > 0 characters other than ")"
    /x
h = str.scan(r).
        map { |s| s.split(/ \(/) }.
        each_with_object({}) do |(name, startups),h|
          h[name] = startups.split(/ *\| */).map do |s|
            age, co = s.split(/, +/)
            { age: age.to_i, co: co }
          end
        end
#=> {"Paul Graham" =>[{:age=>31, :co=>"Viaweb"}],
# "Jan Koum" =>[{:age=>33, :co=>"WhatsApp"}],
# "Brian Acton" =>[{:age=>37, :co=>"WhatsApp"}],
# "Elon Musk" =>[{:age=>32, :co=>"Tesla"}, {:age=>31, :co=>"SpaceX"},
# {:age=>27, :co=>"PayPal"}],
# "Garrett Camp" =>[{:age=>30, :co=>"Uber"}],
# "Travis Kalanick"=>[{:age=>32, :co=>"Uber"}]}
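For readers more comfortable with Python, the same structure can be sketched as follows. This is a minimal illustration under my own assumptions, not a translation endorsed by the answer: the pattern `ENTRY` and the function `parse_founders` are hypothetical names, and the name part is matched as "anything except semicolons and parentheses" rather than with a letter class.

```python
import re

# Hypothetical pattern: group 1 is the name, group 2 is the text inside "( ... )".
ENTRY = re.compile(r"([^;()]+?)\s*\(([^)]*)\)")


def parse_founders(text):
    """Map each founder to a list of {"age": ..., "co": ...} dicts."""
    founders = {}
    for name, startups in ENTRY.findall(text):
        ventures = []
        # Several ventures are separated by "|"; each is "age, company".
        for part in startups.split("|"):
            age, company = part.split(",", 1)
            ventures.append({"age": int(age), "co": company.strip()})
        founders[name.strip()] = ventures
    return founders


sample = (
    "Paul Graham (31, Viaweb); "
    "Elon Musk (32, Tesla | 31, SpaceX | 27, PayPal); "
    "Garrett Camp (30, Uber);"
)

for name, ventures in parse_founders(sample).items():
    print(name, ventures)
```

As with the Ruby hash, a founder who appears twice in the input keeps only the last entry, since the name is the key.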
One could then easily compute, for example,
h.each_with_object(Hash.new { |h,k| h[k] = [] }) do |(name, cos),g|
  cos.each { |co| g[co[:co]] << name }
end
#=> {"Viaweb"=>["Paul Graham"],
# "WhatsApp"=>["Jan Koum", "Brian Acton"],
# "Tesla"=>["Elon Musk"],
# "SpaceX"=>["Elon Musk"],
# "PayPal"=>["Elon Musk"],
# "Uber"=>["Garrett Camp", "Travis Kalanick"]}
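The same inversion (company to founders) can be sketched in Python with a `defaultdict`, which plays the role of the Ruby hash with a default block. The input dict and the function name `founders_by_company` are illustrative assumptions.

```python
from collections import defaultdict

# Illustrative input in the same shape as the hash computed above.
founders = {
    "Jan Koum": [{"age": 33, "co": "WhatsApp"}],
    "Brian Acton": [{"age": 37, "co": "WhatsApp"}],
    "Elon Musk": [{"age": 32, "co": "Tesla"}, {"age": 31, "co": "SpaceX"}],
}


def founders_by_company(founders):
    """Invert founder -> ventures into company -> list of founder names."""
    by_company = defaultdict(list)
    for name, ventures in founders.items():
        for venture in ventures:
            by_company[venture["co"]].append(name)
    return dict(by_company)


print(founders_by_company(founders))
```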
The regular expression used here is conventionally written:
/(?<=\A|; )[\p{L} ]+\([^\)]+/
The steps to compute h are as follows.
a = str.scan(r)
#=> ["Paul Graham (31, Viaweb", "Jan Koum (33, WhatsApp", "Brian Acton (37, WhatsApp",
# "Elon Musk (32, Tesla | 31, SpaceX | 27, PayPal", "Garrett Camp (30, Uber",
# "Travis Kalanick (32, Uber"]
b = a.map { |s| s.split(/ \(/) }
#=> [["Paul Graham", "31, Viaweb"], ["Jan Koum", "33, WhatsApp"],
# ["Brian Acton", "37, WhatsApp"],
# ["Elon Musk", "32, Tesla | 31, SpaceX | 27, PayPal"],
# ["Garrett Camp", "30, Uber"], ["Travis Kalanick", "32, Uber"]]
h = b.each_with_object({}) do |(name, startups),h|
  h[name] = startups.split(/ *\| */).map do |s|
    age, co = s.split(/, +/)
    { age: age.to_i, co: co }
  end
end
#=> <as above>
In computing h from b, when
name = "Elon Musk"
startups = "32, Tesla | 31, SpaceX | 27, PayPal"
h = {"Paul Graham" =>[{:age=>31, :co=>"Viaweb"}],
"Jan Koum" =>[{:age=>33, :co=>"WhatsApp"}],
"Brian Acton" =>[{:age=>37, :co=>"WhatsApp"}]}
the block calculation is as follows.
c = startups.split(/ *\| */)
#=> ["32, Tesla", "31, SpaceX", "27, PayPal"]
d = c.map do |s|
  age, co = s.split(/, +/)
  { age: age.to_i, co: co }
end
#=> [{:age=>32, :co=>"Tesla"}, {:age=>31, :co=>"SpaceX"},
# {:age=>27, :co=>"PayPal"}]
h[name] = d
#=> [{:age=>32, :co=>"Tesla"}, {:age=>31, :co=>"SpaceX"},
# {:age=>27, :co=>"PayPal"}]
and now
h #=> {"Paul Graham"=>[{:age=>31, :co=>"Viaweb"}],
# "Jan Koum" =>[{:age=>33, :co=>"WhatsApp"}],
# "Brian Acton"=>[{:age=>37, :co=>"WhatsApp"}],
# "Elon Musk" =>[{:age=>32, :co=>"Tesla"}, {:age=>31, :co=>"SpaceX"},
# {:age=>27, :co=>"PayPal"}]}

My guess is that the following expression might work:
\s*(\d+)\s*,\s*([^)|]*)(?=\s*\||\s*\))|([^(\r\n]*)\(
Test
re = /\s*(\d+)\s*,\s*([^)|]*)(?=\s*\||\s*\))|([^(\r\n]*)\(/
str = 'Elon Musk (32, Tesla | 31, SpaceX | 27, PayPal);
Garrett Camp (30, Uber);
'
str.scan(re) do |match|
puts match.to_s
end
Output
[nil, nil, "Elon Musk "]
["32", "Tesla ", nil]
["31", "SpaceX ", nil]
["27", "PayPal", nil]
[nil, nil, "Garrett Camp "]
["30", "Uber", nil]
If you wish to explore, simplify, or modify the expression, it is explained in the top-right panel of regex101.com. If you'd like, you can also see there how it would match against some sample inputs.
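Because this expression yields a flat stream of matches (a name match followed by one match per venture), the caller still has to fold the stream into one record per founder. A minimal Python sketch of that folding step follows; the function name `group_by_founder` is hypothetical, and Python's `findall` returns empty strings where Ruby's `scan` returns nil.

```python
import re

# Same expression as above: groups 1-2 capture "age, company",
# group 3 captures the text before an opening parenthesis (the name).
SCAN = re.compile(r"\s*(\d+)\s*,\s*([^)|]*)(?=\s*\||\s*\))|([^(\r\n]*)\(")


def group_by_founder(text):
    """Fold the flat match stream into {name: [(age, company), ...]}."""
    grouped = {}
    current = None
    for age, company, name in SCAN.findall(text):
        if name.strip():
            # A name match starts a new founder record.
            current = name.strip()
            grouped[current] = []
        elif age and current is not None:
            # A venture match belongs to the most recent founder.
            grouped[current].append((int(age), company.strip()))
    return grouped


text = "Elon Musk (32, Tesla | 31, SpaceX | 27, PayPal);\nGarrett Camp (30, Uber);\n"
print(group_by_founder(text))
```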

Related

Validate Title Case Full Name with Regex

To learn regex, I have been solving practice problems. This is one of them. I know regex might not be the best tool for it, and my regex is a mess, but I liked the challenge.
Problem:
Names need to be in Title Case;
There are exceptions for some lowercase words inside a name;
There are also exceptions for some names, e.g.: McDonald, MacDuff, D'Estoile;
Names with ' and - are accepted, and sometimes they appear as o'Brien, O'brien, O'Brien, O' Brien or 'Ehu Kali;
No whitespace at the beginning or end of the name;
No more than one space between the names of a full name;
A . is accepted if it is not alone, e.g.: Dan . Ferdnand (isn't accepted) and Dan G. Ferdnand (is accepted);
Numbers and symbols are not accepted;
However, Roman numerals are accepted and aren't Title Case, e.g.: Elizabeth II;
Some names can stand alone, e.g.: Akihito (Prince of Japan);
Some special characters common in some countries are accepted, e.g.: Valeh ßlÿsgÿroğlu, Lażżru Role, Alaksiej Taraškievič
Regex
The code is
^(?![ ])(?!.*(?:\d|[ ]{2}|[!$%^&*()_+|~=`\{\}\[\]:";<>?,\/]))(?:(?:e|da|do|das|dos|de|d'|la|las|el|los|l'|al|of|the|el-|al-|di|van|der|op|den|ter|te|ten|ben|ibn)\s*?|(?:[A-ZàáâäãåąčćęèéêëėįìíîïłńòóôöõøùúûüųūÿýżźñçčšžÀÁÂÄÃÅĄĆČĖĘÈÉÊËÌÍÎÏĮŁŃÒÓÔÖÕØÙÚÛÜŲŪŸÝŻŹÑßÇŒÆČŠŽ∂ð'][^\s]*\s*?)(?!.*[ ]$))+$
And the Regex101 with a validation list
References
What I tried so far was based on these:
regular expression for first and last name
Regular Expression to disallow two consecutive white spaces in the middle of a string
A regex to test if all words are title-case
How to use Regular Expressions (Regex) in Microsoft Excel both in-cell and loops
Use Regex to Split Numbered List array into Numbered List Multiline
Not working
I wrote this regex and don't know how to stop it from matching the cases below, which currently match:
CAPITAL LETTER
AlTeRnAtE LeTtEr
And these don't match but should:
Urxan Əbűlhəsənzadə
İsmət Jafarov
Şükür Hagverdiyev
Űmid Abdurrahimov
Ġerardo Seralta
Ċikku Paris
Question
Is there a way to optimize this regex (monster)?
And how do I fix the problems described above under "Not working"?
P.S.: The list of names with examples for validation can be found via the link to Regex101.
Brief
Seeing as how you're learning Regex and haven't specified a regex flavour to use, I've chosen PCRE as it has a wide variety of support in the regex world.
Code
See this regex in use here
(?(DEFINE)
(?# Definitions )
(?<valid_nameChars>[\p{L}\p{Nl}])
(?<valid_nonNameChars>[^\p{L}\p{Nl}\p{Zs}])
(?<valid_startFirstName>(?![a-z])[\p{L}'])
(?<valid_upperChar>(?![a-z])\p{L})
(?<valid_nameSeparatorsSoft>[\p{Pd}'])
(?<valid_nameSeparatorsHard>\p{Zs})
(?<valid_nameSeparators>(?&valid_nameSeparatorsSoft)|(?&valid_nameSeparatorsHard))
(?# Invalid combinations )
(?<invalid_startChar>^[\p{Zs}a-z])
(?<invalid_endChar>.*[^\p{L}\p{Nl}.\p{C}]$)
(?<invalid_unaccompaniedSymbol>.*(?&valid_nameSeparatorsHard)(?&valid_nonNameChars)(?&valid_nameSeparatorsHard))
(?<invalid_overTwoUpper>(?:(?&valid_nameChars)*\p{Lu}){3})
(?<invalid>(?&invalid_startChar)|(?&invalid_endChar)|(?&invalid_unaccompaniedSymbol)|(?&invalid_overTwoUpper))
(?# Valid combinations )
(?<valid_name>(?:(?:(?&valid_nameChars)|(?&valid_nameSeparatorsSoft))*(?&valid_nameChars)+(?:(?&valid_nameChars)|(?&valid_nameSeparatorsSoft))*)+\.?)
(?<valid_firstName>(?&valid_startFirstName)(?:\.|(?&valid_name)*))
(?<valid_multipleName>(?&valid_firstName)(?=.*(?&valid_nameSeparators)(?&valid_upperChar))(?:(?&valid_nameSeparatorsHard)(?&valid_name))+)
(?<valid>(?&valid_multipleName)|(?&valid_firstName))
)
^(?!(?&invalid))(?&valid)$
Results
Input
== 1NcOrrect N4M3S ==
CAPITAL LETTER
AlTeRnAtE LeTtEr
Natalia maria
Natalia aria
Natalia orea
Maria dornelas
Samuel eto'
Miguel lasagna
Antony1 de Home Ap*ril
Ap*ril Willians
Antony_ de Home Apr+il
Ant_ony de Home Apr#il
Antony# de Ho#me Apr^il
Maria Silva
Maria silva
maria Silva
Maria Silva
Maria Silva
Maria / Silva
Maria . Silva
John W8
==Correct Names==
Urxan Əbűlhəsənzadə
İsmət Jafarov
Şükür Hagverdiyev
Űmid Abdurrahimov
Ġerardo Seralta
Ċikku Paris
Hind ibn Sheik
Colop-U-Uichikin
Lażżru Role
Alaksiej Taraškievič
Petruso Husoǔski
Sumu-la-El
Valeh ßlÿsgÿroğlu
'Arab al-Rashayida
Tariq al-Hashimi
Nabeeh el-Mady
Tariq Al-Hashimi
Brian O'Conner
Maria da Silva
Maria Silva
Maria G. Silva
Maria McDuffy
Getúlio Dornelles Vargas
Maria das Flores
John Smith
John D'Largy
John Doe-Smith
John Doe Smith
Hector Sausage-Hausen
Mathias d'Arras
Martin Luther King Jr.
Ai Wong
Chao Chang
Alzbeta Bara
Marcos Assunção
Maria da Silva e Silva
Juscelino Kubitschek de Oliveira
Maria da Costa e Silva
Samuel Eto'o
María Antonieta de las Nieves
Eugène
Antòny de Homé April
àntony de Home ùpril
Antony de Home Aprìl
Pierre de l'Estache
Pierre de L'Estoile
Akihito
Nadine Schröder
Anna A. Møller
D. Pedro I
Pope Benedict XVI
Marsibil Ragnarsdóttir
Natanaël Morel
Isaac De la Croix
Jean-Michel Bozonnet
Qutaibah Mu'tazz Abadi
Rushd Jawna' Kassab
Khaldun Abdul-Qahhar Sabbag
'Awad Bashshar Asker
Al B. Zellweger
Gunnleif Snæ-Ulfsson
Käre Toresson
Sorli Ærnmundsson
Arnkel Øystæinsson
Ástríður Dórey
Åsmund Kåresson
Yahatti-Il
Ipqu-Annunitum
Nabu-zar-adan
Eskopas Cañaverri
Botolph of Langchester
Aelfhun the Cantrell
Fraco di Natale
Fraco Di Natale
Iván de Luca
Iván De Luca
Man'nah
Atabala Aüamusalü
Ramiz Ağasəfalu
Dadaş Aghakhanov
Fÿrxad Mübarizlı
Vaclaǔ Šupa
Yakiv Volacič
Flor Van Vaerenbergh
Flor van Vaerenbergh
Edwin van der Sar
Husein Ekmečić
Álvaro Guimarães Alencar
Phone U Yaza Arkar
Seocan MacGhille
X'wat'e Tlekadugovy
Albert-Jan Bootsveld
Maurits-jan Kuipers op den Kollenstaart
Elco ter Hoek
Robbert te Poele
Aad ten Have
'Ehu Kali
Ho'opa'a Loni
Aukanai'i Mahi'ai
Kalman ben Tal El
Żytomir Roszkowski
K'awai
==EXTRA== only if possible, strange ones
Maol-Moire Mac'IlleBhuidh
Tòmas MacIlleChruim
Aindreas MacIllEathain
Eanruig MacGilleBhreac
Peadar MacGilleDhonaghart
Maolmhuire MacGill-Eain
Eanruig MacGilleBhreac
Wim van 't Plasman
Output
Note: Shown below are only the strings that matched from the above Input
Urxan Əbűlhəsənzadə
İsmət Jafarov
Şükür Hagverdiyev
Űmid Abdurrahimov
Ġerardo Seralta
Ċikku Paris
Hind ibn Sheik
Colop-U-Uichikin
Lażżru Role
Alaksiej Taraškievič
Petruso Husoǔski
Sumu-la-El
Valeh ßlÿsgÿroğlu
'Arab al-Rashayida
Tariq al-Hashimi
Nabeeh el-Mady
Tariq Al-Hashimi
Brian O'Conner
Maria da Silva
Maria Silva
Maria G. Silva
Maria McDuffy
Getúlio Dornelles Vargas
Maria das Flores
John Smith
John D'Largy
John Doe-Smith
John Doe Smith
Hector Sausage-Hausen
Mathias d'Arras
Martin Luther King Jr.
Ai Wong
Chao Chang
Alzbeta Bara
Marcos Assunção
Maria da Silva e Silva
Juscelino Kubitschek de Oliveira
Maria da Costa e Silva
Samuel Eto'o
María Antonieta de las Nieves
Eugène
Antòny de Homé April
àntony de Home ùpril
Antony de Home Aprìl
Pierre de l'Estache
Pierre de L'Estoile
Akihito
Nadine Schröder
Anna A. Møller
D. Pedro I
Pope Benedict XVI
Marsibil Ragnarsdóttir
Natanaël Morel
Isaac De la Croix
Jean-Michel Bozonnet
Qutaibah Mu'tazz Abadi
Rushd Jawna' Kassab
Khaldun Abdul-Qahhar Sabbag
'Awad Bashshar Asker
Al B. Zellweger
Gunnleif Snæ-Ulfsson
Käre Toresson
Sorli Ærnmundsson
Arnkel Øystæinsson
Ástríður Dórey
Åsmund Kåresson
Yahatti-Il
Ipqu-Annunitum
Nabu-zar-adan
Eskopas Cañaverri
Botolph of Langchester
Aelfhun the Cantrell
Fraco di Natale
Fraco Di Natale
Iván de Luca
Iván De Luca
Man'nah
Atabala Aüamusalü
Ramiz Ağasəfalu
Dadaş Aghakhanov
Fÿrxad Mübarizlı
Vaclaǔ Šupa
Yakiv Volacič
Flor Van Vaerenbergh
Flor van Vaerenbergh
Edwin van der Sar
Husein Ekmečić
Álvaro Guimarães Alencar
Phone U Yaza Arkar
Seocan MacGhille
X'wat'e Tlekadugovy
Albert-Jan Bootsveld
Maurits-jan Kuipers op den Kollenstaart
Elco ter Hoek
Robbert te Poele
Aad ten Have
'Ehu Kali
Ho'opa'a Loni
Aukanai'i Mahi'ai
Kalman ben Tal El
Żytomir Roszkowski
K'awai
Maol-Moire Mac'IlleBhuidh
Tòmas MacIlleChruim
Aindreas MacIllEathain
Eanruig MacGilleBhreac
Peadar MacGilleDhonaghart
Maolmhuire MacGill-Eain
Eanruig MacGilleBhreac
Wim van 't Plasman
Explanation
I used a define block to create definitions. You can look at each definition to see how it works. In general, I use \p{.} where . is replaced with the name of a Unicode character category (e.g. \p{L} is any letter from any language; this is not supported by every regex flavour, but where it is available it allows the regex to be much simpler, which is why I used it).
If you need anything else explained, don't hesitate to ask me and I'll do my best, but regex101 should be able to explain anything you're wondering about regex.
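To illustrate why a Unicode letter class matters for the names that were failing, here is a small Python sketch. Python's built-in `re` module has no `\p{L}`, so the sketch uses `[^\W\d_]` (a word character that is neither a digit nor an underscore) as an approximate stand-in; that substitution and the pattern names are my assumptions, and the patterns check only "letters separated by single spaces", not the full rule set above.

```python
import re

# ASCII-only words separated by single spaces.
ASCII_NAME = re.compile(r"[A-Za-z]+(?: [A-Za-z]+)*")

# Approximate Unicode-letter words: \w minus digits and underscore.
UNICODE_NAME = re.compile(r"[^\W\d_]+(?: [^\W\d_]+)*")

names = ["John Smith", "Şükür Hagverdiyev", "Ġerardo Seralta", "John W8"]

for name in names:
    ascii_ok = ASCII_NAME.fullmatch(name) is not None
    unicode_ok = UNICODE_NAME.fullmatch(name) is not None
    print(f"{name!r}: ascii={ascii_ok} unicode={unicode_ok}")
```

The ASCII class rejects the accented names, while the Unicode-aware class accepts them and still rejects the entry containing a digit.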

how to prepare transactional dataset for association rule mining in RapidMiner?

I have a dataset like this:
abelia,fl,nc
esculentus,ct,dc,fl,il,ky,la,md,mi,ms,nc,sc,va,pr,vi
abelmoschus moschatus,hi,pr*
dataset link:
My dataset doesn't have any attribute declarations. I want to apply association rule mining to it, so I want to transform it into a format like this:
plant fl nc ct dc .....
abelia 1 1 0 0
.....
ELKI contains a parser that can read the input as is. Maybe Rapidminer does so, too - or you should write a parser for this format! With the ELKI parameters
-dbc.in /tmp/plants.data
-dbc.parser SimpleTransactionParser -parser.colsep ,
-algorithm itemsetmining.associationrules.AssociationRuleGeneration
-itemsetmining.minsupp 0.10
-associationrules.interestingness Lift
-associationrules.minmeasure 7.0
-resulthandler ResultWriter -out /tmp/rules
we can find all association rules with support >= 10%, Lift >= 7.0, and write them to the folder /tmp/rules (there is currently no visualization of association rules in ELKI):
For example, this finds the rules
sc, va, ga: 3882 --> nc, al: 3529 : 7.065536626573297
va, nj: 4036 --> md, pa: 3528 : 7.206260507764794
So plants that occur in South Carolina, Virginia, and Georgia will also occur in North Carolina and Alabama. NC is not much of a surprise, given that it lies between SC and VA, but Alabama is interesting.
The second rule says that Virginia and New Jersey imply Maryland (in between the two) and Pennsylvania. This is also a very plausible rule, supported by 3528 cases.
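For readers unfamiliar with the measures used above, here is a minimal sketch of how support, confidence and lift are conventionally defined, computed on a tiny made-up set of transactions. The data and function names are hypothetical and have nothing to do with ELKI's implementation or the plants dataset.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def lift(transactions, antecedent, consequent):
    """Confidence of antecedent -> consequent divided by support of the consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    confidence = both / support(transactions, antecedent)
    return confidence / support(transactions, consequent)


# Made-up transactions: each set lists the regions where one plant occurs.
transactions = [
    {"sc", "va", "nc"},
    {"sc", "va", "nc"},
    {"sc", "fl"},
    {"fl"},
    {"nc", "fl"},
]

# support({sc, va}) = 2/5, confidence = 1.0, support({nc}) = 3/5, so lift = 5/3.
print(lift(transactions, {"sc", "va"}, {"nc"}))
```

A lift above 1 means the consequent occurs more often together with the antecedent than it would if the two were independent.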
I did my work with this Python script:
import csv

# The first entry is the header of the name column; the rest are region codes.
abbrs = ['states', 'ab', 'ak', 'ar', 'az', 'ca', 'co', 'ct',
         'de', 'dc', 'of', 'fl', 'ga', 'hi', 'id', 'il', 'in',
         'ia', 'ks', 'ky', 'la', 'me', 'md', 'ma', 'mi', 'mn',
         'ms', 'mo', 'mt', 'ne', 'nv', 'nh', 'nj', 'nm', 'ny',
         'nc', 'nd', 'oh', 'ok', 'or', 'pa', 'pr', 'ri', 'sc',
         'sd', 'tn', 'tx', 'ut', 'vt', 'va', 'vi', 'wa', 'wv',
         'wi', 'wy', 'al', 'bc', 'mb', 'nb', 'lb', 'nf', 'nt',
         'ns', 'nu', 'on', 'qc', 'sk', 'yt']

# newline="" lets the csv module control line endings; mode "a" appends to an existing file.
with open("plants.data.txt", encoding="ISO-8859-1") as f1, \
        open("plants.data.csv", "a", newline="") as f2:
    csv_f2 = csv.writer(f2, delimiter=',')
    csv_f2.writerow(abbrs)
    csv_f1 = csv.reader(f1)
    for row in csv_f1:
        new_row = [row[0]]  # plant name
        # Skip the 'states' header entry so each row lines up with the header row.
        for abbr in abbrs[1:]:
            if abbr in row[1:]:
                new_row.append(1)
            else:
                new_row.append(0)
        csv_f2.writerow(new_row)
If all of the values are single words, you can use text mining extension in Rapidminer to transform them into variables and then run association rule mining methods on them.

Regular expression and csv | Output more readable

I have a text file that contains different news articles about terrorist attacks. Each article starts with an HTML tag (<p>Advertisement), and I would like to extract a specific piece of information from each article: the number of people wounded in the attack.
This is a sample of the text file and how the articles are separated:
[<p>Advertisement , By MILAN SCHREUER and ALISSA J. RUBIN OCT. 5, 2016
, BRUSSELS — A man wounded 2 police officers with a knife in Brussels around noon on Wednesday in what the authorities called “a potential terrorist attack.” , The two officers were attacked on the Boulevard Lambermont.....]
[<p>Advertisement ,, By KAREEM FAHIM and MOHAMAD FAHIM ABED JUNE 30, 2016
, At least 33 people were killed and 25 were injured when the Taliban bombed buses carrying police cadets on the outskirts of Kabul, Afghanistan, on Thursday. , KABUL, Afghanistan — Taliban insurgents bombed a convoy of buses carrying police cadets on the outskirts of Kabul, the Afghan capital, on Thursday, killing at least 33 people, including four civilians, according to government officials and the United Nations. , During a year...]
This is my code so far:
import re

text_open = open("News_cleaned_definitive.csv")
text_read = text_open.read()
splitted = text_read.split("<p>")
pattern = r"wounded (\d+)|(\d+) were wounded|(\d+) were injured"
for article in splitted:
    result = re.findall(pattern, article)
The output that I get is:
[]
[]
[]
[('', '40', '')]
[('', '150', '')]
[('94', '', '')]
And I would like to make the output more readable and then save it as csv file:
article_1,0
article_2,0
article_3,40
article_3,150
article_3,94
Any suggestions on how to make it more readable?
I rewrote your loop like this and merged it with the CSV writing, since you requested it:
import csv
with open("wounded.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=",")
    for i, article in enumerate(splitted):
        result = re.findall(pattern, article)
        nb_casualties = sum(int(x) for x in result[0] if x) if result else 0
        row = ["article_{}".format(i+1), nb_casualties]
        writer.writerow(row)
get the index of the article using enumerate
sum the number of victims in the first match (in case more than one group matched), using a generator expression to convert each captured string to an integer and passing the result to sum, but only if something matched (the conditional expression checks that)
create the row
print it, or optionally write it as a row (one row per iteration) with a csv.writer object.
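The loop above only looks at the first match in each article (`result[0]`). As a variation, the following self-contained sketch sums the numbers from every match and skips empty chunks produced by the split. The function name `wounded_rows` and the sample text are hypothetical; the pattern is the one from the question.

```python
import re

PATTERN = re.compile(r"wounded (\d+)|(\d+) were wounded|(\d+) were injured")


def wounded_rows(text):
    """Return (article label, total wounded) pairs, one per article."""
    # Splitting on "<p>" leaves an empty first chunk; drop blank chunks.
    articles = [chunk for chunk in text.split("<p>") if chunk.strip()]
    rows = []
    for i, article in enumerate(articles, start=1):
        # findall returns one tuple per match; only one slot per tuple is non-empty.
        total = sum(
            int(n) for groups in PATTERN.findall(article) for n in groups if n
        )
        rows.append(("article_{}".format(i), total))
    return rows


# Made-up sample text in the same shape as the file described in the question.
sample = (
    "<p>Advertisement , A man wounded 2 police officers with a knife."
    "<p>Advertisement , At least 33 people were killed and 25 were injured."
    "<p>Advertisement , No casualties were reported."
)

for label, total in wounded_rows(sample):
    print("{},{}".format(label, total))
```

The resulting pairs can be passed straight to `csv.writer(...).writerows(...)` if a file is wanted.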

Extract name from email using regex in R

I have a string, which is a chain of emails, and I need to extract the name of each sender (From :). Below is a sample of the emails:
str1 <- 'From : Wendy YEOW (SLA) To : xxxx#lt.org Subject : RE: OneService#S
From: SLA Enquiry (SLA) Sent: Friday, 5 June, 2015 5:26 PM To : xxxx#lt.org Subject : RE: OneService#S
From: Siti Zaharah RAMAN (ARKS) Sent: Friday, 5 June, 2015 5:26 PM To : xxxx#lt.org Subject : RE: OneService#S
From: SLA Enquiry (SLA) Sent: Friday, 5 June, 2015 5:26 PM To : xxxx#lt.org Subject : RE: OneService#S
From: Chin Hwang LAU (TA) Sent: Friday, 5 June, 2015 5:26 PM To : xxxx#lt.org Subject : RE: OneService#S'
I have the below code - to extract the names
str_extract_all(string=str1,pattern="\\b(From\\s*[:]+\\s*(\\w*))\\b")[[1]]
[1] "From : Wendy" "From: SLA" "From: Siti" "From: SLA" "From: Chin"
But my desired output is:
[1] "Wendy YEOW (SLA)" "SLA Enquiry (SLA)" "Siti Zaharah RAMAN (ARKS)" "SLA Enquiry (SLA)" "Chin Hwang LAU (TA)"
You can use strsplit. There's no need for gsub here.
strsplit(str1, "From ?: | (To|Sent) ?:.*?(\\nFrom ?: |$)")[[1]][-1]
# [1] "Wendy YEOW (SLA)" "SLA Enquiry (SLA)" "Siti Zaharah RAMAN (ARKS)"
# [4] "SLA Enquiry (SLA)" "Chin Hwang LAU (TA)"
The regex basically consists of two parts:
"From ?: ": This is the beginning of the string. The split returns an empty string and the rest of the original string.
" (To|Sent) ?:.*?(\\nFrom ?: |$)": This regex represents the text after the name. It includes the substring starting with "To" or "Sent" and ending with a line break ("\\n") followed by the next "From" or the end of the string ("$").
Finally, the [-1] is necessary to remove the empty string (preceding the first "From").
Try this regular expression together with strsplit():
gsub("From *: (.*?) (To|Sent).*", "\\1", strsplit(str1, "\n")[[1]])
[1] "Wendy YEOW (SLA)"
[2] "SLA Enquiry (SLA)"
[3] "Siti Zaharah RAMAN (ARKS)"
[4] "SLA Enquiry (SLA)"
[5] "Chin Hwang LAU (TA)"
This works because I am using a back reference (\\1) to extract the wildcard in the first set of parentheses.
Not very elegant, but you can try:
gsub(" *(From|To|Sent) *:? *","",regmatches(str1,gregexpr("From *:[^:]+",str1))[[1]])
#[1] "Wendy YEOW (SLA)" "SLA Enquiry (SLA)"
#[3] "Siti Zaharah RAMAN (ARKS)" "SLA Enquiry (SLA)"
#[5] "Chin Hwang LAU (TA)"
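For comparison, the same idea (capture whatever lies between "From :" and the next "To :" or "Sent:") can be sketched in Python with a single capturing group. The pattern name `SENDER` and the shortened sample are illustrative assumptions, not taken from the answers above.

```python
import re

# Lazily capture the text between "From:" / "From :" and the next "To"/"Sent" label.
SENDER = re.compile(r"From ?: (.*?) (?:To|Sent) ?:")

emails = (
    "From : Wendy YEOW (SLA) To : xxxx#lt.org Subject : RE: OneService#S\n"
    "From: SLA Enquiry (SLA) Sent: Friday, 5 June, 2015 5:26 PM To : xxxx#lt.org\n"
    "From: Siti Zaharah RAMAN (ARKS) Sent: Friday, 5 June, 2015 5:26 PM To : xxxx#lt.org\n"
)

senders = SENDER.findall(emails)
print(senders)
```

With exactly one capturing group, `findall` returns just the captured names, so no post-processing is needed.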

Split line with perl

I have multi-line credits with a few commas missing:
rendező: Joe Carnahan forgatókönyvíró: Brian Bloom, Michael Brandt, Skip Woods zeneszerző: Alan Silvestri operatőr: Mauro Fiore producer: Stephen J. Cannell, Jules Daly, Ridley Scott szereplő(k): Liam Neeson (John 'Hannibal' Smith ezredes) Bradley Cooper (Templeton 'Szépfiú' Peck hadnagy) szinkronhang: Gáti Oszkár (John 'Hannibal' (Smith magyar hangja)) Rajkai Zoltán (Templeton 'Faceman' Peck magyar hangja)
This makes it impossible to split the line on commas:
$credits (split /, */, $line):
I want to split after each comma and, where there is no comma between credits, split after the end of the preceding credit, for example:
rendező: Joe Carnahan
forgatókönyvíró: Brian Bloom
Michael Brandt
Skip Woods
zeneszerző: Alan Silvestri
operatőr: Mauro Fiore
producer: Stephen J. Cannell
Jules Daly
Ridley Scott
szereplő(k): Liam Neeson (John 'Hannibal' Smith ezredes)
Bradley Cooper (Templeton 'Szépfiú' Peck hadnagy)
szinkronhang: Gáti Oszkár (John 'Hannibal' (Smith magyar hangja))
Rajkai Zoltán (Templeton 'Faceman' Peck magyar hangja)
Thanks
So you can split by a comma-space in most cases, but otherwise by a space character preceded by a right parenthesis. This would be:
/, |(?<=\)) /
Or, perhaps (?) more clearly:
/,[[:space:]]|(?<=\))[[:space:]]/
The pipe character will make for a disjunctive match between what's on either side of it. But there's also parsing out the roles, and the entire string is full of non-ascii characters.
Script:
use strict;
use warnings;
use utf8;
use Data::Dump 'dump';
my $big_string = q/rendező: ... hangja)/;
my @credits = map {
    my ($title, $names) = /([[:alpha:]()]+): (.+)/;
    my @names = split /,[[:space:]]|(?<=\))[[:space:]]/, $names;
    my $credit = { $title => \@names };
} split / (?=[[:alpha:]()]+:)/, $big_string;
binmode STDOUT, ':utf8';
print dump \@credits;
Output:
[
{ rendező => ["Joe Carnahan"] },
{
forgatókönyvíró => ["Brian Bloom", "Michael Brandt", "Skip Woods"],
},
{ zeneszerző => ["Alan Silvestri"] },
{ operatőr => ["Mauro Fiore"] },
{
producer => ["Stephen J. Cannell", "Jules Daly", "Ridley Scott"],
},
{
"szerepl\x{151}(k)" => [
"Liam Neeson (John 'Hannibal' Smith ezredes)",
"Bradley Cooper (Templeton 'Sz\xE9pfi\xFA' Peck hadnagy)",
],
},
{
szinkronhang => [
"G\xE1ti Oszk\xE1r (John 'Hannibal' (Smith magyar hangja))",
"Rajkai Zolt\xE1n (Templeton 'Faceman' Peck magyar hangja)",
],
},
]
Notes:
An array of hashrefs is used to preserve the order of the list.
The utf8 pragma will make the [:alpha:] construct utf8-aware.
Given Perl >= v5.10, the utf8::all pragma can replace utf8 and also removes the need to call binmode prior to output.
Lookarounds ((?=), (?<=), etc.) can be tricky; see perlre and this guide for good information on them.
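The two-level split used in the script (first on a space that precedes a "title:", then on a comma-space or a space preceded by a right parenthesis) can also be sketched in Python. The names `TITLE_SPLIT`, `NAME_SPLIT` and `parse_credits` are hypothetical, the input is a shortened version of the credits line, and `\w` stands in for Perl's `[[:alpha:]]`.

```python
import re

# A space followed by "word:" (word may contain parentheses) starts a new credit.
TITLE_SPLIT = re.compile(r" (?=[\w()]+:)")
# Names are separated by ", " or by a space that follows ")".
NAME_SPLIT = re.compile(r", |(?<=\)) ")


def parse_credits(line):
    """Return an ordered list of (title, [names]) pairs."""
    credits = []
    for chunk in TITLE_SPLIT.split(line):
        title, names = chunk.split(": ", 1)
        credits.append((title, NAME_SPLIT.split(names)))
    return credits


line = (
    "rendező: Joe Carnahan "
    "forgatókönyvíró: Brian Bloom, Michael Brandt, Skip Woods "
    "szereplő(k): Liam Neeson (John 'Hannibal' Smith ezredes) "
    "Bradley Cooper (Templeton 'Szépfiú' Peck hadnagy)"
)

for title, names in parse_credits(line):
    print(title, names)
```

A list of pairs is used instead of a dict for the same reason the Perl script uses an array of hashrefs: it keeps the credits in their original order.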
I think you can handle this with regular expressions.
You can substitute any 'word:' with '\nword:'
and, in the same way, substitute ', ' with ',\n'.
For an introduction to regular expressions, see this page:
http://www.troubleshooters.com/codecorn/littperl/perlreg.htm
The two rules should look something like this:
(my $newstr = $str) =~ s/(\S+:)/\n$1/g;
$newstr =~ s/, */,\n/g;
Treat this as a rough guess; I'm not very familiar with Perl syntax.