Parse Wikipedia Infobox with Go? - regex

I am trying to parse the Infobox for some Wikipedia articles and cannot figure it out. I have downloaded the files, and for Albert Einstein my attempt to parse the Infobox looks like this:
package main
import (
"log"
"regexp"
)
func main() {
st := `{{redirect|Einstein|other uses|Albert Einstein (disambiguation)|and|Einstein (disambiguation)}}
{{pp-semi-indef}}
{{pp-move-indef}}
{{Good article}}
{{Infobox scientist
| name = Albert Einstein
| image = Einstein 1921 by F Schmutzer - restoration.jpg
| caption = Albert Einstein in 1921
| birth_date = {{Birth date|df=yes|1879|3|14}}
| birth_place = [[Ulm]], [[Kingdom of Württemberg]], [[German Empire]]
| death_date = {{Death date and age|df=yes|1955|4|18|1879|3|14}}
| death_place = {{nowrap|[[Princeton, New Jersey]], U.S.}}
| children = [[Lieserl Einstein|"Lieserl"]] (1902–1903?)<br />[[Hans Albert Einstein|Hans Albert]] (1904–1973)<br />[[Eduard Einstein|Eduard "Tete"]] (1910–1965)
| spouse = [[Mileva Marić]] (1903–1919)<br />{{nowrap|[[Elsa Löwenthal]] (1919–1936)}}
| residence = Germany, Italy, Switzerland, Austria (today: [[Czech Republic]]), Belgium, United States
| citizenship = {{Plainlist|
* [[Kingdom of Württemberg]] (1879–1896)
* [[Statelessness|Stateless]] (1896–1901)
* [[Switzerland]] (1901–1955)
* Austria of the [[Austro-Hungarian Empire]] (1911–1912)
* Germany (1914–1933)
* United States (1940–1955)
}}
| ethnicity = Jewish
| fields = [[Physics]], [[philosophy]]
| workplaces = {{Plainlist|
* [[Swiss Patent Office]] ([[Bern]]) (1902–1909)
* [[University of Bern]] (1908–1909)
* [[University of Zurich]] (1909–1911)
* [[Karl-Ferdinands-Universität|Charles University in Prague]] (1911–1912)
* [[ETH Zurich]] (1912–1914)
* [[Prussian Academy of Sciences]] (1914–1933)
* [[Humboldt University of Berlin]] (1914–1917)
* [[Kaiser Wilhelm Institute]] (director, 1917–1933)
* [[German Physical Society]] (president, 1916–1918)
* [[Leiden University]] (visits, 1920–)
* [[Institute for Advanced Study]] (1933–1955)
* [[Caltech]] (visits, 1931–1933)
}}
| alma_mater = {{Plainlist|
* [[ETH Zurich|Swiss Federal Polytechnic]] (1896–1900; B.A., 1900)
* [[University of Zurich]] (Ph.D., 1905)
}}
| doctoral_advisor = [[Alfred Kleiner]]
| thesis_title = Eine neue Bestimmung der Moleküldimensionen (A New Determination of Molecular Dimensions)
| thesis_url = http://e-collection.library.ethz.ch/eserv/eth:30378/eth-30378-01.pdf
| thesis_year = 1905
| academic_advisors = [[Heinrich Friedrich Weber]]
| influenced = {{Plainlist|
* [[Ernst G. Straus]]
* [[Nathan Rosen]]
* [[Leó Szilárd]]
}}
| known_for = {{Plainlist|
* [[General relativity]] and [[special relativity]]
* [[Photoelectric effect]]
* ''[[Mass–energy equivalence|E=mc<sup>2</sup>]]''
* Theory of [[Brownian motion]]
* [[Einstein field equations]]
* [[Bose–Einstein statistics]]
* [[Bose–Einstein condensate]]
* [[Gravitational wave]]
* [[Cosmological constant]]
* [[Classical unified field theories|Unified field theory]]
* [[EPR paradox]]
}}
| awards = {{Plainlist|
* [[Barnard Medal for Meritorious Service to Science|Barnard Medal]] (1920)
* [[Nobel Prize in Physics]] (1921)
* [[Matteucci Medal]] (1921)
* [[ForMemRS]] (1921)<ref name="frs" />
* [[Copley Medal]] (1925)<ref name="frs" />
* [[Max Planck Medal]] (1929)
* [[Time 100: The Most Important People of the Century|''Time'' Person of the Century]] (1999)
}}
| signature = Albert Einstein signature 1934.svg
}}
'''Albert Einstein''' ({{IPAc-en|ˈ|aɪ|n|s|t|aɪ|n}};<ref>{{cite book|last=Wells|first=John|authorlink=John C. Wells|title=Longman Pronunciation Dictionary|publisher=Pearson Longman|edition=3rd|date=April 3, 2008|isbn=1-4058-8118-6}}</ref> {{IPA-de|ˈalbɛɐ̯t ˈaɪnʃtaɪn|lang|Albert Einstein german.ogg}}; 14 March 1879 – 18 April 1955) was a German-born<!-- Please do not change this—see talk page and its many archives.-->
[[theoretical physicist]]. He developed the [[general theory of relativity]], one of the two pillars of [[modern physics]] (alongside [[quantum mechanics]]).<ref name=frs>{{cite journal | last1 = Whittaker | first1 = E. | authorlink = E. T. Whittaker| doi = 10.1098/rsbm.1955.0005 | title = Albert Einstein. 1879–1955 | journal = [[Biographical Memoirs of Fellows of the Royal Society]] | volume = 1 | pages = 37–67 | date = 1 November 1955| jstor = 769242}}</ref><ref name="YangHamilton2010">{{cite book|author1=Fujia Yang|author2=Joseph H. Hamilton|title=Modern Atomic and Nuclear Physics|date=2010|publisher=World Scientific|isbn=978-981-4277-16-7}}</ref>{{rp|274}} Einstein's work is also known for its influence on the [[philosophy of science]].<ref>{{Citation |title=Einstein's Philosophy of Science |url=http://plato.stanford.edu/entries/einstein-philscience/#IntWasEinEpiOpp |we......
`
re := regexp.MustCompile(`{{Infobox(?s:.*?)}}`)
log.Println(re.FindAllStringSubmatch(st, -1))
}
I am trying to put each of the items from the infobox into a struct or a map:
m["name"] = "Albert Einstein"
m["image"] = "Einstein...."
...
...
m["death_date"] = "{{Death date and age|df=yes|1955|4|18|1879|3|14}}"
...
...
I can't even seem to isolate the infobox. I get:
[[{{Infobox scientist
| name = Albert Einstein
| image = Einstein 1921 by F Schmutzer - restoration.jpg
| caption = Albert Einstein in 1921
| birth_date = {{Birth date|df=yes|1879|3|14}}]]
The Albert Einstein entry in the API can be found at:
https://en.wikipedia.org/w/api.php?action=query&titles=Albert%20Einstein&prop=revisions&rvprop=content&format=json
EDIT:
Based on the accepted answer to this question, I tried the following regex:
(?=\{Infobox)(\{([^{}]|(?1))*\})
but get:
panic: regexp: Compile(`(?=\{Infobox)(\{([^{}]|(?1))*\})`): error parsing regexp: invalid or unsupported Perl syntax: `(?=`
EDIT #2:
If there's a way to extract the information via their API then I'll take that....I've been reading through the docs and can't find it.

I made a regex that might work for you:
^\s*\|\s*([^\s]+)\s*=\s*(\{\{Plainlist\|(?:\n\s*\*.*)*|.*)
Explanation
This part: ^\s*\|\s*([^\s]+)\s*=\s* matches the start of lines like:
| <the_label> =
Continuing on the same line, this part: (\{\{Plainlist\|(?:\n\s*\*.*)*|.*) will match lists:
{{Plainlist|
* [[Ernst G. Straus]]
* [[Nathan Rosen]]
* [[Leó Szilárd]]
(Note that it may omit the final }}. Oh well.)
If there is no list, it matches until the end of the line.
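Go's regexp package is built on RE2, which supports neither lookahead nor recursive patterns, so balanced braces cannot be matched with a single expression there. (If you use the regex above in Go, note that ^ only matches at the start of each line when the (?m) flag is set.) As an alternative, here is a minimal sketch that skips regex entirely: it counts {{ and }} pairs to isolate the infobox, then splits fields on the | characters that are not nested inside a template or link. The function names extractInfobox and parseFields are made up for this example, and the sketch assumes the page contains no {{{triple-brace}}} parameters or unbalanced markup.

```go
package main

import (
	"fmt"
	"strings"
)

// extractInfobox returns the text of the first {{Infobox ...}} template,
// found by counting "{{" and "}}" pairs so nested templates stay inside it.
func extractInfobox(text string) (string, bool) {
	start := strings.Index(text, "{{Infobox")
	if start < 0 {
		return "", false
	}
	depth := 0
	for i := start; i < len(text)-1; i++ {
		switch text[i : i+2] {
		case "{{":
			depth++
			i++ // skip the second brace of the pair
		case "}}":
			depth--
			i++
			if depth == 0 {
				return text[start : i+1], true
			}
		}
	}
	return "", false // braces never balanced
}

// parseFields splits the infobox body on "|" separators that are not nested
// inside {{...}} or [[...]], then splits each field on its first "=".
func parseFields(infobox string) map[string]string {
	fields := make(map[string]string)
	body := strings.TrimSuffix(strings.TrimPrefix(infobox, "{{"), "}}")
	depth, last := 0, 0
	var parts []string
	for i := 0; i < len(body); i++ {
		if i+1 < len(body) {
			switch body[i : i+2] {
			case "{{", "[[":
				depth++
				i++
				continue
			case "}}", "]]":
				depth--
				i++
				continue
			}
		}
		if body[i] == '|' && depth == 0 {
			parts = append(parts, body[last:i])
			last = i + 1
		}
	}
	parts = append(parts, body[last:])
	for _, p := range parts[1:] { // parts[0] is the template name
		kv := strings.SplitN(p, "=", 2)
		if len(kv) == 2 {
			fields[strings.TrimSpace(kv[0])] = strings.TrimSpace(kv[1])
		}
	}
	return fields
}

func main() {
	// Shortened stand-in for the article text in the question.
	page := "{{Good article}}\n{{Infobox scientist\n| name = Albert Einstein\n" +
		"| birth_date = {{Birth date|df=yes|1879|3|14}}\n" +
		"| fields = [[Physics]], [[philosophy]]\n}}\n'''Albert Einstein''' was ..."
	box, ok := extractInfobox(page)
	if !ok {
		fmt.Println("no infobox found")
		return
	}
	m := parseFields(box)
	fmt.Println(m["name"])
	fmt.Println(m["birth_date"])
}
```

Values are kept as raw wikitext, so multi-line entries such as the {{Plainlist|...}} blocks come back whole, including the closing }}.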

Related

Convert list to dataframe and then join with different dataframe in pyspark

I am working with pyspark dataframes.
I have a list of date type values:
date_list = ['2018-01-19', '2018-01-20', '2018-01-17']
Also I have a dataframe (mean_df) that has only one column (mean).
+----+
|mean|
+----+
|67 |
|78 |
|98 |
+----+
Now I want to convert date_list into a column and join with mean_df:
expected output:
+------------+----+
|dates |mean|
+------------+----+
|2018-01-19 | 67|
|2018-01-20 | 78|
|2018-01-17 | 98|
+------------+----+
I tried converting list to dataframe (date_df) :
date_df = spark.createDataFrame([(l,) for l in date_list], ['dates'])
and then used monotonically_increasing_id() with new column name "idx" for both date_df and mean_df and used join :
date_df = mean_df.join(date_df, mean_df.idx == date_df.idx).drop("idx")
I get a timeout-exceeded error, so I raised broadcastTimeout from its default of 300s to 6000s:
spark.conf.set("spark.sql.broadcastTimeout", 6000)
That did not help either. I am working with a very small sample of data right now; the actual data is much larger.
Snippet of code:
date_list = ['2018-01-19', '2018-01-20', '2018-01-17']
mean_list = []
for d in date_list:
    h2_df1, h2_df2 = hypo_2(h2_df, d, 2)
    mean1 = h2_df1.select(_mean(col('count_before')).alias('mean_before'))
    mean_list.append(mean1)
mean_df = reduce(DataFrame.unionAll, mean_list)
You can use withColumn and lit to add the date to the dataframe:
from functools import reduce

import pyspark.sql.functions as F
from pyspark.sql import DataFrame

date_list = ['2018-01-19', '2018-01-20', '2018-01-17']
mean_list = []
for d in date_list:
    h2_df1, h2_df2 = hypo_2(h2_df, d, 2)
    # Tag each one-row mean with the date it was computed for
    mean1 = h2_df1.select(F.mean(F.col('count_before')).alias('mean_before')).withColumn('date', F.lit(d))
    mean_list.append(mean1)
mean_df = reduce(DataFrame.unionAll, mean_list)

PySpark using Regexp_extract and Col to Create Dataset

I need help creating a dataset that shows the first name and last name of people who live in Texas, along with the area code of their phone numbers (phone1). Below are the code I tried and the dataset I was given.
from pyspark.sql.functions import regexp_extract, col
regexp_extract(col('first_name + last_name'), '.by\s+(\w+)', 1))
first_name last_name company_name address city county state zip phone1
Billy Thornton Qdoba 8142 Yougla Road Dallas Fort Worth TX 34218 689-956-0765
Joe Swanson Beachfront 9243 Trace Street Miami Dade FL 56432 890-780-9674
Kevin Knox MSG 7683 Brooklyn Ave New York New York NY 56987 850-342-1123
Bill Lamb AFT 6394 W Beast Dr Houston Galveston TX 32804 407-413-4842
Raylene Kampa Hermar Inc 2046 SW Nylin Rd Elkhart Elkhart IN 46514 574-499-1454
Now I see. Your phone numbers have a consistent format, so you can simply split them on the hyphen and take the first piece.
df.show()
+----------+---------+------------+-----------------+--------+----------+-----+-----+------------+
|first_name|last_name|company_name| address| city| county|state| zip| phone1|
+----------+---------+------------+-----------------+--------+----------+-----+-----+------------+
| Billy| Thornton| Qdoba| 8142 Yougla Road| Dallas|Fort Worth| TX|34218|689-956-0765|
| Joe| Swanson| Beachfront|9243 Trace Street| Miami| Dade| FL|56432|890-780-9674|
| Kevin| Knox| MSG|7683 Brooklyn Ave|New York| New York| NY|56987|850-342-1123|
| Bill| Lamb| AFT| 6394 W Beast Dr| Houston| Galveston| TX|32804|407-413-4842|
| Raylene| Kampa| Hermar Inc| 2046 SW Nylin Rd| Elkhart| Elkhart| IN|46514|574-499-1454|
+----------+---------+------------+-----------------+--------+----------+-----+-----+------------+
from pyspark.sql.functions import split
df.filter("state = 'TX'") \
    .withColumn('area_code', split('phone1', '-')[0]) \
    .select('first_name', 'last_name', 'state', 'area_code') \
    .show()
+----------+---------+-----+---------+
|first_name|last_name|state|area_code|
+----------+---------+-----+---------+
| Billy| Thornton| TX| 689|
| Bill| Lamb| TX| 407|
+----------+---------+-----+---------+

How to extract components of a disorganized string variable in Stata?

I have a text variable showing patient prescription that looks quite messy like this:
PatientRx
ACETAZOLAMIDE 250MG TABLET- 100
ADAPALENE + BENZOYL 0.1% + 2.5% GEL-..
ADRENALINE/EPIPEN 300MCG/0.3ML INJ..
ALENDRONATE + COLECA 70MG + 140MCG TA..
ALLOPURINOL 100MG TABLET- 100
ALUM HYDROX + MAG HY 250+120+120MG/5M..
AMILORIDE + HYDROCHL 5MG + 50MG HCL T..
I haven't looked through all the values, but some patterns emerge:
Often there is more than one drug, and they are separated, for example, by a space and a forward slash.
Drugs are also separated by a plus sign, but the plus sign is also used between doses.
Spacing is arbitrary, both at the beginning and in the middle of an entry.
How can I extract only the names of the drugs into new variables? New variables should look like this:
Newvar1 Newvar2
ACETAZOLAMIDE
ADAPALENE BENZOYL
ADRENALINE EPIPEN
ALENDRONATE COLECA
and so on.
Some would reach first for regular expressions, which you might indeed need for the full problem. In addition note moss as installed by ssc install moss.
But it seems easiest, given the information in the example here, which is all we have to go on, to look for the position of the first numeric digit 0 to 9 and then parse what goes before. I don't know whether drug names ever contain numeric digits.
clear
input str40 sandbox
" ACETAZOLAMIDE 250MG TABLET- 100"
"ADAPALENE + BENZOYL 0.1% + 2.5% GEL-"
" ADRENALINE/EPIPEN 300MCG/0.3ML INJ"
"ALENDRONATE + COLECA 70MG + 140MCG TA"
" ALLOPURINOL 100MG TABLET- 100"
"ALUM HYDROX + MAG HY 250+120+120MG/5M"
" AMILORIDE + HYDROCHL 5MG + 50MG HCL T"
end
gen wherenum = .
quietly forval j = 0/9 {
    replace wherenum = min(wherenum, strpos(sandbox, "`j'")) if strpos(sandbox, "`j'")
}
gen drug = substr(sandbox, 1, wherenum - 1)
split drug, parse(+ /)
l drug?, sep(0)
+---------------------------+
| drug1 drug2 |
|---------------------------|
1. | ACETAZOLAMIDE |
2. | ADAPALENE BENZOYL |
3. | ADRENALINE EPIPEN |
4. | ALENDRONATE COLECA |
5. | ALLOPURINOL |
6. | ALUM HYDROX MAG HY |
7. | AMILORIDE HYDROCHL |
+---------------------------+
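For readers who want the logic outside Stata, the same idea (cut at the first digit, then split what precedes it on + or /) can be sketched in Go. The function name drugNames is invented for this illustration. Unlike the Stata code, it trims spaces around each name and keeps the whole string when no digit is present; both choices are assumptions rather than part of the answer above.

```go
package main

import (
	"fmt"
	"strings"
)

// drugNames returns the drug names that appear before the first digit,
// split on "+" or "/" and trimmed of surrounding spaces.
func drugNames(rx string) []string {
	prefix := rx
	if i := strings.IndexAny(rx, "0123456789"); i >= 0 {
		prefix = rx[:i] // everything before the first dose figure
	}
	isSep := func(r rune) bool { return r == '+' || r == '/' }
	var names []string
	for _, part := range strings.FieldsFunc(prefix, isSep) {
		if name := strings.TrimSpace(part); name != "" {
			names = append(names, name)
		}
	}
	return names
}

func main() {
	fmt.Printf("%q\n", drugNames(" ADRENALINE/EPIPEN 300MCG/0.3ML INJ"))
	fmt.Printf("%q\n", drugNames("ALUM HYDROX + MAG HY 250+120+120MG/5M"))
}
```

As with the Stata version, this breaks down if a drug name itself contains a digit.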

How to refactor this code to make it cleaner (Ruby on Rails)?

I want to refactor my code in ruby on rails.
In Order.rb I have:
def self.filter_price range
case range.to_sym
when :highest
self.where("price > 10000")
when :higher
self.where(price: 5001..10000)
when :high
self.where(price: 1001..5000)
when :low
self.where(price: 501..1000)
when :lower
self.where(price: 1..500)
when :lowest
self.where(price: [0,nil])
else
self
end
end
In views, I have this slim html:
- price_range = [ [0,nil,"lowest"], [1,500,"lower"], [501,1000,"low"], [1001,5000,"high"], [5001,10000,"higher"], [10000,">","highest"] ]
- (0..5).each do |i|
tr
- if i == 0
th= "#{i}"
- else
th= "#{price_range[i][0]} - #{price_range[i][1]}"
td.text-right= Order.filter_price(price_range[i][2]).count
span.divider
= with_unit (Order.filter_price(price_range[i][2]).count.to_f / Order.count.to_f * 100.0).to_i, "%"
td.text-right
= Order.filter_price(price_range[i][2]).select{|o| o.replied?}.count
span.divider
= with_unit (Order.filter_price(price_range[i][2]).select{|o| o.replied?}.count.to_f / Order.all.select{|o| o.replied?}.count.to_f * 100.0).to_i, "%"
span.divider
= with_unit (Order.filter_price(price_range[i][2]).select{|o| o.replied?}.count.to_f / Order.filter_price(price_range[i][2]).count.to_f * 100.0).to_i, "%"
td.text-right
= Order.paid.filter_price(price_range[i][2]).count
span.divider
= with_unit (Order.paid.filter_price(price_range[i][2]).count.to_f / Order.paid.count.to_f * 100.0).to_i, "%"
span.margin
How can I remove price_range array to make the code cleaner and still get the same output result?
Can anyone help me with this, thanks in advance.
Here is what you will see in the view:
| Price | request | reply | Paid |
----------------------------------------------
| 0 | 68 | 19/15%/27% | 5/6% |
---------------------------------------------
|1 - 500 | 19 | .... | .... |
----------------------------------------------
|.... | .... | .... | .... |
The html code above is for looping through each row.
You can define price_range as a hash:
price_range = {lowest: '0', lower: '1 - 500', low: '501 - 1000', high: '1001 - 5000', higher: '5001 - 10000', highest: '10000 >'}
You can then call Order.filter_price(price_range.keys[i]) in a loop; for example, price_range.keys[0] is :lowest.
th= "#{price_range[i][0]} - #{price_range[i][1]}"
could be then written as
th= "#{price_range.values[i]}" # e.g. for i = 1, price_range.values[1] is "1 - 500"
You could also modify order.rb to return both
[Order.filter_price(price_range[i][2]).count, Order.filter_price(price_range[i][2]).select{|o| o.replied?}] at the same time. This will further clean up your code.

DataArray case-insensitive match that returns the index value of the match

I have a DataFrame inside of a function:
using DataFrames
myservs = DataFrame(serverName = ["elmo", "bigBird", "Oscar", "gRover", "BERT"],
ipAddress = ["12.345.6.7", "12.345.6.8", "12.345.6.9", "12.345.6.10", "12.345.6.11"])
myservs
5x2 DataFrame
| Row | serverName | ipAddress |
|-----|------------|---------------|
| 1 | "elmo" | "12.345.6.7" |
| 2 | "bigBird" | "12.345.6.8" |
| 3 | "Oscar" | "12.345.6.9" |
| 4 | "gRover" | "12.345.6.10" |
| 5 | "BERT" | "12.345.6.11" |
How can I write the function to take a single parameter called server, case-insensitive match the server parameter in the myservs[:serverName] DataArray, and return the match's corresponding ipAddress?
In R this can be done by using
myservs$ipAddress[grep("server", myservs$serverName, ignore.case = T)]
I don't want it to matter if someone uses ElMo or Elmo as the server, or if the serverName is saved as elmo or ELMO.
I referenced how to accomplish the task in R and tried to do it using the DataFrames pkg, but I only did this because I'm coming from R and am just learning Julia. I asked a lot of questions from coworkers and the following is what we came up with:
This task is much cleaner if I was to stop thinking in terms of vectors in R. Julia runs plenty fast iterating through a loop.
Even still, looping wouldn't be the best solution here. I was told to look into Dicts (check here for an example). Dict(), zip(), haskey(), and get() blew my mind. These have many applications.
My solution doesn't even need to use the DataFrames pkg, but instead uses Julia's Matrix and Array data representations. By using let we keep the global environment clutter free and the server name/ip list stays hidden from view to those who are only running the function.
In the sample code, I'm recreating the server matrix every time, but in reality/practice I'll have a permission restricted delimited file that gets read every time. This is OK for now since the delimited files are small, but this may not be efficient or the best way to do it.
# ONLY ALLOW THE FUNCTION TO BE SEEN IN THE GLOBAL ENVIRONMENT
let global myIP
# SERVER MATRIX
myservers = ["elmo" "12.345.6.7"; "bigBird" "12.345.6.8";
"Oscar" "12.345.6.9"; "gRover" "12.345.6.10";
"BERT" "12.345.6.11"]
# SERVER DICT
servDict = Dict(zip(pmap(lowercase, myservers[:, 1]), myservers[:, 2]))
# GET SERVER IP FUNCTION: INPUT = SERVER NAME; OUTPUT = IP ADDRESS
function myIP(servername)
sn = lowercase(servername)
get(servDict, sn, "That name isn't in the server list.")
end
end
# Test it out
myIP("SLIMEY")
#> "That name isn't in the server list."
myIP("elMo")
#> "12.345.6.7"
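The Dict approach above is not specific to Julia: normalise the key once when building the table and again on every lookup. As a rough cross-language illustration, here is the same pattern in Go. The names newLookup and ipFor are hypothetical, and the second return value reports a miss in place of the message string used above.

```go
package main

import (
	"fmt"
	"strings"
)

// newLookup builds a map keyed by lower-cased server name.
func newLookup(servers [][2]string) map[string]string {
	byName := make(map[string]string, len(servers))
	for _, s := range servers {
		byName[strings.ToLower(s[0])] = s[1]
	}
	return byName
}

// ipFor lower-cases the query before looking it up; ok reports a match.
func ipFor(byName map[string]string, server string) (ip string, ok bool) {
	ip, ok = byName[strings.ToLower(server)]
	return ip, ok
}

func main() {
	servers := newLookup([][2]string{
		{"elmo", "12.345.6.7"},
		{"bigBird", "12.345.6.8"},
		{"BERT", "12.345.6.11"},
	})
	if ip, ok := ipFor(servers, "elMo"); ok {
		fmt.Println(ip)
	}
	if _, ok := ipFor(servers, "SLIMEY"); !ok {
		fmt.Println("That name isn't in the server list.")
	}
}
```

Because both sides are lower-cased, it no longer matters whether the stored name is elmo or ELMO, or whether the caller types ElMo.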
Here's one way:
julia> using DataFrames
julia> myservs = DataFrame(serverName = ["elmo", "bigBird", "Oscar", "gRover", "BERT"],
ipAddress = ["12.345.6.7", "12.345.6.8", "12.345.6.9", "12.345.6.10", "12.345.6.11"])
5x2 DataFrames.DataFrame
| Row | serverName | ipAddress |
|-----|------------|---------------|
| 1 | "elmo" | "12.345.6.7" |
| 2 | "bigBird" | "12.345.6.8" |
| 3 | "Oscar" | "12.345.6.9" |
| 4 | "gRover" | "12.345.6.10" |
| 5 | "BERT" | "12.345.6.11" |
julia> grep{T <: String}(pat::String, dat::DataArray{T}, opts::String = "") = Bool[isna(d) ? false : ismatch(Regex(pat, opts), d) for d in dat]
grep (generic function with 2 methods)
julia> myservs[:ipAddress][grep("bigbird", myservs[:serverName], "i")]
1-element DataArrays.DataArray{ASCIIString,1}:
"12.345.6.8"
EDIT
This grep works faster on my platform.
julia> function grep{T <: String}(pat::String, dat::DataArray{T}, opts::String = "")
myreg = Regex(pat, opts)
return convert(Array{Bool}, map(d -> isna(d) ? false : ismatch(myreg, d), dat))
end