PySpark using Regexp_extract and Col to Create Dataset - regex

I need help creating a dataset that shows the first name and last name of people who live in Texas, along with the area code of their phone number (phone1). Below are the code I tried and the dataset I was given.
from pyspark.sql.functions import regexp_extract, col
regexp_extract(col('first_name + last_name'), '.by\s+(\w+)', 1))
first_name last_name company_name address city county state zip phone1
Billy Thornton Qdoba 8142 Yougla Road Dallas Fort Worth TX 34218 689-956-0765
Joe Swanson Beachfront 9243 Trace Street Miami Dade FL 56432 890-780-9674
Kevin Knox MSG 7683 Brooklyn Ave New York New York NY 56987 850-342-1123
Bill Lamb AFT 6394 W Beast Dr Houston Galveston TX 32804 407-413-4842
Raylene Kampa Hermar Inc 2046 SW Nylin Rd Elkhart Elkhart IN 46514 574-499-1454

Now I see. Your phone numbers are consistently formatted with hyphens, so you can split each one on "-" and take the first element as the area code.
df.show()
+----------+---------+------------+-----------------+--------+----------+-----+-----+------------+
|first_name|last_name|company_name| address| city| county|state| zip| phone1|
+----------+---------+------------+-----------------+--------+----------+-----+-----+------------+
| Billy| Thornton| Qdoba| 8142 Yougla Road| Dallas|Fort Worth| TX|34218|689-956-0765|
| Joe| Swanson| Beachfront|9243 Trace Street| Miami| Dade| FL|56432|890-780-9674|
| Kevin| Knox| MSG|7683 Brooklyn Ave|New York| New York| NY|56987|850-342-1123|
| Bill| Lamb| AFT| 6394 W Beast Dr| Houston| Galveston| TX|32804|407-413-4842|
| Raylene| Kampa| Hermar Inc| 2046 SW Nylin Rd| Elkhart| Elkhart| IN|46514|574-499-1454|
+----------+---------+------------+-----------------+--------+----------+-----+-----+------------+
from pyspark.sql.functions import split
df.filter("state = 'TX'") \
  .withColumn('area_code', split('phone1', "-")[0]) \
  .select('first_name', 'last_name', 'state', 'area_code') \
  .show()
+----------+---------+-----+---------+
|first_name|last_name|state|area_code|
+----------+---------+-----+---------+
| Billy| Thornton| TX| 689|
| Bill| Lamb| TX| 407|
+----------+---------+-----+---------+
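Since the question asked about regexp_extract, here is a minimal sketch of the regex route, written in plain Python with the re module so it runs without a Spark session. The pattern ^(\d{3})- is my own assumption about the phone format (three digits followed by a hyphen), and the helper name area_code is hypothetical. In PySpark the same pattern could be passed as regexp_extract(col('phone1'), r'^(\d{3})-', 1), which also returns an empty string when nothing matches.

```python
import re

# Assumed phone format: three digits, a hyphen, then the rest of the number.
AREA_CODE = re.compile(r'^(\d{3})-')

def area_code(phone):
    """Return the leading three-digit area code, or '' if the format differs."""
    m = AREA_CODE.match(phone)
    return m.group(1) if m else ''

# A few rows from the sample data: (first_name, last_name, state, phone1)
rows = [
    ("Billy", "Thornton", "TX", "689-956-0765"),
    ("Joe", "Swanson", "FL", "890-780-9674"),
    ("Bill", "Lamb", "TX", "407-413-4842"),
]

# Keep only Texas residents and attach the extracted area code.
texans = [(first, last, area_code(phone))
          for first, last, state, phone in rows if state == "TX"]
print(texans)
```

For data this regular, split is simpler; the regex version is mainly useful if some numbers may be malformed and should yield an empty value instead of a wrong one.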

Select value in column that matches value in list (UPDATED FOR CLARITY)

If I have a column of street addresses and want to select only the address's directional, what syntax would I use to accomplish that in Excel Power Query?
For instance, how do I get "NE" from "357 Pyrite Dr NE" even if the address is incorrectly formatted as "357 NE Pyrite Dr" or "357 Pyrite NE Dr"? Likewise, how would I get "NW" from "506 Mark NW St"?
As far as I can figure out, I would hit add column > custom column and enter a syntax similar to the following...
= if List.ContainsAny([Address], {"NE", "NW", "SE", "SW"}) = TRUE then Text.Select([Address], {"NE", "NW", "SE", "SW"} else null
...except I know that's not the correct syntax since it always produces an error. The same thing happens when I replace "Text.Select" with "List.Select" in the above formula.
For greater clarification, I'm posting the query as it stands now, whittled down to one column from a table with 100 columns and 4000 rows:
let
Source = q_NMAACC,
#"Removed Other Columns" = Table.SelectColumns(Source,{"Address - Street 1", "Address - Street 2"}),
#"Merged Columns" = Table.CombineColumns(#"Removed Other Columns",{"Address - Street 1", "Address - Street 2"},Combiner.CombineTextByDelimiter(" ", QuoteStyle.None),"Street Address"),
#"Trimmed Text" = Table.TransformColumns(#"Merged Columns",{{"Street Address", Text.Trim, type text}}),
#"Filtered Rows" = Table.SelectRows(#"Trimmed Text", each [Street Address] <> null and [Street Address] <> "")
in
#"Filtered Rows"
Here are the first 25 rows to give you some data to work off.
Street Address
PO Box 3416 Nr57 #165a
1016 Copper NE Ave Apt C
217 Garcia St NE
232 17th St SE
560 60th St NW
2935 Madeira Dr NE
9677 Eagle Ranch Rd NW Apt 415
5320 Roanoke Ave NW
17 Hwy 304
HCR 79 Box 46
6524 Camino Rojo
3518 Vail Ave SE
6412 Torreon Dr NE
6136 Flor de Rio Ct NW
1712 36th Street SE
734 Columbia Street
716 Morning Meadows Dr NE
6601 Tennyson St NE Apt 10207
Alamo - Rio Salado PO Box 804
206 Aragon Rd
6901 Verano Ct NW
6709 Siesta Pl NE
10 Meadow Hills Loop
98 Avenida Jardin
6903 Prairie Rd NE Apt 216
Try
let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
List={"NE","NW","SW","SE"},
LocateTable = Table.FromList(List, null, {"Locate"}),
Find = Table.AddColumn(Source, "Found", (x) => Text.Combine(Table.SelectRows(LocateTable, each Text.Contains(x[Address],[Locate], Comparer.OrdinalIgnoreCase))[Locate],", "))
in Find
You could also keep the search criteria in a separate table. Here LocateTable is assumed to be another query with a single column named Locate:
let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
Find = Table.AddColumn(Source, "Found", (x) => Text.Combine(Table.SelectRows(LocateTable, each Text.Contains(x[Address],[Locate], Comparer.OrdinalIgnoreCase))[Locate],", "))
in Find
The ", Comparer.OrdinalIgnoreCase" argument makes the comparison case-insensitive; remove it if you want a case-sensitive match.
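One caveat with a substring test such as Text.Contains is that it can also match the letters inside a longer word, especially when case is ignored (a hypothetical "12 Stone Rd" contains "ne"). A possible alternative is to compare whole whitespace-separated tokens. The sketch below shows that logic in Python rather than Power Query M, purely as an illustration; the function name find_directional is hypothetical.

```python
DIRECTIONALS = {"NE", "NW", "SE", "SW"}

def find_directional(address):
    """Return the first whole token that is a directional, else None."""
    for token in address.upper().split():
        if token in DIRECTIONALS:
            return token
    return None

for address in ["357 Pyrite Dr NE", "357 NE Pyrite Dr",
                "506 Mark NW St", "6524 Camino Rojo"]:
    print(address, "->", find_directional(address))
```

Because the comparison is per token, the position of the directional inside the address does not matter, and words that merely contain those letters are ignored.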

pyspark column transformation

I have two predefined lists as below.
East = ["Bengal", "Bihar", "Assam"]
West = ["Bombay", "Gujarat", "Goa"]
I have a pyspark dataframe as below. I need to add a third column (State) whose value depends on which of the two lists contains the name in the second column (City).
df:
Num City
1 Bengal
2 Goa
3 Bombay
4 Bihar
Expected output:
Num City State
1 Bengal East
2 Goa West
3 Bombay West
4 Bihar East
Thanks
You can use the isin function.
East = ["Bengal", "Bihar", "Assam"]
West = ["Bombay", "Gujarat", "Goa"]
from pyspark.sql.functions import when, col
df.withColumn("state", when(col("City").isin(East), "East")\
.when(col("City").isin(West), "West").otherwise(None)).show()
+---+------+-----+
|Num| City|state|
+---+------+-----+
| 1|Bengal| East|
| 2| Goa| West|
| 3|Bombay| West|
| 4| Bihar| East|
+---+------+-----+
I could only do this in pandas, as below. Since the dataset is huge, I am trying to convert this into pyspark. Thanks.
Pandas code as below
def map_state(name):
    #print(name)
    East = ["Bengal", "Bihar", "Assam"]
    West = ["Bombay", "Gujarat", "Goa"]
    if name in East:
        return 'East'
    if name in West:
        return 'West'
    else:
        return name
df['State'] = df['City'].apply(map_state)
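As a side note, the chain of if tests can be collapsed into a single dictionary lookup built once from the two lists. This is only a sketch in plain Python; the name REGION_BY_CITY is hypothetical, and the fallback of returning the city name unchanged mirrors the pandas function in the question.

```python
East = ["Bengal", "Bihar", "Assam"]
West = ["Bombay", "Gujarat", "Goa"]

# Build the reverse lookup once: city name -> region label.
REGION_BY_CITY = {city: "East" for city in East}
REGION_BY_CITY.update({city: "West" for city in West})

def map_state(name):
    """Return 'East' or 'West', or the name itself if it is in neither list."""
    return REGION_BY_CITY.get(name, name)

print([map_state(city) for city in ["Bengal", "Goa", "Bombay", "Bihar"]])
```

A dictionary lookup stays constant-time per row however many regions are added, whereas each extra list adds another membership scan.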

How to delete words from a dataframe column that are present in dictionary in Pandas

An extension to :
Removing list of words from a string
I have the following dataframe and I want to delete frequently occurring words from the df.name column:
df :
name
Bill Hayden
Rock Clinton
Bill Gates
Vishal James
James Cameroon
Micky James
Michael Clark
Tony Waugh
Tom Clark
Tom Bill
Avinash Clinton
Shreyas Clinton
Ramesh Clinton
Adam Clark
I'm creating a new dataframe with words and their frequency with following code :
df = pd.DataFrame(data.name.str.split(expand=True).stack().value_counts())
df.reset_index(level=0, inplace=True)
df.columns = ['word', 'freq']
df = df[df['freq'] >= 3]
which will result in
df2 :
word freq
Clinton 4
Bill 3
James 3
Clark 3
Then I'm converting it into a dictionary with following code snippet :
d = dict(zip(df['word'], df['freq']))
Now, to remove the words in d (a dictionary of word : freq) from df.name, I'm using the following code snippet:
def check_thresh_word(merc,d):
    m = merc.split(' ')
    for i in range(len(m)):
        if m[i] in d.keys():
            return False
    else:  # for/else: reached only when no word of the name is in d
        return True
def rm_freq_occurences(merc,d):
    if check_thresh_word(merc,d) == False:
        nwords = merc.split(' ')
        rwords = [word for word in nwords if word not in d.keys()]
        m = ' '.join(rwords)
    else:
        m=merc
    return m
df['new_name'] = df['name'].apply(lambda x: rm_freq_occurences(x,d))
However, my actual dataframe (df) contains nearly 240k rows and I have to use a threshold greater than 100 (thresh=3 in the sample above).
So the above code takes a long time to run because of the repeated searching.
Is there a more efficient way to do this?
Following is a desired output :
name
Hayden
Rock
Gates
Vishal
Cameroon
Micky
Michael
Tony Waugh
Tom
Tom
Avinash
Shreyas
Ramesh
Adam
Thanks in advance!!!!!!!
Use replace with a regex built by joining all values of the word column with |, then strip the trailing whitespace:
data.name = data.name.replace('|'.join(df['word']), '', regex=True).str.strip()
Another solution is to add \s* around each word so that the surrounding whitespace is removed as well:
pat = '|'.join([r'\s*{}\s*'.format(x) for x in df['word']])
print (pat)
\s*Clinton\s*|\s*James\s*|\s*Bill\s*|\s*Clark\s*
data.name = data.name.replace(pat, '', regex=True)
print (data)
name
0 Hayden
1 Rock
2 Gates
3 Vishal
4 Cameroon
5 Micky
6 Michael
7 Tony Waugh
8 Tom
9 Tom
10 Avinash
11 Shreyas
12 Ramesh
13 Adam
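One thing to watch with a plain alternation is that it also removes the words when they occur inside longer ones (a hypothetical "Billy" would become "y"). A possible variant, sketched here with only the standard re module so it runs without pandas, wraps the alternation in \b word boundaries and escapes each word. The function name strip_words is hypothetical.

```python
import re

def strip_words(names, words):
    """Remove whole-word occurrences of `words` from each name."""
    # \b on both sides keeps 'Bill' from matching inside 'Billy'.
    pattern = re.compile(
        r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b')
    cleaned = []
    for name in names:
        without = pattern.sub('', name)
        # Collapse the whitespace left behind by the removed words.
        cleaned.append(' '.join(without.split()))
    return cleaned

frequent = ["Clinton", "Bill", "James", "Clark"]
print(strip_words(["Bill Hayden", "Tony Waugh", "Tom Bill"], frequent))
```

The same pattern string should also work with the pandas replace call shown above, since it is passed through to Python's regex engine.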

Parse Wikipedia Infobox with Go?

I am trying to parse the Infobox of some Wikipedia articles and cannot figure it out. I have downloaded the files, and for Albert Einstein my attempt to parse the Infobox looks like this:
package main

import (
	"log"
	"regexp"
)

func main() {
	st := `{{redirect|Einstein|other uses|Albert Einstein (disambiguation)|and|Einstein (disambiguation)}}
{{pp-semi-indef}}
{{pp-move-indef}}
{{Good article}}
{{Infobox scientist
| name = Albert Einstein
| image = Einstein 1921 by F Schmutzer - restoration.jpg
| caption = Albert Einstein in 1921
| birth_date = {{Birth date|df=yes|1879|3|14}}
| birth_place = [[Ulm]], [[Kingdom of Württemberg]], [[German Empire]]
| death_date = {{Death date and age|df=yes|1955|4|18|1879|3|14}}
| death_place = {{nowrap|[[Princeton, New Jersey]], U.S.}}
| children = [[Lieserl Einstein|"Lieserl"]] (1902–1903?)<br />[[Hans Albert Einstein|Hans Albert]] (1904–1973)<br />[[Eduard Einstein|Eduard "Tete"]] (1910–1965)
| spouse = [[Mileva Marić]] (1903–1919)<br />{{nowrap|[[Elsa Löwenthal]] (1919–1936)}}
| residence = Germany, Italy, Switzerland, Austria (today: [[Czech Republic]]), Belgium, United States
| citizenship = {{Plainlist|
* [[Kingdom of Württemberg]] (1879–1896)
* [[Statelessness|Stateless]] (1896–1901)
* [[Switzerland]] (1901–1955)
* Austria of the [[Austro-Hungarian Empire]] (1911–1912)
* Germany (1914–1933)
* United States (1940–1955)
}}
| ethnicity = Jewish
| fields = [[Physics]], [[philosophy]]
| workplaces = {{Plainlist|
* [[Swiss Patent Office]] ([[Bern]]) (1902–1909)
* [[University of Bern]] (1908–1909)
* [[University of Zurich]] (1909–1911)
* [[Karl-Ferdinands-Universität|Charles University in Prague]] (1911–1912)
* [[ETH Zurich]] (1912–1914)
* [[Prussian Academy of Sciences]] (1914–1933)
* [[Humboldt University of Berlin]] (1914–1917)
* [[Kaiser Wilhelm Institute]] (director, 1917–1933)
* [[German Physical Society]] (president, 1916–1918)
* [[Leiden University]] (visits, 1920–)
* [[Institute for Advanced Study]] (1933–1955)
* [[Caltech]] (visits, 1931–1933)
}}
| alma_mater = {{Plainlist|
* [[ETH Zurich|Swiss Federal Polytechnic]] (1896–1900; B.A., 1900)
* [[University of Zurich]] (Ph.D., 1905)
}}
| doctoral_advisor = [[Alfred Kleiner]]
| thesis_title = Eine neue Bestimmung der Moleküldimensionen (A New Determination of Molecular Dimensions)
| thesis_url = http://e-collection.library.ethz.ch/eserv/eth:30378/eth-30378-01.pdf
| thesis_year = 1905
| academic_advisors = [[Heinrich Friedrich Weber]]
| influenced = {{Plainlist|
* [[Ernst G. Straus]]
* [[Nathan Rosen]]
* [[Leó Szilárd]]
}}
| known_for = {{Plainlist|
* [[General relativity]] and [[special relativity]]
* [[Photoelectric effect]]
* ''[[Mass–energy equivalence|E=mc<sup>2</sup>]]''
* Theory of [[Brownian motion]]
* [[Einstein field equations]]
* [[Bose–Einstein statistics]]
* [[Bose–Einstein condensate]]
* [[Gravitational wave]]
* [[Cosmological constant]]
* [[Classical unified field theories|Unified field theory]]
* [[EPR paradox]]
}}
| awards = {{Plainlist|
* [[Barnard Medal for Meritorious Service to Science|Barnard Medal]] (1920)
* [[Nobel Prize in Physics]] (1921)
* [[Matteucci Medal]] (1921)
* [[ForMemRS]] (1921)<ref name="frs" />
* [[Copley Medal]] (1925)<ref name="frs" />
* [[Max Planck Medal]] (1929)
* [[Time 100: The Most Important People of the Century|''Time'' Person of the Century]] (1999)
}}
| signature = Albert Einstein signature 1934.svg
}}
'''Albert Einstein''' ({{IPAc-en|ˈ|aɪ|n|s|t|aɪ|n}};<ref>{{cite book|last=Wells|first=John|authorlink=John C. Wells|title=Longman Pronunciation Dictionary|publisher=Pearson Longman|edition=3rd|date=April 3, 2008|isbn=1-4058-8118-6}}</ref> {{IPA-de|ˈalbɛɐ̯t ˈaɪnʃtaɪn|lang|Albert Einstein german.ogg}}; 14 March 1879 – 18 April 1955) was a German-born<!-- Please do not change this—see talk page and its many archives.-->
[[theoretical physicist]]. He developed the [[general theory of relativity]], one of the two pillars of [[modern physics]] (alongside [[quantum mechanics]]).<ref name=frs>{{cite journal | last1 = Whittaker | first1 = E. | authorlink = E. T. Whittaker| doi = 10.1098/rsbm.1955.0005 | title = Albert Einstein. 1879–1955 | journal = [[Biographical Memoirs of Fellows of the Royal Society]] | volume = 1 | pages = 37–67 | date = 1 November 1955| jstor = 769242}}</ref><ref name="YangHamilton2010">{{cite book|author1=Fujia Yang|author2=Joseph H. Hamilton|title=Modern Atomic and Nuclear Physics|date=2010|publisher=World Scientific|isbn=978-981-4277-16-7}}</ref>{{rp|274}} Einstein's work is also known for its influence on the [[philosophy of science]].<ref>{{Citation |title=Einstein's Philosophy of Science |url=http://plato.stanford.edu/entries/einstein-philscience/#IntWasEinEpiOpp |we......
`
	re := regexp.MustCompile(`{{Infobox(?s:.*?)}}`)
	log.Println(re.FindAllStringSubmatch(st, -1))
}
I am trying to put each of the items from the infobox into a struct or a map:
m["name"] = "Albert Einstein"
m["image"] = "Einstein...."
...
...
m["death_date"] = "{{Death date and age|df=yes|1955|4|18|1879|3|14}}"
...
...
I can't even seem to isolate the infobox. I get:
[[{{Infobox scientist
| name = Albert Einstein
| image = Einstein 1921 by F Schmutzer - restoration.jpg
| caption = Albert Einstein in 1921
| birth_date = {{Birth date|df=yes|1879|3|14}}]]
The Albert Einstein entry in the API can be found at:
https://en.wikipedia.org/w/api.php?action=query&titles=Albert%20Einstein&prop=revisions&rvprop=content&format=json
EDIT:
Based on the accepted answer to this question, I tried the following regex:
(?=\{Infobox)(\{([^{}]|(?1))*\})
but get:
panic: regexp: Compile(`(?=\{Infobox)(\{([^{}]|(?1))*\})`): error parsing regexp: invalid or unsupported Perl syntax: `(?=`
EDIT #2:
If there's a way to extract the information via their API then I'll take that....I've been reading through the docs and can't find it.
I made a regex that might work for you:
^\s*\|\s*([^\s]+)\s*=\s*(\{\{Plainlist\|(?:\n\s*\*.*)*|.*)
Explanation
This part: ^\s*\|\s*([^\s]+)\s*=\s* matches the start of lines like:
| <the_label> =
Continuing on the same line, this part: (\{\{Plainlist\|(?:\n\s*\*.*)*|.*) will match lists:
{{Plainlist|
* [[Ernst G. Straus]]
* [[Nathan Rosen]]
* [[Leó Szilárd]]
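(Note that it may omit the final }}. Oh well.)
If there is no list, it matches until the end of the line.
Since the pattern starts with ^, in Go it needs the (?m) flag so that ^ matches at the start of every line rather than only at the start of the input.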
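The non-greedy attempt in the question stops at the first }} it meets, which is the end of the nested {{Birth date ...}} template, and Go's regexp package has no recursion to balance the braces. An alternative that avoids regex for the nesting is to count brace depth by hand. The sketch below shows the idea in Python only as an illustration (the same loop ports directly to Go); the function names are hypothetical, and it assumes braces and brackets in the wikitext are balanced.

```python
def extract_infobox(text):
    """Return the whole {{Infobox ...}} template, honoring nested {{ }}."""
    start = text.find("{{Infobox")
    if start == -1:
        return None
    depth, i = 0, start
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return text[start:i]
        else:
            i += 1
    return None  # braces never balanced


def infobox_fields(infobox):
    """Split the template on top-level '|' and return {name: value}."""
    body = infobox[2:-2]  # drop the outer {{ and }}
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    fields = {}
    for part in parts[1:]:  # parts[0] is the template name
        if "=" in part:
            key, value = part.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields


sample = (
    "{{Good article}}\n"
    "{{Infobox scientist\n"
    "| name = Albert Einstein\n"
    "| birth_date = {{Birth date|df=yes|1879|3|14}}\n"
    "| birth_place = [[Ulm]], [[German Empire]]\n"
    "}}\n"
    "'''Albert Einstein''' was a German-born theoretical physicist."
)
print(infobox_fields(extract_infobox(sample)))
```

Splitting only on a | at depth zero is what keeps values such as {{Birth date|df=yes|...}} or [[Lieserl Einstein|"Lieserl"]] in one piece, and it keeps multi-line {{Plainlist| ... }} values together, including the closing }}.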

Remove regex pattern from string and store in csv

I am trying to clean up a CSV by using regex. I have accomplished the first part which extracts the regex pattern from the address table and writes it to the street_numb field. The part I need help with is removing that same pattern from the street field so I only end up with the following (i.e., Steinway St, 31 St, 82nd Rd, and 19th St) stored in the street field. Hence these values would be removed (-78, -45, -35, -54) from the street field.
b  street_numb  street           address             zipcode
1  246          FIFTH AVE        246 FIFTH AVE       11215
2  30 -78       -78 STEINWAY ST  30 -78 STEINWAY ST  11016
3  25 -45       -45 31ST ST      25 -45 31ST ST      11102
4  123 -35      -35 82ND RD      123 -35 82ND RD     11415
5  22 -54       -54 19TH ST      22 -54 19TH ST      11105
Sample Data (above)
import csv
import re
path = '/Users/darchcruise/Desktop/bldg_zip_codes.csv'
with open(path, 'rU') as infile, open(path+'out.csv', 'w') as outfile:
    fieldnames = ['b', 'street_numb', 'street', 'address', 'zipcode']
    readablefile = csv.DictReader(infile)
    writablefile = csv.DictWriter(outfile, fieldnames=fieldnames)
    for row in readablefile:
        add = re.match(r'\d+\s*-\s*\d+', row['address'])
        if add:
            row['street_numb'] = add.group()
            # row['street'] = remove re.string (add.group()) from street field
            writablefile.writerow(row)
        else:
            writablefile.writerow(row)
What code in line 12 (# remove re.string from row['street']) could be used to resolve my issue (removing -78, -45, -35, -54 from the street field)?
You can use capturing group with findall like this
re.findall(r"(\d+\s*(-\s*\d+\s+)?)((\w|\s)+)", row['address'])[0][0]  # street number, e.g. '30 -78 '
re.findall(r"(\d+\s*(-\s*\d+\s+)?)((\w|\s)+)", row['address'])[0][2]  # street name, e.g. 'STEINWAY ST'
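Another option for the commented-out line in the question is to strip the leftover fragment directly from the street field with re.sub. The sketch below assumes, based on the sample data, that the unwanted part is always a leading hyphen followed by digits and a space; the helper name clean_street is hypothetical.

```python
import re

# Leading "-<digits> " fragment left in the street field, e.g. "-78 STEINWAY ST".
LEADING_FRAGMENT = re.compile(r'^\s*-\s*\d+\s+')

def clean_street(street):
    """Drop a leading '-NN ' fragment; leave other street values untouched."""
    return LEADING_FRAGMENT.sub('', street)

for street in ["-78 STEINWAY ST", "-45 31ST ST", "FIFTH AVE"]:
    print(repr(clean_street(street)))
```

Inside the loop this would be used as row['street'] = clean_street(row['street']). Rows without the fragment pass through unchanged, so the same call is safe in both branches.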