Remove Unicode literals in a Dataframe string - regex

I have a DataFrame of resumes that contain Unicode literals such as "\xe2\x80\x93".
I want to remove all of these values to prepare the text for processing.
I have tried many of the recommended ways to remove them, but none seem to work when applied to the data in my df.
Text example:
"COMPETENCIES\nBenefits Administration \xe2\x80\x93 Customer Service
\xe2\x80\x93 Cost Control \xe2\x80\x93 Recruiting \xe2\x80\x93 Acquisition Management"
The part I am finding difficult is that if I put this text in a string variable, such as y = <text>, and then use one of the following methods to deal with Unicode literals:
print(re.sub(r'[^\x00-\x7F+]',' ', y))
print(y.encode('ascii',errors='ignore').decode('ascii'))
It will output:
"CORE COMPETENCIES
Benefits Administration Customer Service Cost Control Recruiting Acquisition Management"
As expected.
When I try this on the values in my Dataframe it simply does not seem to work.
I have tried the following (df is called resume):
resume = resume.apply(lambda x : re.sub(r'[^\x00-\x7F+]',' ',x))
resume = resume.apply(lambda x: x.encode('ascii',errors='ignore').decode('ascii'))
resume = resume.replace(re.sub(r'[^\x00-\x7F+]',' ',x))
I have even tried:
for x in resume:
    x = str(x)
    x = (re.sub(r'[^\x00-\x7F+]',' ', x))
    print(x)
and:
print(re.sub(r'[^\x00-\x7F+]',' ', resume[0]))
Just to see if I could replicate the change when I apply these to a string variable but still no luck.
The DataFrame has shape (368,0).
The dtype is object; I have tried converting it to string, but I believe it always stays as object.

Can you try this:
df['text_clean'] = df['text'].apply(lambda x: x.decode('unicode_escape').\
encode('ascii', 'ignore').\
strip())
Assumptions:
The dataframe(df) has a 'text' column containing the resume strings with unicode literals.
This is how I tested it:
import pandas as pd
# sample data: the same example row repeated 5 times (not ideal, but enough for a quick test)
df = pd.DataFrame({"text": [b"COMPETENCIES\nBenefits Administration \xe2\x80\x93 Customer Service \xe2\x80\x93 Cost Control \xe2\x80\x93 Recruiting \xe2\x80\x93 Acquisition Management",
b"COMPETENCIES\nBenefits Administration \xe2\x80\x93 Customer Service \xe2\x80\x93 Cost Control \xe2\x80\x93 Recruiting \xe2\x80\x93 Acquisition Management",
b"COMPETENCIES\nBenefits Administration \xe2\x80\x93 Customer Service \xe2\x80\x93 Cost Control \xe2\x80\x93 Recruiting \xe2\x80\x93 Acquisition Management",
b"COMPETENCIES\nBenefits Administration \xe2\x80\x93 Customer Service \xe2\x80\x93 Cost Control \xe2\x80\x93 Recruiting \xe2\x80\x93 Acquisition Management",
b"COMPETENCIES\nBenefits Administration \xe2\x80\x93 Customer Service \xe2\x80\x93 Cost Control \xe2\x80\x93 Recruiting \xe2\x80\x93 Acquisition Management"]})
df['text_clean'] = df['text'].apply(lambda x: x.decode('unicode_escape').\
encode('ascii', 'ignore').\
strip())
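To see what each step of that chain does, here is a small sketch on a single value. It assumes Python 3, and the two inputs are illustrative assumptions about what a cell might hold: bytes containing the real UTF-8 bytes of an en dash, and bytes containing the backslash-escaped text of those bytes. The helper name clean_bytes is made up for this example.

```python
def clean_bytes(raw):
    """Apply the decode/encode chain from the answer to one bytes value."""
    # 1. unicode_escape interprets escape sequences such as \xe2 and maps
    #    every other byte to the code point with the same value (Latin-1).
    text = raw.decode('unicode_escape')
    # 2. Encoding to ASCII with errors='ignore' silently drops every
    #    character above 0x7F.
    ascii_bytes = text.encode('ascii', 'ignore')
    # 3. Decode back to str and trim surrounding whitespace.
    return ascii_bytes.decode('ascii').strip()

# Real UTF-8 bytes of an en dash (three bytes: 0xE2 0x80 0x93).
real_bytes = b"Benefits Administration \xe2\x80\x93 Customer Service"
# The same text with the escapes stored as literal backslash sequences.
escaped_text = rb"Benefits Administration \xe2\x80\x93 Customer Service"

print(repr(clean_bytes(real_bytes)))
print(repr(clean_bytes(escaped_text)))
```

Both inputs come out the same, with a double space where the dash used to be, which is why the chain works whether the escapes are real bytes or literal text.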
Testing the code with the Kaggle source file:
# path stores the location of the data file downloaded from kaggle
df = pd.read_csv(path)
# each value is read in as a str that still starts with the bytes-literal
# prefix 'b'; drop that first character and encode the rest to bytes
df['Resume'] = [val[1:].encode('utf-8') for val in df['Resume']]
# create a separate column with multiple decode and encode steps to
# retrieve the final clean version
df['text_clean'] = df['Resume'].apply(lambda x: x.decode('unicode_escape').\
encode('ascii', 'ignore').decode('utf-8').strip())
print(df['text_clean'])
Output Sample:
0 'John H. Smith, P.H.R.\n800-991-5187 | PO Box ...
1 'Name Surname\nAddress\nMobile No/Email\nPERSO...
2 'Anthony Brown\nHR Assistant\nAREAS OF EXPERTI...
3 'www.downloadmela.com\nSatheesh\nEMAIL ID:\nCa...
4 "HUMAN RESOURCES DIRECTOR\nExpert in organizat...
5 'John H. Smith, P.H.R.\n800-991-5187 | PO Box ...
6 'Resume of Satheesh\n\nwww.downlo\nSatheesh\n\...
7 "GM HR & ADMINISTRATION Resume Sample www.time...
8 "www.uaehrzone.com\n\nRobert Wales\nDubai\nUni...
9 "Human Resources Coordinator Resume\nExample\n...
10 'RESUME WORLD INC.\n1200 Markham Road, Suite 1...
11 'XXXXX XXXXX\nXXXX, Renton, WA 98059\nHome: XX...
12 'SATHEESH\n\nwww.downloadmela.com\n\nObjective...
13 'Alan Bloggs BE\n1Main Street, Irish Town, Co....
14 'www.downloadmela.com\nSatheesh\nSummary\n4+ y...
15 'Anthony Brown\nHR Assistant\nAREAS OF EXPERTI...
16 'T\n\nAYLOR J ONES\n15 Jinglewood Street Melbo...
17 'Human Resources Manager\nCurriculum Vitae Exa...
18 'EDMONDBRADY\n1900SummersDriveMontello,AZ55996...
19 'Jonathan Burns\n1414 Marcy Drive\n\n\n\nSomet...
20 'Jo Sample\n123 Ocean Drive\nSampleville, FL 1...
21 'Jonathan Burns\n1414 Marcy Drive\n\n\n\nExamp...
22 'Shweta XXX\nMobile: +91-98********\n\nE-mail:...
23 'www.downloadmela.com\n\nSATHEESH\nMobile :\nE...
24 "Steven B. Manning\n3249 Oral Lake Road\nMinne...
25 "www.downloadmela.com\nSatheesh\n\nE-mail:\nHa...
26 "Resume for HR Assistant\nTX\n3 Avenue,\nSale,...
27 'RESUME WORLD INC.\n1200 Markham Road, Suite 1...
28 'HOW TO WRITE A PROFESSIONAL RESUME\n\n
RESUM...
29 'RESUME WORLD INC.\n1200 Markham Road, Suite 1...
...
1189 'Joseph Andrade\nACTOR\nEmail: Jfandrade192#ou...
1190 'MARIAH FORD\nHeight: 5 4\nStars Talent Studio...
1191 'Your Name\nPhone number\nEmail address\nHeigh...
1192 'Jarien Sky-Stutts Senior 3D Artist\n\ncontac...
1193 "Gary White\nMake up artist\nAREAS OF EXPERTIS...
1194 'RESUME\nDan Platt\n5134 Oakdale Ave.\nWoodlan...
1195 "Jeff Wolverton, M.S., B.S.Cis.\nVisual FX Art...
1196 'LETA LOU GRAY\t\n\r\n\n\t\n\r\n\n\t\n\r\n\n\t...
1197 'Curriculum Vitae\nPersonal Details\n\nDarren ...
1198 'Your Name\n\nSchool Address\n123 Main Street\...
1199 'Stacy Adams\nSAG/AFTRA\nHeight:\nWeight:\nHai...
1200 'PERFORMING ARTS RESUME\nContent\nA performers...
1201 'ED WEISS\nTeaching Artist Resume\n\ne-mail: W...
1202 '8/23/2016 sample resume for painter. accounti...
1203 'KELSEY PAINTER\nBlonde Hair/Brown Eyes | Alto...
1204 'Wendy Robin\nProfessional Make-up Artist\n(70...
1205 'Chet Bailey\n100 Desert Street\nDrytown, CA 9...
1206 'Chris Flight Attendant\n11223 East South Aven...
1207 'Bilingual Flight Attendant Resume\n\nANGELICA...
1208 'Flight Attendant Resumes\nFlight-Attendant-Ca...
1209 'Emirates Flight Attendant Resume Sample\n\nAn...
1210 'Entry Level Flight Attendant Resume No Exper...
1211 'CURRICULUM VITAE\nMay 11, 2004\n\nNAME\n\nRob...
1212 'JED REDD\n\n_\n003 Boudry Lane\nFriend, TX 77...
1213 'Lauren B. Pires\nMiami, Florida & New York Ci...
1214 "Free Flight Attendant Resume\nDarlene Flint\n...
1215 'Corporate Flight Attendant Resume\nCAITLIN FL...
1216 'MAJOR CONRAD A. PREEDOM\n2354 Fairchild Dr., ...
1217 'STACY SAMPLE\n\n702 800-0000 cell\n\n0000#ema...
1218 'Entry Level Resume Guide\n\nThis packet is in...
Name: text_clean, Length: 1219, dtype: object
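Since the question title asks for a regex, here is a hedged regex-only sketch. It rests on an assumption the answer above also seems to make: the cells hold the literal characters backslash, "x" and two hex digits, not actual non-ASCII characters, which would explain why [^\x00-\x7F] finds nothing to replace. The function name strip_escaped_bytes is hypothetical.

```python
import re

# Matches one or more consecutive literal escape sequences such as
# \xe2\x80\x93: a real backslash, then "x", then two hex digits.
ESCAPED_BYTE_RUN = re.compile(r'(?:\\x[0-9a-fA-F]{2})+')

def strip_escaped_bytes(text):
    """Replace runs of literal \\xNN sequences with a single space."""
    cleaned = ESCAPED_BYTE_RUN.sub(' ', text)
    # Collapse the repeated spaces left behind by the substitution.
    return re.sub(r' {2,}', ' ', cleaned).strip()

# A raw string keeps the backslashes as real characters, mirroring the
# assumption about what the DataFrame cells contain.
sample = r"Benefits Administration \xe2\x80\x93 Customer Service \xe2\x80\x93 Cost Control"
print(strip_escaped_bytes(sample))
```

If that assumption holds, the function could be passed to apply on the text column, for example df['Resume'].apply(strip_escaped_bytes).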

Related

Power BI Matrix Visual Showing Row of Blank Values Even Though Source Data Does Not Have Blanks

I have two tables one with data about franchise locations (Franchise Profile Info) and one with Award data. Each franchise location is given a certain number of awards they are allowed to give out per year. Each franchise location rolls up to a larger group depending on where in the country they are located. These tables are in a 1 to 1 relationship using Franchise ID. I am trying to create a matrix with the number of awards, total utilized, and percentage utilized rolled up to group with the ability to expand the groups and see individual locations. For some reason when I add the value fields a blank row is created. There are not any blank rows in either of the original tables so I'm not sure where this is coming from.
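Franchise Profile Info table
ID  | Franchise Name | Group | Street Address  | City          | State
----|----------------|-------|-----------------|---------------|------
164 | Park's         | West  | 12 Park Dr.     | Los Angeles   | CA
365 | A & J          | East  | 243 Whiteoak Rd | Stafford      | VA
271 | Otto's         | South | 89 Main St.     | St. Augustine | FL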
Award table
ID  | Year | TotalAwards | Utilized
----|------|-------------|---------
164 | 2022 | 16          | 12
365 | 2022 | 5           | 5
271 | 2022 | 22          | 17
These tables are in a relationship with a 1 to 1 match on ID.
What I want the matrix to look like
Group | Total Awards | Utilized | %Awards Utilized
------|--------------|----------|-----------------
East  | 5            | 5        | 100%
West  | 16           | 12       | 75%
South | 22           | 17       | 77%
Instead what I'm getting is this
Group | Total Awards | Utilized | %Awards Utilized
------|--------------|----------|-----------------
East  | 5            | 5        | 100%
West  | 16           | 12       | 75%
South | 22           | 17       | 77%
      | 0            | 0        | 0%
I can't for the life of me figure out where this row is coming from. I can add in the Group and Franchise name as rows but as soon as I add any of the value columns this blank row shows up.
You have a value on the many side of the relationship that does not exist on the one side. A full explanation is available here: https://www.sqlbi.com/articles/blank-row-in-dax/
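One way to look for such an unmatched value outside Power BI is to compare the ID columns directly. The sketch below uses plain Python, not DAX. The three franchise IDs and the first three award rows come from the question, while the fourth award row (ID 999) is a made-up stand-in for an orphaned record, since the question's sample data contains none.

```python
# IDs from the Franchise Profile Info sample table in the question.
franchise_ids = {164, 365, 271}

# Award rows as (ID, Year, TotalAwards, Utilized). The last row uses a
# hypothetical ID (999) that has no matching franchise.
award_rows = [
    (164, 2022, 16, 12),
    (365, 2022, 5, 5),
    (271, 2022, 22, 17),
    (999, 2022, 0, 0),
]

# Any award row whose ID is missing from the franchise table would be
# grouped under a blank Group in the matrix.
orphans = [row for row in award_rows if row[0] not in franchise_ids]
print(orphans)
```

Running the same comparison on the real exported tables should list whichever IDs are feeding the blank row, if the explanation above applies.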

Regex Pattern for Dates in String

Need help debugging Regex
I have a string column in a pandas DataFrame that contains dates in the following formats. There is only one such date in each string.
Semicolons are only used to delimit the dates here and are not present in the actual strings.
04/20/2009; 04/20/09; 4/20/09; 4/3/09; 011/14/83;
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010
My job is to extract these using regex. Here is the pattern I came up with.
my_pattern = r"((?:(\d{0,2}\d)|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*?)?[, -./]{0,2}(?:(\d{1,2})[dhnst]{0,2}|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*?)[, -./]{1,2}(\d{2,4}))|(\d{4})"
sample_series.str.extract(my_pattern, expand=False)
So far, it works for every date except the format "Jan 27, 1983": it matches the month name and the day, but the year isn't matched. I am relatively new to regex and I suspect my pattern design is quite bad too. I need help figuring out what is wrong with my expression and how I could debug or improve it. Thanks.
Here is the sample data to make the problem reproducible.
sample_list = ['.Got back to U.S. Jan 27, 1983.\n',
'.On 21 Oct 1983 patient was discharged from Scroder Hospital after EIGHT DAY ADMISSION\n',
'4-13-89 Communication with referring physician?: Not Done\n',
'7intake for follow up treatment at Anson General Hospital on 10 Feb 1983 # 12 AM\n',
'. Pt diagnosed in Apr 1976 after he presented with 2 month history of headaches and gait instability. MRI demonstrated 4 cm L cereballar mass in the paravermian region. He was admitted to PRM and underwent resection complicated by post-op delirium. Post-op sequelas include left palatal myoclonus and ataxia on the left upper and lower extremities which has progressively improved. Pt has not had any evidence of tumor recurrence.\n',
'1-14-81 Communication with referring physician?: Done\n',
'. Went to Emerson, in Newfane Alaska. Started in 2002 at CNM. Generally likes job, does not have time to do what she needs to do. Feels she is working more than should be.\n',
'09/14/2000 CPT Code: 90792: With medical services\n',
'. Sep 2015- Transferred to Memorial Hospital from above. Discharged to MH Partial Hospital on Zoloft, Trazadone and Neurontin but unclear if she followed up.\n',
'Born and raised in Fowlerville, IN. Parents divorced when she was young, states that it was a "bad" divorce. Received her college degree from Allegheny College in 2003. Past verbal, emotional, physical, sexual abuse: No\n']
sample_series = pd.Series(sample_list)
From your data :
>>> import pandas as pd
>>> sample_list = ['.Got back to U.S. Jan 27, 1983.\n',
'.On 21 Oct 1983 patient was discharged from Scroder Hospital after EIGHT DAY ADMISSION\n',
'4-13-89 Communication with referring physician?: Not Done\n',
'7intake for follow up treatment at Anson General Hospital on 10 Feb 1983 # 12 AM\n',
'. Pt diagnosed in Apr 1976 after he presented with 2 month history of headaches and gait instability. MRI demonstrated 4 cm L cereballar mass in the paravermian region. He was admitted to PRM and underwent resection complicated by post-op delirium. Post-op sequelas include left palatal myoclonus and ataxia on the left upper and lower extremities which has progressively improved. Pt has not had any evidence of tumor recurrence.\n',
'1-14-81 Communication with referring physician?: Done\n',
'. Went to Emerson, in Newfane Alaska. Started in 2002 at CNM. Generally likes job, does not have time to do what she needs to do. Feels she is working more than should be.\n',
'09/14/2000 CPT Code: 90792: With medical services\n',
'. Sep 2015- Transferred to Memorial Hospital from above. Discharged to MH Partial Hospital on Zoloft, Trazadone and Neurontin but unclear if she followed up.\n',
'Born and raised in Fowlerville, IN. Parents divorced when she was young, states that it was a "bad" divorce. Received her college degree from Allegheny College in 2003. Past verbal, emotional, physical, sexual abuse: No\n']
>>> sample_series = pd.Series(sample_list)
>>> df = sample_series.to_frame()
>>> df
0
0 .Got back to U.S. Jan 27, 1983.\n
1 .On 21 Oct 1983 patient was discharged from Sc...
2 4-13-89 Communication with referring physician...
3 7intake for follow up treatment at Anson Gener...
4 . Pt diagnosed in Apr 1976 after he presented...
5 1-14-81 Communication with referring physician...
6 . Went to Emerson, in Newfane Alaska. Started ...
7 09/14/2000 CPT Code: 90792: With medical servi...
8 . Sep 2015- Transferred to Memorial Hospital f...
9 Born and raised in Fowlerville, IN. Parents d...
We can use the datefinder library to find the dates in each row:
>>> import datefinder
>>> def find_date(row):
...     return [match for match in datefinder.find_dates(row[0])]
...
>>> df["Vals"] = df.apply(find_date, axis=1)
>>> df
0 Vals
0 .Got back to U.S. Jan 27, 1983.\n [1983-01-27 00:00:00]
1 .On 21 Oct 1983 patient was discharged from Sc... [1983-10-21 00:00:00]
2 4-13-89 Communication with referring physician... [1989-04-13 00:00:00]
3 7intake for follow up treatment at Anson Gener... []
4 . Pt diagnosed in Apr 1976 after he presented... [1976-04-30 00:00:00, 2021-09-02 00:00:00, 202...
5 1-14-81 Communication with referring physician... [1981-01-14 00:00:00]
6 . Went to Emerson, in Newfane Alaska. Started ... [2002-09-30 00:00:00]
7 09/14/2000 CPT Code: 90792: With medical servi... [2000-09-14 00:00:00]
8 . Sep 2015- Transferred to Memorial Hospital f... [2015-09-30 00:00:00]
9 Born and raised in Fowlerville, IN. Parents d... [2003-09-30 00:00:00]
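For readers who want to stay with regular expressions, one way to make a pattern like the one in the question easier to debug is to build it from small alternatives that can each be tested alone. The sketch below is not the asker's pattern and not part of datefinder. The names MONTH, ALTERNATIVES and DATE_RE are made up here, the alternatives are ordered from most to least specific, and only the shortened sample rows shown were checked.

```python
import re

# Month abbreviation, optionally followed by more letters and a period.
MONTH = r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?'

# Each entry handles one family of formats and can be tested on its own.
ALTERNATIVES = [
    r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}',                    # 04/20/2009, 4-13-89
    MONTH + r'[ -]\d{1,2}(?:st|nd|rd|th)?,?[ -]\d{4}',   # Mar 20th, 2009
    r'\d{1,2} ' + MONTH + r',? \d{4}',                   # 20 March, 2009
    MONTH + r',? \d{4}',                                 # Feb 2009
    r'\d{1,2}/\d{4}',                                    # 6/2008
    r'\d{4}',                                            # 2009
]
DATE_RE = re.compile(r'\b(?:' + '|'.join(ALTERNATIVES) + r')')

# Rows taken from the question's sample data; the last two are shortened.
samples = [
    '.Got back to U.S. Jan 27, 1983.\n',
    '.On 21 Oct 1983 patient was discharged from Scroder Hospital after EIGHT DAY ADMISSION\n',
    '4-13-89 Communication with referring physician?: Not Done\n',
    '7intake for follow up treatment at Anson General Hospital on 10 Feb 1983 # 12 AM\n',
    '09/14/2000 CPT Code: 90792: With medical services\n',
    '. Sep 2015- Transferred to Memorial Hospital from above.',
    '. Went to Emerson, in Newfane Alaska. Started in 2002 at CNM.',
]

found = [DATE_RE.search(text).group(0) for text in samples]
for date in found:
    print(date)
```

With pandas, the compiled pattern could be applied with sample_series.str.extract('(' + DATE_RE.pattern + ')', expand=False), since str.extract needs a capturing group. The bare four-digit alternative is deliberately last and will also match the start of longer numbers, so it may need tightening for real data.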

Turning my web scrape data into an array

The code below pulls all the information I want; however, I want it sorted into an array so that each phone number is paired with the corresponding name, address, and description. I can't figure out a way to indent it to make it pull all 38 entries. Any help would be appreciated!
# import libraries
from selenium import webdriver
import csv
# driver path
driver = webdriver.Chrome(r'C:\Python27\Chromedriver\chromedriver.exe')
# fetch top Amsterdam restaurants
driver.get('http://www.eater.com/maps/best-amsterdam-restaurants')
for elem in driver.find_elements_by_xpath('.//h2[span[@class = "c-mapstack__card-index"]]'):
    restname = elem.text
for address in driver.find_elements_by_class_name('c-mapstack__address'):
    restaddress = address.text
for content in driver.find_elements_by_class_name('c-entry-content'):
    restdescrip = content.text
eaterarray = [restname, restaddress, restdescrip]
print eaterarray
I am aware the indenting isn't right, and I've tried several configurations but I can't seem to get it to loop right in any configuration.
First of all, if you do not want to provide the path to chromedriver in each script, just paste "chromedriver.exe" into Python's Scripts folder, i.e. "C:\Python27\Scripts".
Try this code; it should solve your problem:
from selenium import webdriver
driver = webdriver.Chrome()
driver.maximize_window()
# fetch top Amsterdam restaurants
driver.get('http://www.eater.com/maps/best-amsterdam-restaurants')
a = []  # restaurant names
b = []  # address and phone strings
c = []  # descriptions
for elem in driver.find_elements_by_xpath('.//h2[span[@class = "c-mapstack__card-index"]]'):
    restname = elem.text.encode('ascii', 'ignore')
    a.append(restname)
for address in driver.find_elements_by_class_name('c-mapstack__address'):
    restaddress = address.text.encode('ascii', 'ignore').strip()
    b.append(restaddress)
for content in driver.find_elements_by_class_name('c-entry-content'):
    restdescrip = content.text.encode('ascii', 'ignore').strip()
    c.append(restdescrip)
# pair each address with the phone number that follows it
q = [(x, y) for x, y in zip(b, b[1:]) if '+31' in y]
# insert by hand the two entries that have an address but no phone number
q.insert(21, 'Raadhuisstraat Amsterdam, Netherlands')
q.insert(25, 'Leidsestraat 94 Amsterdam, North Holland 1017 PE, Netherlands')
# drop the first content block
d = c[1:]
new_dict = dict((a[i], (d[i], q[i])) for i in range(len(a)))
for k, v in new_dict.iteritems():
    print k, v
It will print the output as:
22 Haringstal Ab Kromhout ("Contrary to popular belief, Dutch herring is not raw but salt-cured although the complex curing process does give it a raw finish on the tongue. First of the season herring are called Hollandse nieuwe and are usually available starting in early June. You can find herring stalls all over the city, but Haringstal Ab Kromhout come highly recommended. Order one au naturel or go for the traditional raw chopped onion and pickle accompaniment. Can't make it to Ab Kromhout? Kras Haring on Wittenburgergracht is also an excellent option. [$]", 'Raadhuisstraat Amsterdam, Netherlands')
25 Foodhallen ('Formerly a tram depot, De Foodhallen is now the place to get a taste of the Dutch street food scene. Theres something for everyone here: grilled cheese sandwiches (at Caulils), a bitterbal tasting (at De Ballenbar), burgers (at the Butcher), hotdogs (at Bulls & Dogs), Vietnamese street food (at Viet View), BBQ pork (at the Rough Kitchen), sweet tartlets (at Le Petit Gateau), Mediterranean snacks (at Maza), and lots more. [$$]', ('Bellamyplein 51\n1053 AT Amsterdam, Netherlands', '+31 6 29265037'))
3 Rotisserie Rijsel ('Rijsel serves Flemish and French classics like boeuf la mode, huzarensalade (Russian salad), presskop (head cheese), and rotisserie poussin, all prepared with the finest ingredients. This combined with a well-chosen and well-priced wine selection has put Rijsel on everybodys favourite list since its opening in 2012. Booking ahead is essential and (if on offer) dont think twice about ordering the Cte de Boeuf. [$$$]', ('Marcusstraat 52b\nAmsterdam, NoordHolland 1091TK, Netherlands', '+31 20 463 2142'))
9 Bord'eau - Restaurant Gastronomique ('If you can afford it, head to the two-Michelin-starred Bordeau for the ultimate fine-dining experience. Here, chef Richard van Oostenbrugge wows his guests with his incredibly skilled, classic technique-based cooking. Expect the finest produce, maximum flavors, exquisite sauces, and picture-perfect plates. In fact, Bordeaus signature apple dessert is the most photographed/Instagrammed dessert in Amsterdam. [$$$$]', ('Nieuwe Doelenstraat 2\nAmsterdam, North Holland 1012 CP, Netherlands', '+31 20 531 1705'))
10 Oriental City ('Located in Amsterdams Wallen area, Oriental City is a firm favorite with locals and Amsterdams Chinese community alike. Youll be tempted by much of the extensive menu, and Oriental Citys dim sum is among the best in Amsterdam. The restaurant has many tables divided over two floors, but still be prepared to stand in line on Saturdays. [$$$]', ('Oudezijds Voorburgwal 177-179\nAmsterdam, North Holland 1012 EV, Netherlands', '+31 20 626 8352'))
4 La Rive ('An Amsterdam fine-dining institution since the early 90s (and once the home of renowned Dutch chef Robert Kranenborg), since 2008, Rogr Rassin has been at the helm of La Rives kitchen. Dont be fooled by the traditional, ever-so-slightly formal dining room, because, au contraire, Rassins cooking is deliciously modern and seasonal. The dinner-only restaurant has a unique riverside location, so try to book a window table. [$$$$]', ('Professor Tulpplein 1\nAmsterdam, North Holland 1018 GX, Netherlands', '+31 20 520 3264'))
11 Restaurant Gebr. Hartering ('Part of Amsterdams new wave of casual and unpretentious restaurants, Gebr Hartering has helped shape the citys lively dining scene. The eatery is run by brothers Paul and Niek Hartering and the concept is very simple: hearty food cooked with great ingredients, to be enjoyed with a glass of fine wine. Theres a daily-changing menu, which includes the big-hitter Fleckvieh beef, grilled on charcoal. [$$$]', ('Peperstraat 10I\n1011 TL Amsterdam, Netherlands', '+31 20 421 0699'))
12 Nam Kee ('One of Amsterdams longest-running Chinese restaurants, Nam Kee is best known for its Peking duck window display and famous for its steamed oysters in black bean sauce fantastic oysters that owe their fame to the Dutch film (and novel) Oysters at Nam Kees. [$$$]', ('Zeedijk 111-113\nAmsterdam, North Holland 1012 AV, Netherlands', '+31 20 624 3470'))
15 Gebroeders Niemeijer ('Start your day with a cup of coffee (featuring Costadora beans) and a freshly-baked croissant at Gebr. Niemeijer bakery, or order one of its French-style breakfasts, with petit pains, croissants, marmalade, and jam. At lunchtime, Gebr. Niemeijer serves simple sandwiches and salads. Theres also a great selection of baked goods, and dont miss out on their baguettes (you know, for that Vondelpark picnic). [$]', ('Nieuwendijk 35\nAmsterdam, North Holland, Netherlands', '+31 20 707 6752'))
17 Toscanini ('Toscanini is the most-loved Italian restaurant in Amsterdams Jordaan, and with its 30-year history, probably one of the oldest, too. No pizzas here: Instead, expect a proper (seasonal) Italian menu with a choice of antipasti, primi, secondi, and dolci. Toscanini offers non-fussy food with great ingredients and maximum flavor, served in a wonderfully bustling setting. Its a great dinner spot and theres an excellent wine list, too. [$$$]', ('Lindengracht 75\nAmsterdam, North Holland 1015 KD, Netherlands', '+31 20 623 2813'))
26 FEBO ('Amsterdam is famous for its deep-fried snacks like kroket and bitterballen (both similar to croquettes) and frikandel, a type of sausage. At 75-year-old fast food chain Febo you can buy these snacks from an automat. There are branches scattered all over the city, so it shouldnt be too difficult to get your teeth into a frikandel or a kaassouffl, a pocket of deep-fried cheese. On Fridays and Saturdays some branches are open until 4 a.m., perfect for your wee-hour drunken munchies. [$]', 'Leidsestraat 94 Amsterdam, North Holland 1017 PE, Netherlands')
14 Restaurant Stork ('Hop on the IJplein ferry (near Central Station) for lunch or dinner at Stork, housed in a former Stork engines factory building on the north banks of the river IJ. Order sole or lobster with fries or tuck into a delicious plateau fruit de mer and enjoy the great views of the river. Storks riverside terrace offers a wonderful al fresco dining experience. [$$]', ('Gedempt Hamerkanaal 201\nAmsterdam, North Holland 1021 KP, Netherlands', '+31 20 634 4000'))
19 Proeflokaal Arendsnest ('For a taste of the burgeoning Dutch craft beer scene, get yourself a seat at the bar at Arendsnest. At this canal-side beer bar on the Herengracht, you can try over 30 Dutch beers on tap and no fewer than 100 bottled beers. Youll be spoiled with choices, but do try one of Jopen Brewerys award-winning beers, particularly the Extra Stout, which won a gold medal in the 2015 World Beer Awards. [$]', ('Herengracht 90\nAmsterdam, North Holland 1015 BS, Netherlands', '+31 20 421 2057'))
30 Thrill Grill ('With its first-rate burgers, Thrill Grill has rapidly become a household name for real burger lovers. Thrill Grill is the brainchild of veteran chef Robert Kranenborg, a local legend. The meat is from old Dutch dairy cows and cooked medium-rare. Get your teeth into a classic beef thriller or go for the salmon or veggie falafel burger. The branch on the Gerard Doustraat provides particularly lovely ambiance. [$$]', ('Gerard Doustraat 98\nAmsterdam, North Holland 1072VX, Netherlands', '+31 20 760 6750'))
27 Patisserie Holtkamp ('A family-owned pastry shop where Amsterdam locals go for their sweet treats, expect Patisserie Holtkamp to offer a small but superb range of French and Dutch patisserie, cakes, chocolates, and biscuits (no cupcakes here!). Holtkamp is also famous for its veal, shrimp, and cheese kroketten (croquettes), which are deep-fried to order in the shop. [$]', ('Vijzelgracht 15\nAmsterdam, North Holland 1017, Netherlands', '+31 20 624 8757'))
2 Brouwerij 't IJ ('This Amsterdam brewery has a unique canal-side location, right next to an old windmill, and the outdoor terrace is a popular hangout on sunny days. Around seven beers are available on tap, including the classic Zatte and Natte and often a special seasonal brew, too. A small selection of bar snacks is on offer, including the traditional Dutch Ossenworst, a raw and smoked beef sausage. [$]', ('Funenkade 7\nAmsterdam, North Holland 1018 AL, Netherlands', '+31 20 320 1786'))
24 Fromagerie Kef ('Cheesemonger Fromagerie Abraham Kef supplies many Michelin-starred restaurants in Amsterdam with cheese. The original shop (est. 1953) is on the Marnixstraat, but since 2014, a second branch also operates on the Czaar Peterstraat. On Sundays the Marnixstraat branch regularly organizes cheese and wine tastings. Kefs fantastic cheese selection (mainly made from raw milk) includes some magnificent aged Dutch cheeses. Dont leave without some Remeker. [$]', ('Marnixstraat\nAmsterdam, North Holland 1016 TJ, Netherlands', '+31 20 420 0097'))
18 Cafe De Klepel ('Quality wines and bistro food take the spotlight at Caf De Klepel, part of the recent Dutch bistronomie movement. This friendly and popular place is run by young sommelier duo Margot Los and Job Seuren (formerly of De Librije). Pop in for a glass of wine (at the bar) with some charcuterie or cheese. For the full experience, book a table and order De Klepels three-or four-course menu. [$$]', ('Prinsenstraat 22\nAmsterdam, North Holland 1015 DD, Netherlands', '+31 20 623 8244'))
36 Yamazato ('Yamazato provides an unexpected slice of Japan in the Dutch capital, including dining room views of a Japanese garden with a koi pond. In the evenings, the Michelin-starred Yamazato which is also in the Hotel Okura offers authentic kaiseki tasting menus, but you can also step in for lunch and order a bento box or the great value lunch menu (five courses for 50). An la carte menu including sushi and sashimi is available, too. [$$$-$$$$]', ('Ferdinand Bolstraat 333\nAmsterdam, North Holland 1072 LH, Netherlands', '+31 20 678 8351'))
38 Ron Gastrobar ('Amsterdams thriving dining scene owes a lot to Ron Blaauw. Three years ago, he relaunched his two-star restaurant into the more casual and wallet-friendly Ron Gastrobar, leaving his fine-dining years behind and at the same time launching a new trend. In fact, Michelin Netherlands is even talking about the Ron Blaauw effect. All dishes are priced at 15 (desserts 9) and the restaurant is lauded for its dry-aged barbecue steaks. [$$$]', ('Sophialaan 55 hs\nAmsterdam, North Holland 1075 BP, Netherlands', '+31 20 496 1943'))
13 Choux ('Relatively new on the Amsterdam dining scene but booked solid for dinner every night Choux serves natural wines and light, fresh cuisine, the latter always with a touch of comfort. Order three, four, or seven courses from the monthly-changing menu by chef Merijn van Berlo (including an excellent vegetarian option). For those who fail to snag a seat at dinner, theres also a three- or four- course menu available at lunchtime. [$$$]', ('De Ruijterkade 128\nAmsterdam, North Holland, Netherlands', '+31 6 16512364'))
20 La Perla ('If youre in the mood for a pizza, La Perla in the Jordaan is the place to go. The restaurant is split in two, with the pizzeria on one side of the street, and the huge wood-fired oven on the other. Try the classic Margherita (with buffalo mozzarella) or order the special porchetta di Ariccia, made with oven-roasted pork. [$$]', ('Tweede Tuindwarsstraat 14 & 53\nAmsterdam, North Holland 1015 RZ, Netherlands', '+31 20 624 8828'))
6 Slagerij de Leeuw ('Head to this butchers shop/deli if youre planning to cook a meal in your rental apartment. De Leeuw, the only gourmet butch shop in Amsterdam, offers a wide range of top-quality fresh meat and poultry, such as Wagyu and Rubia Gallega beef, Iberico pork, and Bresse chicken. But for your gourmet picnic, theres also a great selection of charcuterie, cold meats, pats, and other ready-made delicacies. [$$]', ('Utrechtsestraat 92\nAmsterdam, North Holland 1017 VS, Netherlands', '+31 20 623 0235'))
7 Librije's Zusje Amsterdam ('Literally the young sibling (zusje means little sister) of Jonnie Boers three-star restaurant De Librije in Zwolle, Librijes Zusje is located in the stunning Waldorf Astoria Hotel. Executive chef and De Librije alumnus Sidney Schutte has a modern and cutting-edge style of cooking, which shines through in all the dishes. The tasting menu has a hefty price tag, but its worth every cent. [$$$$]', ('Herengracht 542-556\nAmsterdam, North Holland 1017 CG, Netherlands', '+31 20 718 4643'))
34 Twenty Third Bar ('The Dutch cocktail scene is small but growing. The best option, if only for the amazing views, is Twenty Third Bar, situated on the 23rd floor of the Hotel Okura. The extensive cocktail list primarily features classics priced at 15 (champagne cocktails 19.50), and theres a small bar snack menu. Okuras notoriously expensive two-Michelin-starred restaurant Ciel Bleu is located on the same floor. [$$]', ('Ferdinand Bolstraat 333\nAmsterdam, North Holland, Netherlands', '+31 20 678 7450'))
16 Gs ('This funky place is an ideal spot for an American-style brunch which has grown increasingly popular here in recent years and provides respite for those in desperate need of a hangover Bloody Mary. Gs serves a full range of egg dishes; its chicken waffle burger is somewhat famous among locals. The Bloody Mary menu offers no fewer than 13 different versions. Gs has two branches. Consider booking a seat online in advance.', ('Goudsbloemstraat 91\nAmsterdam, North Holland 1015 JK, Netherlands', '+31 20 362 0030'))
8 Guts and Glory ('Guts & Glory a lively, stripped-down place just off Rembrandt Square opened by the super-talented chefs Guillaume de Beer and Freek van Noortwijk and their partner Johanneke van Iwaarden is one of the hottest places to eat in Amsterdam. Its signature is the single-ingredient menu called chapter, which changes every two to three months. After Chicken, Fish, Beef, Pork, and Vegetarian, de Beer and van Noortwijk will soon embark on chapter six: Italian. [$$-$$$]', ('Utrechtsestraat 6\nAmsterdam, North Holland, Netherlands', '+31 20 362 0030'))
31 Le Garage ('This iconic restaurant was founded by restaurateur Joop Braakhekke in 1990 in a former garage. Its famous for being a celebrity haunt, but perhaps equally famous for its dramatic red and black decor that hasnt changed since opening. Le Garage has a heavily French influenced menu (steak tartare, canard la presse, le flottante), but theres also room for modern dishes (squid carbonara, tuna pizza). [$$$]', ('Ruysdaelstraat 54-56\n1071 XE Amsterdam, Netherlands', '+31 20 679 7176'))
23 Broodjeszaak t Kuyltje ('Leisurely lunches are not part of everyday life in the Netherlands, but the Dutch do like a good sandwich, preferably on the go. The best place to get a taste of a Dutch-style sandwich is t Kuyltje. People queue up for its pastrami sandwich, but equally delicious is the Tartaar Speciaal (minced raw beef, onion, hardboiled egg) or the Halfom sandwich (half corned beef, half liver). [$]', ('Gasthuismolensteeg 9\nAmsterdam, North Holland 1016 AM, Netherlands', '+31 20 620 1045'))
35 BAK restaurant ('Bak is a pop-up turned brick-and-mortar restaurant located on the banks of the river IJ in Amsterdams recently re-developed Westelijk Havengebied area. In short: Expect serious food and serious wine, served in a laid-back setting. The menu reflects chef Benny Blistos love for seasonal and local ingredients, and on the wine list you can expect natural wines and quirky grape varieties. For lunch, BAK offers a very affordable three-course menu for 27. [$$]', ('Van Diemenstraat 410\nAmsterdam, North Holland 1013 CR, Netherlands', '+31 20 737 2553'))
21 Restaurant Breda ('A more upscale restaurant by game-changing chefs Guillaume de Beer and Freek van Noortwijk (compared to their other restaurant Guts & Glory, anyway), Bredas opening was greeted by widespread critical acclaim. Dishes are modern with creative flavor combinations, and you can taste the ambition of these young chefs. Sit down for dinner and order the Basic, Extra, or Full Monty tasting menu, and enjoy fine wines selected by sommelier Johanneke van Iwaarden. Its open daily for lunch and dinner. [$$$]', ('Singel 210\nAmsterdam, North Holland, Netherlands', '+31 20 622 5233'))
37 Restaurant Blauw ('Restaurant Blauw is an Indonesian spot renowned for its rijsttafel, which is the thing to order. Rijsttafel is a table-filling feast of small dishes, rice, and condiments, a hybrid Dutch-Indonesian tradition that originated during the Dutch colonial era. Theres a vegetarian rijsttafel option, and you can also order more traditional Indonesian dishes from the a la carte menu. Arrive hungry! [$$]', ('Amstelveenseweg 158-160\nAmsterdam, North Holland 1075 XN, Netherlands', '+31 20 675 5000'))
32 Par Hasard ('Herring and cheese aside, most people also think of fries when thinking of Amsterdam. The Belgian-style double-baked fries at Par Hasard (meaning: by accident) are regarded by many as the best fries in town. Grab an order with a traditional topping of mayonnaise, satay sauce, or zoervleis (a type of beef stew). [$]', ('Ceintuurbaan 113-115\nAmsterdam, North Holland 1072 EZ, Netherlands', '+31 20 471 4052'))
29 Conservatorium Brasserie & Lounge ('With its immense floor-to-ceiling windows and glass ceiling, this is hands-down the most impressive lobby-cum-all-day-dining-room in Amsterdam an essential part of the total experience at this cosmopolitan hotel. Enjoy drinks and snacks in the lounge area or go to the brasserie for lunch or dinner. Standout dishes include veal cheeks with mac and cheese, lobster au gratin, and apple crumble. Theres also a selection of sandwiches and steaks. [$$$]', ('Van Baerlestraat 27\nAmsterdam, North Holland 1070 LP, Netherlands', '+31 20 570 0000'))
1 Merkelbach ("Located in a former 18th-century coach house, Merkelbach's spectacular garden is hands-down the best outdoor dining experience in Amsterdam. The restaurant prides itself on following the principles of the Slow Food movement, so expect a seasonal menu with local ingredients. During the day you can walk in for coffee and apple pie, and theres a compact lunch menu. [$$]", ('Middenweg 72\nAmsterdam, North Holland 1097 BS, Netherlands', '+31 20 423 3930'))
28 Rijks at the Rijksmuseum ('Rijks brings a fresh approach to museum dining (its housed in the Rijksmuseum of Dutch art and history). On the menu designed by chef Joris Bijdendijk (formerly of the three-Michelin-starred Le Jardin de Sens and the one-starred Bridges) and his team find inventive small plates. Also featured are dishes by guest-chefs who cooked at Rijks, like Andr Chiang and Tim Raue. Definitely order the spit-roasted celeriac. [$$$]', ('Museumstraat 1\nAmsterdam, North Holland 1071 XX, Netherlands', '+31 20 674 7000'))
33 The Fat Dog ('For hot dogs, look no further than the Fat Dog, Amsterdams first-ever hot dog joint, opened by acclaimed chef/restaurateur Ron Blaauw in 2014. Order an all-pork frank with sauerkraut, mustard, and onion marmalade (called Gangs of New York) or go for the chicken Gado Gado hot dog with satay sauce, cabbage, and serundeng (spiced coconut flakes). Innovation doesnt stop there: The lamb dog comes with baba ganoush, and theres also a veggie dog. [$]', ('Ruysdaelkade 251\nAmsterdam, North Holland 1072 AX, Netherlands', '+31 20 221 6249'))
5 Patisserie Kuyt ('Follow locals and food obsessives from near and far to this fabulous patisserie for the finest pies, cakes, chocolates, biscuits, and eclairs. Kuyt also has a good selection of delicate savory pastries, quiches, and biscuits. The choice is overwhelming, but dont leave without a Appelschnitt, or better yet, enjoy any of the beautiful and delicious baked goods in the tea room. [$]', ('Utrechtsestraat 109\nAmsterdam, North Holland 1017 VL, Netherlands', '+31 20 623 4833'))
Hope this is what you want.

Self Join in Pandas: Merge all rows with the equivalent multi-index

I have one dataframe in the following form:
df = pd.read_csv('data/original.csv', sep = ',', names=["Date", "Gran", "Country", "Region", "Commodity", "Type", "Price"], header=0)
I'm trying to do a self join on the index (Date, Gran, Country, Region), producing rows of the form
Date, Gran, Country, Region, Commodity X, Type X, Price X, Commodity Y, Type Y, Price Y, Commodity Z, Type Z, Price Z
Every row should have all the different commodities and prices of a specific region.
Is there a simple way of doing this?
Any help is much appreciated!
Note: I simplified the example by ignoring a few attributes
Input Example:
Date Country Region Commodity Price
1 03/01/2014 India Vishakhapatnam Rice 25
2 03/01/2014 India Vishakhapatnam Tomato 30
3 03/01/2014 India Vishakhapatnam Oil 50
4 03/01/2014 India Delhi Wheat 10
5 03/01/2014 India Delhi Jowar 60
6 03/01/2014 India Delhi Bajra 10
Output Example:
Date Country Region Commodity1 Price1 Commodity2 Price2 Commodity3 Price3
1 03/01/2014 India Vishakhapatnam Rice 25 Tomato 30 Oil 50
2 03/01/2014 India Delhi Wheat 10 Jowar 60 Bajra 10
What you want to do is called a reshape (specifically, from long to wide). See this answer for more information.
Unfortunately, as far as I can tell, pandas doesn't have a simple one-step way to do that. I adapted the answer in the other thread to your problem:
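# Number the rows within each (Date, Country, Region) group: 0, 1, 2, ...
df['idx'] = df.groupby(['Date', 'Country', 'Region']).cumcount()
# Pivot so that each numbered row gets its own Commodity and Price columns
df.pivot(index=['Date', 'Country', 'Region'], columns='idx')[['Commodity', 'Price']]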
Does that solve your problem?
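For a self-contained check, here is a minimal sketch of the same long-to-wide reshape on the sample rows from the question. It uses set_index plus unstack instead of pivot, which is a swapped-in but equivalent route, and the names keys, wide and ordered are only illustrative. It assumes the simplified columns from the input example (no Gran or Type).

```python
import pandas as pd

# Sample rows copied from the question's input example.
df = pd.DataFrame({
    "Date": ["03/01/2014"] * 6,
    "Country": ["India"] * 6,
    "Region": ["Vishakhapatnam"] * 3 + ["Delhi"] * 3,
    "Commodity": ["Rice", "Tomato", "Oil", "Wheat", "Jowar", "Bajra"],
    "Price": [25, 30, 50, 10, 60, 10],
})

keys = ["Date", "Country", "Region"]

# Number the rows inside each key group, starting at 1.
df["idx"] = df.groupby(keys).cumcount() + 1

# Long to wide: one row per key group, one column per (field, idx) pair.
wide = df.set_index(keys + ["idx"])[["Commodity", "Price"]].unstack("idx")

# Flatten the two-level column labels into names like "Commodity1".
wide.columns = [f"{field}{i}" for field, i in wide.columns]

# Interleave the columns as Commodity1, Price1, Commodity2, Price2, ...
n = int(df["idx"].max())
ordered = [f"{field}{i}" for i in range(1, n + 1) for field in ("Commodity", "Price")]
wide = wide[ordered].reset_index()

print(wide)
```

If some regions have fewer commodities than others, the missing CommodityN/PriceN cells should come out as NaN, so you may want to fill or drop them afterwards.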

Space Formatting data to csv

For quite some time I have been trying to convert space-separated data into a CSV structure.
Initial position
The initial data table is given by:
Dr. Arun Raykar MBBS, MS - ENT 9 years experience Ear-Nose-Throat (ENT) Specialist SHAKTHI E.N.T CARE Malleswaram, Bangalore INR 250 MON-SAT7:00PM-9:00PM Book Appointment
Dr. Hema Sanath C BHMS, CFN 0 years experience Homeopath Sankirana Homeopathic Clinic Kalyan Nagar, Bangalore INR 250 MON-SAT10:00AM-2:00PM6:30PM-8:00PM Book Appointment
Dr. Hema Ahuja BDS,M Phil 33 years experience Dentist V2 E City Family Dental Center Electronics City, Bangalore INR 200 MON-SUN10:00AM-8:00PM Book Appointment
It contains a lot of extra whitespace and unnecessary information. Each record is laid out roughly like this:
Doctor's name | Degree | Years of experience | Specialization | Hospital name | Address | Fees | Schedule | and an unnecessary book appointment field.
I want to convert it to the following format
Doctor's name,Specialization,Hospital name,Address,Fees,Schedule
So the current data should look like this
Dr. Arun Raykar,Ear-Nose-Throat (ENT) Specialist,SHAKTHI E.N.T CARE,Malleswaram,INR 250,MON-SAT7:00PM-9:00PM
Dr. Hema Sanath,Homeopath,Sankirana Homeopathic Clinic,Kalyan Nagar,INR 250,MON-SAT10:00AM-2:00PM6:30PM-8:00PM
Dr. Hema Ahuja,Dentist,V2 E City Family Dental Center,Electronics City,INR 200,MON-SUN10:00AM-8:00PM
So far I have only succeeded in removing the Book Appointment field.
Problem
However, I am having difficulty identifying the hospital's name, because the spacing within it varies a lot. Is this problem feasible?
EDIT
The output of cat -A file is the following:
Dr. Arun Raykar MBBS, MS - ENT 9 years experience Ear-Nose-Throat (ENT) Specialist SHAKTHI E.N.T CARE ^I Malleswaram, Bangalore INR 250 MON-SAT7:00PM-9:00PM Book Appointment $
Dr. Hema Sanath C BHMS, CFN 0 years experience Homeopath Sankirana Homeopathic Clinic ^I Kalyan Nagar, Bangalore INR 250 MON-SAT10:00AM-2:00PM6:30PM-8:00PM Book Appointment $
Dr. Hema Ahuja BDS,M Phil 33 years experience Dentist V2 E City Family Dental Center ^I Electronics City, Bangalore INR 200 MON-SUN10:00AM-8:00PM Book Appointment
There's no straightforward way to separate the specialization from the hospital name, but with some assumptions, you could perhaps use perl to do this:
perl -pe 's/^(\S+\s+\S+\s+\S+).+experience\s([^\t]+?)\s+(\b[A-Z0-9]{2}[^\t]+?|(?:(?!\b[A-Z0-9]{2})[^\t])*)\s+\t\s+([^,]+,).+?(INR.+?PM)\s+.*/\1,\2,\3,\4\5/' file
Gives:
Dr. Arun Raykar,Ear-Nose-Throat (ENT) Specialist,SHAKTHI E.N.T CARE,Malleswaram,INR 250 MON-SAT7:00PM-9:00PM
Dr. Hema Sanath,Homeopath,Sankirana Homeopathic Clinic,Kalyan Nagar,INR 250 MON-SAT10:00AM-2:00PM6:30PM-8:00PM
Dr. Hema Ahuja,Dentist,V2 E City Family Dental Center,Electronics City,INR 200 MON-SUN10:00AM-8:00PM
Since this is a Perl regex, you can paste it into regex101 and step through it with the regex debugger to see how it works. The regex is quite straightforward, but the number of parts can make it look daunting.
Warning: The above is able to separate the specialization based on two things:
It tries to find the first occurrence of space followed by two uppercase characters or digits and starts matching as the hospital name when it finds it; or
If there are no consecutive uppercase characters or digits, it takes only the first word as the specialization and the rest as the hospital name.
I know this might not solve the whole problem, as there will always be lines that don't fit the above rules, but it can get you started on cleaning these up. If anything is separated incorrectly (i.e. when the specialization consists of more than one word and the hospital name doesn't contain two consecutive uppercase characters or digits), you will get the first word of the specialization correctly placed and the rest in the hospital name.
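If Perl is not an option, the same heuristic can be sketched in Python. This is only a sketch under the same assumptions as above: every name is "Dr." plus two words, the hospital name either starts at the first word beginning with two uppercase characters or digits or else is everything after the first specialization word, and a tab separates the hospital name from the address (as in the cat -A output). The names LINE_RE, parse_line and sample are made up for illustration. Unlike the one-liner, it also puts a comma between the fees and the schedule.

```python
import csv
import io
import re

LINE_RE = re.compile(
    r"^(?P<name>\S+\s+\S+\s+\S+)"         # "Dr." plus the next two words
    r".+experience\s"                     # skip degrees and years of experience
    r"(?P<spec>[^\t]+?)\s+"               # specialization, kept as short as possible
    r"(?P<hospital>\b[A-Z0-9]{2}[^\t]+?"  # hospital starting at two uppercase chars/digits
    r"|(?:(?!\b[A-Z0-9]{2})[^\t])*)"      # fallback when no such word exists
    r"\s+\t\s+"                           # the tab that cat -A shows as ^I
    r"(?P<address>[^,]+),"                # locality, up to the first comma
    r".+?(?P<fees>INR\s+\d+)\s+"          # fees such as "INR 250"
    r"(?P<schedule>\S+)"                  # schedule token, e.g. MON-SAT7:00PM-9:00PM
)

FIELDS = ("name", "spec", "hospital", "address", "fees", "schedule")


def parse_line(line):
    """Return the six CSV fields for one line, or None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return [m.group(k) for k in FIELDS]


# The three lines from the question, with the tab written out explicitly.
sample = [
    "Dr. Arun Raykar MBBS, MS - ENT 9 years experience Ear-Nose-Throat (ENT) Specialist "
    "SHAKTHI E.N.T CARE \t Malleswaram, Bangalore INR 250 MON-SAT7:00PM-9:00PM Book Appointment ",
    "Dr. Hema Sanath C BHMS, CFN 0 years experience Homeopath "
    "Sankirana Homeopathic Clinic \t Kalyan Nagar, Bangalore INR 250 "
    "MON-SAT10:00AM-2:00PM6:30PM-8:00PM Book Appointment ",
    "Dr. Hema Ahuja BDS,M Phil 33 years experience Dentist "
    "V2 E City Family Dental Center \t Electronics City, Bangalore INR 200 "
    "MON-SUN10:00AM-8:00PM Book Appointment",
]

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
for line in sample:
    fields = parse_line(line)
    if fields is not None:  # lines that do not fit the pattern are skipped
        writer.writerow(fields)

print(out.getvalue(), end="")
```

Lines that do not fit the pattern return None from parse_line, so they can be collected separately and fixed by hand rather than silently mangled.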
Unfortunately, based on your input, there's no reliable way to separate the specialisation from the hospital name. The other fields can be captured, albeit inelegantly, with gawk (probably >= 4.0, though I think 3.x should work):
$ awk -F" \t " -v OFS="," -v S=" " '
{
    sub(/\s+$/, "");
    split($2, Data, /[ ,]{2,}/);
    Address = Data[1];
    split($2, Data, / +/);
    nData = length(Data);
    Schedule = Data[nData - 2];
    Fees = Data[nData - 4] S Data[nData - 3];
    split($1, Data, / +/);
    Name = Data[1] S Data[2] S Data[3]; # assume all names are Dr. Xxx Xxx only
    match($1, /[0-9]+ years experience /);
    SpecializationHospital = substr($1, RSTART + RLENGTH);
    print Name, SpecializationHospital, Address, Fees, Schedule;
} ' data.txt
Dr. Arun Raykar,Ear-Nose-Throat (ENT) Specialist SHAKTHI E.N.T CARE,Malleswaram,INR 250,MON-SAT7:00PM-9:00PM
Dr. Hema Sanath,Homeopath Sankirana Homeopathic Clinic,Kalyan Nagar,INR 250,MON-SAT10:00AM-2:00PM6:30PM-8:00PM
Dr. Hema Ahuja,Dentist V2 E City Family Dental Center,Electronics City,INR 200,MON-SUN10:00AM-8:00PM