I recently started with PySpark and I'm trying to figure out regex matching.
For the regexes I've created a list, and if any of the items in the list is found in the name column, the added column must be True. The regex matching must not be case sensitive, as seen in the example below.
I have a Table with the following format:
seqno  name
1      john jones
2      John Jones
3      John Stones
4      Mary Wild
5      William Wurt
6      steven wurt
I need to change the table above to the format of the table below. This is just a small part of the actual table, so hard-coding is not going to cut it, unfortunately.
seqno  name          regex
1      john jones    True
2      John Jones    True
3      John Stones   True
4      Mary Wild     False
5      William Wurt  True
6      steven wurt   True
Here is the code to create part of the Table:
regex_list = ['john', 'wurt']
columns = ['seqno', 'name']
data = [('1', 'john jones'),
('2', 'John Jones'),
('3', 'John Stones'),
('4', 'Mary Wild'),
('5', 'William Wurt'),
('6', 'steven wurt')]
df = spark.createDataFrame(data=data, schema=columns)
I've been trying numerous approaches with .isin and .rlike but can't seem to make it work. Any help would be greatly appreciated.
Thanks in advance!
Use rlike to check whether any of the listed regexes matches the name. You can change the case of both the list entries and the column while the test happens. Code below:
from pyspark.sql.functions import col, upper
df.withColumn('regex', upper(col('name')).rlike('|'.join([x.upper() for x in regex_list]))).show()
+-----+------------+-----+
|seqno| name|regex|
+-----+------------+-----+
| 1| john jones| true|
| 2| John Jones| true|
| 3| John Stones| true|
| 4| Mary Wild|false|
| 5|William Wurt| true|
| 6| steven wurt| true|
+-----+------------+-----+
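Alternatively, you can skip the upper-casing and make the pattern itself case-insensitive. A minimal sketch, assuming the df and regex_list built in the question (rlike uses Java regex, so the (?i) flag is available):

from pyspark.sql.functions import col

# (?i) makes the whole alternation case-insensitive, so neither side needs upper().
pattern = '(?i)(' + '|'.join(regex_list) + ')'
df.withColumn('regex', col('name').rlike(pattern)).show()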
I know how to use the GROUP BY clause in the QUERY function with either a single or multiple fields. This can return the single row per grouping with the maximum value for one of the fields.
This page explains it nicely using these queries:
=query({A2:B10},"Select Col1,min(Col2) group by Col1",1)
=query({A14:C22},"Select Col1,Col2,min(Col3) group by Col1,Col2",1)
However, what if I only want a query that returns the corresponding values for the most recent row, grouped by multiple fields? Is there a query that can do this?
Example
Source Table
created_at          first_name  last_name  email           address      city     st  zip    amount
4/12/2022 19:15:00  Ava         Anderson   ava#domain.com  123 Main St  Anytown  IL  12345  1.00
8/30/2022 21:38:00  Brooklyn    Brown      bb#domain.com   234 Lake Rd  Baytown  CA  54321  2.00
2/12/2022 16:58:00  Ava         Anderson   ava#new.com     123 Main St  Anytown  IL  12345  3.00
4/28/2022 01:41:00  Brooklyn    Brown      brook#acme.com  456 Ace Ave  Bigtown  NY  23456  4.00
5/03/2022 17:10:00  Brooklyn    Brown      bb#domain.com   234 Lake Rd  Baytown  CA  54321  5.00
Desired Query Result
Group by first_name, last_name, address, city, st, and zip, but return the created_at, email, and amount for the maximum (most recent) value of created_at:
created_at          first_name  last_name  email           address      city     st  zip    amount
4/12/2022 19:15:00  Ava         Anderson   ava#domain.com  123 Main St  Anytown  IL  12345  1.00
8/30/2022 21:38:00  Brooklyn    Brown      bb#domain.com   234 Lake Rd  Baytown  CA  54321  2.00
4/28/2022 01:41:00  Brooklyn    Brown      brook#acme.com  456 Ace Ave  Bigtown  NY  23456  4.00
Is such a query possible in Google Sheets?
Use this formula
=QUERY({QUERY(A1:I, " Select max(A),min(B),min(C),min(D),min(E),min(F),min(G),min(H),min(I) Group by B,C,E,F,G,H ", 1)},
" Select * Where Col1 is Not null ")
I believe that this is the formula you need:
=ARRAY_CONSTRAIN(SORTN(SORT(
QUERY({A1:I9,INDEX(IFERROR(REGEXEXTRACT(D1:D9,"(\D+)#")))},
"where Col2 is not null"),
10,1,1,0),9^9,2,10,1),9^9,9)
(Do adjust the formula according to your ranges and locale)
For the formula to work we create the helper column
INDEX(IFERROR(REGEXEXTRACT(D1:D9,"(\D+)#"))).
We also use 9^9, which equals 387,420,489 rows, making sure that all rows are included in our sorting calculations.
Finally, in our ARRAY_CONSTRAIN function we return the first 9 columns, discarding the 10th helper column.
Functions used:
REGEXEXTRACT
IFERROR
INDEX
QUERY
SORT
SORTN
ARRAY_CONSTRAIN
I have data like this along with other columns in a pandas df.
Apologies, I haven't figured out how to present the question with code for the dataframe; this is my first post.
Location:
- Tokyo, Japan
- Sacramento, USA
- Mexico City, Mexico
- Mexico City, Mexico
- Colorado Springs, USA
- New York, USA
- Chicago, USA
Does anyone know how I could isolate the country name from the location and create a new column with just the Country Name?
Try this:
In [29]: pd.DataFrame(df.Location.str.split(',', n=1).tolist(), columns=['City', 'Country'])
Out[29]:
               City  Country
0             Tokyo    Japan
1        Sacramento      USA
2       Mexico City   Mexico
3       Mexico City   Mexico
4  Colorado Springs      USA
5          New York      USA
6           Chicago      USA
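If the goal is just to add a Country column to the existing frame rather than build a new one, here is a minimal, self-contained sketch along the same lines, using the locations listed in the question:

import pandas as pd

# Rebuild the Location column from the question.
df = pd.DataFrame({'Location': ['Tokyo, Japan', 'Sacramento, USA', 'Mexico City, Mexico',
                                'Mexico City, Mexico', 'Colorado Springs, USA',
                                'New York, USA', 'Chicago, USA']})

# Split on the comma, keep the last piece, and strip the leading space.
df['Country'] = df['Location'].str.split(',').str[-1].str.strip()
print(df)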
You can do this without any regular expressions: use String.indexOf(", ") to find the position of the separator in the string, and then use String.substring to cut the string down to just the section you need.
However, a regular expression can also do this easily, though it would likely be slower.
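In pandas terms, that index-and-substring idea might look like the sketch below (country_of is a hypothetical helper, and the Location column is assumed to match the question's data):

import pandas as pd

df = pd.DataFrame({'Location': ['Tokyo, Japan', 'New York, USA']})

def country_of(location: str) -> str:
    # Find the ', ' separator and slice past it; fall back to the whole string.
    idx = location.find(', ')
    return location[idx + 2:] if idx != -1 else location

df['Country'] = df['Location'].apply(country_of)
print(df)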
I am using this dataframe:
Fruit Date Name Number
Apples 10/6/2016 Bob 7
Apples 10/6/2016 Bob 8
Apples 10/6/2016 Mike 9
Apples 10/7/2016 Steve 10
Apples 10/7/2016 Bob 1
Oranges 10/7/2016 Bob 2
Oranges 10/6/2016 Tom 15
Oranges 10/6/2016 Mike 57
Oranges 10/6/2016 Bob 65
Oranges 10/7/2016 Tony 1
Grapes 10/7/2016 Bob 1
Grapes 10/7/2016 Tom 87
Grapes 10/7/2016 Bob 22
Grapes 10/7/2016 Bob 12
Grapes 10/7/2016 Tony 15
I would like to aggregate this by Name and then by Fruit to get a total number of Fruit per Name. For example:
Bob,Apples,16
I tried grouping by Name and Fruit but how do I get the total number of Fruit?
Use GroupBy.sum:
df.groupby(['Fruit','Name']).sum()
Out[31]:
Number
Fruit Name
Apples Bob 16
Mike 9
Steve 10
Grapes Bob 35
Tom 87
Tony 15
Oranges Bob 67
Mike 57
Tom 15
Tony 1
To specify the column to sum, use this: df.groupby(['Name', 'Fruit'])['Number'].sum()
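For reference, a minimal, self-contained sketch that rebuilds the question's Fruit/Name/Number data (the Date column is omitted since it isn't summed) and runs the grouped sum:

import pandas as pd

# Rebuild the question's data without the Date column.
df = pd.DataFrame({
    'Fruit': ['Apples'] * 5 + ['Oranges'] * 5 + ['Grapes'] * 5,
    'Name': ['Bob', 'Bob', 'Mike', 'Steve', 'Bob',
             'Bob', 'Tom', 'Mike', 'Bob', 'Tony',
             'Bob', 'Tom', 'Bob', 'Bob', 'Tony'],
    'Number': [7, 8, 9, 10, 1, 2, 15, 57, 65, 1, 1, 87, 22, 12, 15],
})

# Total Number per Fruit and Name, e.g. Apples/Bob -> 16.
print(df.groupby(['Fruit', 'Name'])['Number'].sum())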
Also, you can use the agg function:
df.groupby(['Name', 'Fruit'])['Number'].agg('sum')
If you want to keep the original columns Fruit and Name, use reset_index(). Otherwise Fruit and Name will become part of the index.
df.groupby(['Fruit','Name'])['Number'].sum().reset_index()
Fruit Name Number
Apples Bob 16
Apples Mike 9
Apples Steve 10
Grapes Bob 35
Grapes Tom 87
Grapes Tony 15
Oranges Bob 67
Oranges Mike 57
Oranges Tom 15
Oranges Tony 1
As seen in the other answers:
df.groupby(['Fruit','Name'])['Number'].sum()
Number
Fruit Name
Apples Bob 16
Mike 9
Steve 10
Grapes Bob 35
Tom 87
Tony 15
Oranges Bob 67
Mike 57
Tom 15
Tony 1
Both the other answers accomplish what you want.
You can use the pivot functionality to arrange the data in a nice table:
df.groupby(['Fruit','Name'],as_index = False).sum().pivot('Fruit','Name').fillna(0)
Name Bob Mike Steve Tom Tony
Fruit
Apples 16.0 9.0 10.0 0.0 0.0
Grapes 35.0 0.0 0.0 87.0 15.0
Oranges 67.0 57.0 0.0 15.0 1.0
df.groupby(['Fruit','Name'])['Number'].sum()
You can select different columns here, depending on which numbers you want to sum.
A variation on the .agg() function: it (1) keeps the result as a DataFrame, (2) lets you apply averages, counts, summations, etc., and (3) enables groupby on multiple columns while maintaining legibility.
df.groupby(['att1', 'att2']).agg({'att1': "count", 'att3': "sum",'att4': 'mean'})
using your values...
df.groupby(['Name', 'Fruit']).agg({'Number': "sum"})
You can set the groupby columns as the index, then use sum with level:
df.set_index(['Fruit','Name']).sum(level=[0,1])
Out[175]:
Number
Fruit Name
Apples Bob 16
Mike 9
Steve 10
Oranges Bob 67
Tom 15
Mike 57
Tony 1
Grapes Bob 35
Tom 87
Tony 15
You could also use transform() on the Number column after the groupby. This operation calculates the total number in each group with the sum function; the result is a Series with the same index as the original dataframe.
df['Number'] = df.groupby(['Fruit', 'Name'])['Number'].transform('sum')
df = df.drop_duplicates(subset=['Fruit', 'Name']).drop('Date', axis=1)
Then, you can drop the duplicate rows on column Fruit and Name. Moreover, you can drop the column Date by specifying axis 1 (0 for rows and 1 for columns).
# print(df)
Fruit Name Number
0 Apples Bob 16
2 Apples Mike 9
3 Apples Steve 10
5 Oranges Bob 67
6 Oranges Tom 15
7 Oranges Mike 57
9 Oranges Tony 1
10 Grapes Bob 35
11 Grapes Tom 87
14 Grapes Tony 15
# You could achieve the same result with functions discussed by others:
# print(df.groupby(['Fruit', 'Name'], as_index=False)['Number'].sum())
# print(df.groupby(['Fruit', 'Name'], as_index=False)['Number'].agg('sum'))
There is an official tutorial Group by: split-apply-combine talking about what you can do after group by.
If you want the aggregated column to have a custom name such as Total Number, Total, etc. (all the solutions on here result in a dataframe where the aggregate column is named Number), use named aggregation:
df.groupby(['Fruit', 'Name'], as_index=False).agg(**{'Total Number': ('Number', 'sum')})
or (if the custom name doesn't need to have a white space in it):
df.groupby(['Fruit', 'Name'], as_index=False).agg(Total=('Number', 'sum'))
This is equivalent to the SQL query:
SELECT Fruit, Name, sum(Number) AS Total
FROM df
GROUP BY Fruit, Name
Speaking of SQL, there's the pandasql module, which allows you to query pandas DataFrames in the local environment using SQL syntax. It's not part of pandas, so it will have to be installed separately.
#! pip install pandasql
from pandasql import sqldf
sqldf("""
SELECT Fruit, Name, sum(Number) AS Total
FROM df
GROUP BY Fruit, Name
""")
You can use dfsql. For your problem, it will look something like:
df.sql('SELECT fruit, sum(number) GROUP BY fruit')
https://github.com/mindsdb/dfsql
here is an article about it:
https://medium.com/riselab/why-every-data-scientist-using-pandas-needs-modin-bringing-sql-to-dataframes-3b216b29a7c0
You can use reset_index() to reset the index after the sum
df.groupby(['Fruit','Name'])['Number'].sum().reset_index()
or
df.groupby(['Fruit','Name'], as_index=False)['Number'].sum()
For quite some time I have been trying to convert space-separated data into a CSV structure.
Initial position
The initial data table is given by:
Dr. Arun Raykar MBBS, MS - ENT 9 years experience Ear-Nose-Throat (ENT) Specialist SHAKTHI E.N.T CARE Malleswaram, Bangalore INR 250 MON-SAT7:00PM-9:00PM Book Appointment
Dr. Hema Sanath C BHMS, CFN 0 years experience Homeopath Sankirana Homeopathic Clinic Kalyan Nagar, Bangalore INR 250 MON-SAT10:00AM-2:00PM6:30PM-8:00PM Book Appointment
Dr. Hema Ahuja BDS,M Phil 33 years experience Dentist V2 E City Family Dental Center Electronics City, Bangalore INR 200 MON-SUN10:00AM-8:00PM Book Appointment
It contains lots of spaces and unnecessary information throughout. The information is laid out somewhat like this:
Doctor's name | Degree | Years of experience | Specialization | Hospital name | Address | Fees | Schedule | and an unnecessary book appointment field.
I want to convert it to the following format
Doctor's name,Specialization,Hospital name,Address,Fees,Schedule
So the current data should look like this
Dr. Arun Raykar,Ear-Nose-Throat (ENT) Specialist,SHAKTHI E.N.T CARE,Malleswaram,INR 250,MON-SAT7:00PM-9:00PM
Dr. Hema Sanath,Homeopath,Sankirana Homeopathic Clinic,Kalyan Nagar,INR 250,MON-SAT10:00AM-2:00PM6:30PM-8:00PM
Dr. Hema Ahuja,Dentist,V2 E City Family Dental Center,Electronics City,INR 200,MON-SUN10:00AM-8:00PM
Till now I have succeeded in removing the Book Appointment field.
Problem
However, I am facing difficulties in isolating the hospital's name, as the spacing in it varies a lot. Is this problem feasible?
EDIT
The output of cat -A file is the following:
Dr. Arun Raykar MBBS, MS - ENT 9 years experience Ear-Nose-Throat (ENT) Specialist SHAKTHI E.N.T CARE ^I Malleswaram, Bangalore INR 250 MON-SAT7:00PM-9:00PM Book Appointment $
Dr. Hema Sanath C BHMS, CFN 0 years experience Homeopath Sankirana Homeopathic Clinic ^I Kalyan Nagar, Bangalore INR 250 MON-SAT10:00AM-2:00PM6:30PM-8:00PM Book Appointment $
Dr. Hema Ahuja BDS,M Phil 33 years experience Dentist V2 E City Family Dental Center ^I Electronics City, Bangalore INR 200 MON-SUN10:00AM-8:00PM Book Appointment
There's no straightforward way to separate the specialization from the hospital name, but with some assumptions, you could perhaps use perl to do this:
perl -pe 's/^(\S+\s+\S+\s+\S+).+experience\s([^\t]+?)\s+(\b[A-Z0-9]{2}[^\t]+?|(?:(?!\b[A-Z0-9]{2})[^\t])*)\s+\t\s+([^,]+,).+?(INR.+?PM)\s+.*/\1,\2,\3,\4\5/' file
Gives:
Dr. Arun Raykar,Ear-Nose-Throat (ENT) Specialist,SHAKTHI E.N.T CARE,Malleswaram,INR 250 MON-SAT7:00PM-9:00PM
Dr. Hema Sanath,Homeopath,Sankirana Homeopathic Clinic,Kalyan Nagar,INR 250 MON-SAT10:00AM-2:00PM6:30PM-8:00PM
Dr. Hema Ahuja,Dentist,V2 E City Family Dental Center,Electronics City,INR 200 MON-SUN10:00AM-8:00PM
And since it's perl based regex, you can use regex101 to get a glimpse of how it works through the regex debugger. The regex is quite straightforward, but the fact that there are many parts can make it appear daunting.
Warning: The above is able to separate the specialization based on two things:
It tries to find the first occurrence of space followed by two uppercase characters or digits and starts matching as the hospital name when it finds it; or
If there are no consecutive uppercase characters or digits, it takes only the first word as the specialization and the rest as the hospital name.
I know it might not solve the complete problem, as there are always lines that won't fit the above rules, but it can get you started on cleaning these up. If anything is incorrectly separated (i.e. when the specialization consists of more than one word and the hospital name doesn't start with two consecutive uppercase characters/digits), you will have one word of the specialization correctly placed and the rest in the hospital name.
Unfortunately, based on your input, there's no reliable way to separate the specialisation from the hospital name. The other fields can be captured, albeit inelegantly, with gawk (probably >= 4.0, but I think 3.x should work):
$ awk -F" \t " -v OFS="," -v S=" " '
{
sub(/\s+$/, "");
split($2, Data, /[ ,]{2,}/);
Address = Data[1];
split($2, Data, / +/);
nData = length(Data);
Schedule = Data[nData - 2];
Fees = Data[nData - 4] S Data[nData - 3];
split($1, Data, / +/);
Name = Data[1] S Data[2] S Data[3]; # assume all names are Dr. Xxx Xxx only
match($1, /[0-9]+ years experience /);
SpecializationHospital = substr($1, RSTART + RLENGTH);
print Name, SpecializationHospital, Address, Fees, Schedule;
} ' data.txt
Dr. Arun Raykar,Ear-Nose-Throat (ENT) Specialist SHAKTHI E.N.T CARE,Malleswaram,INR 250,MON-SAT7:00PM-9:00PM
Dr. Hema Sanath,Homeopath Sankirana Homeopathic Clinic,Kalyan Nagar,INR 250,MON-SAT10:00AM-2:00PM6:30PM-8:00PM
Dr. Hema Ahuja,Dentist V2 E City Family Dental Center,Electronics City,INR 200,MON-SUN10:00AM-8:00PM
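If it helps to prototype the same heuristics outside the shell, here is a rough Python sketch. The literal lines are copied from the question's cat -A output (with ^I written as a tab), and the splitting rules mirror the perl answer's assumptions, so they will misfire on lines that don't follow them; if the real file has runs of spaces, the single-space patterns would need \s+ instead:

import re

lines = [
    "Dr. Arun Raykar MBBS, MS - ENT 9 years experience Ear-Nose-Throat (ENT) Specialist SHAKTHI E.N.T CARE \t Malleswaram, Bangalore INR 250 MON-SAT7:00PM-9:00PM Book Appointment",
    "Dr. Hema Sanath C BHMS, CFN 0 years experience Homeopath Sankirana Homeopathic Clinic \t Kalyan Nagar, Bangalore INR 250 MON-SAT10:00AM-2:00PM6:30PM-8:00PM Book Appointment",
    "Dr. Hema Ahuja BDS,M Phil 33 years experience Dentist V2 E City Family Dental Center \t Electronics City, Bangalore INR 200 MON-SUN10:00AM-8:00PM Book Appointment",
]

for line in lines:
    left, right = line.split("\t", 1)
    name = " ".join(left.split()[:3])  # assume names are "Dr. First Last" only
    # Everything after "N years experience" is specialization plus hospital.
    spec_hospital = re.sub(r"^.*?\d+ years experience ", "", left).strip()
    # Hospital starts at the first word beginning with two capitals/digits;
    # otherwise only the first word counts as the specialization.
    m = re.search(r"\s+(?=[A-Z0-9]{2})", spec_hospital)
    if m:
        spec, hospital = spec_hospital[:m.start()], spec_hospital[m.end():]
    else:
        spec, _, hospital = spec_hospital.partition(" ")
    address = right.split(",", 1)[0].strip()
    fees = re.search(r"INR \d+", right).group(0)
    schedule = re.search(r"INR \d+ (\S+)", right).group(1)
    print(",".join([name, spec, hospital, address, fees, schedule]))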