Find the range of year in pandas especially with hyphen formats?

Given the data below, I want to print the list of teams that made their debut between 1934 and 1948. Since the Debut column has dtype object, I am not able to get the column data in integer form.
Team Debut
0 Real Madrid 1929
1 Barcelona 1929
2 Atletico Madrid 1929
3 Valencia 1931-32
4 Athletic Bilbao 1929
5 Sevilla 1934-35
6 Espanyol 1929
7 Real Sociedad 1929
8 Zaragoza 1939-40
9 Real Betis 1932-33
10 Deportivo La Coruna 1941-42
11 Celta Vigo 1939-40
12 Valladolid 1948-49
Can somebody please help to give an idea how to achieve it?
Thanks in advance

You can use str.extract to pull out the first year of each value and check whether it is in the required range:
mask = df['Debut'].str.extract(r'(\d+)')[0].astype(int).between(1934, 1948)
df[mask]
Team Debut
5 Sevilla 1934-35
8 Zaragoza 1939-40
10 Deportivo La Coruna 1941-42
11 Celta Vigo 1939-40
12 Valladolid 1948-49

If only the first year of the range counts, you could use between after converting to a numeric value:
year = pd.to_numeric(df.Debut.str.split('-').str[0])
teams = df.Team[year.between(1934, 1948)]
print(teams)
Output
5 Sevilla
8 Zaragoza
10 Deportivo La Coruna
11 Celta Vigo
12 Valladolid
Name: Team, dtype: object
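Both answers look only at the first year. If the whole season should count, a minimal sketch along the same lines can recover the end year as well and test for overlap with the range. It assumes the two-digit suffix is the following year in the same century (so it would not handle something like 1999-00), and the names parts, start, end and overlaps are only illustrative. On this sample the result matches the start-year filter.

```python
import pandas as pd

# A few rows copied from the question's data
df = pd.DataFrame({
    "Team": ["Real Madrid", "Valencia", "Sevilla", "Real Betis", "Valladolid"],
    "Debut": ["1929", "1931-32", "1934-35", "1932-33", "1948-49"],
})

# Split "1934-35" into a four-digit start and an optional two-digit end part.
parts = df["Debut"].str.extract(r"^(?P<start>\d{4})(?:-(?P<end>\d{2}))?$")
start = parts["start"].astype(int)

# Assumption: the suffix belongs to the same century as the start year.
# Single-year values such as "1929" have no suffix, so the end year falls
# back to the start year.
end = (start // 100 * 100 + pd.to_numeric(parts["end"])).fillna(start).astype(int)

# A season overlaps 1934..1948 if it starts by 1948 and ends in or after 1934.
overlaps = (start <= 1948) & (end >= 1934)
result = df.loc[overlaps, "Team"].tolist()
print(result)
```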

Related

Google Sheets formula for summing/averaging with specific conditions

I am hoping for a formula to take hours from the name columns and sum/average them by week, into a separate table like the 2nd one below. The formulas need to update upon changing the start and end week cells.
Body Part | Start Week | End Week | Arnold (hours) | Usain (hours) | Bob (hours)
Arms | 1 | 3 | 6 | 3 | 0
Legs | 1 | 6 | 12 | 36 | 20
Chest | 2 | 4 | 6 | 2 | 2
Booty | 4 | 6 | 9 | 12 | 3
Core | 1 | 5 | 10 | 5 | 5
Formula Needed:
Hours | Arnold | Usain | Bob
Week 1 | 6 | 8 | 4.33
Week 2 | 8 | 8.67 | 5
Week 3 | 8 | 8.67 | 5
Week 4 | 9 | 11.67 | 6
Week 5 | 7 | 11 | 5.33
Week 6 | 5 | 10 | 4.33
Bonus if there is a way to also quickly average hours by body parts if for example there are multiple Arms rows.

try:
=ARRAYFORMULA(LAMBDA(a, b, QUERY(SPLIT(FLATTEN(BYCOL(D1:F1, LAMBDA(xx, FLATTEN(IF(
IF(a>=SEQUENCE(1, MAX(a)), "Week "&TEXT(SEQUENCE(1, MAX(a))+b, "00"), )="",,
REGEXEXTRACT(OFFSET(xx,,,1), "(.+) \(")&"×"&
IF(a>=SEQUENCE(1, MAX(a)), "Week "&TEXT(SEQUENCE(1, MAX(a))+b, "00"), )&"×"&
QUERY({REGEXEXTRACT(OFFSET(xx,,,1), "(.+) \("); OFFSET(xx,1,,9^9)/(a)}, "offset 1", )))))), "×"),
"select Col2,sum(Col3) where Col3>0 group by Col2 pivot Col1"))
(C2:INDEX(C:C, MAX(ROW(C:C)*(C:C<>"")))-B2:INDEX(B:B, MAX(ROW(B:B)*(B:B<>"")))+1,
B2:INDEX(B:B, MAX(ROW(B:B)*(B:B<>"")))-1))
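Judging by the expected table, the logic is to spread each row's hours evenly over the weeks it covers and then sum what is active in each week. A rough pandas sketch of that arithmetic can help when checking the sheet's output. It is not a translation of the formula, and the names plan, per_week and weekly are only illustrative.

```python
import pandas as pd

# Source table from the question
plan = pd.DataFrame({
    "Body Part": ["Arms", "Legs", "Chest", "Booty", "Core"],
    "Start Week": [1, 1, 2, 4, 1],
    "End Week": [3, 6, 4, 6, 5],
    "Arnold": [6, 12, 6, 9, 10],
    "Usain": [3, 36, 2, 12, 5],
    "Bob": [0, 20, 2, 3, 5],
})
people = ["Arnold", "Usain", "Bob"]

# Hours per week for each row: total hours divided by the weeks it spans.
n_weeks = plan["End Week"] - plan["Start Week"] + 1
per_week = plan[people].div(n_weeks, axis=0)

# For every week, add up the per-week hours of the rows active in that week.
weeks = range(plan["Start Week"].min(), plan["End Week"].max() + 1)
weekly = pd.DataFrame({
    w: per_week[(plan["Start Week"] <= w) & (plan["End Week"] >= w)].sum()
    for w in weeks
}).T.round(2)
weekly.index = ["Week %d" % w for w in weeks]
print(weekly)
```

Changing the Start Week or End Week values and re-running reproduces the recalculation the question asks the sheet to do automatically.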

cumulative average powerbi by month

I have the dataset below.
Math Literature Biology date student
4 2 5 2019-08-25 A
4 5 4 2019-08-08 A
5 4 5 2019-08-23 A
5 5 5 2019-08-15 A
5 5 5 2019-07-19 A
5 5 5 2019-07-15 A
5 5 5 2019-07-03 A
5 5 5 2019-06-26 A
1 1 2 2019-06-18 A
2 3 3 2019-06-14 A
5 5 5 2019-05-01 A
2 1 3 2019-04-26 A
I need to build a solution in Power BI whose output is the cumulative average per subject per month.
For example:
April May June July August
Math | 2 3.5 3 3.75 4
Literature | 1 3 3 3.75 3.83
Biology | 3 4 3.6 4.125 4.33
Can you help?

You can use a matrix visualization for this.
Create a month-year column and use it in the Columns field.
Use the Average of Math, Literature and Biology in the Values field.
Under the Format pane --> Values --> Show on rows, turn this option on.
This should give the view you are looking for. You can edit the value headers to suit your requirement.
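Note that the expected figures are running averages over all rows up to and including each month, which differs from a plain per-month average (for May alone, Math averages 5, while the expected value is 3.5). As a cross-check outside Power BI, the arithmetic can be sketched in pandas. This is not a DAX measure, and the names scores, by_month and cumulative_avg are only illustrative.

```python
import pandas as pd

# Data copied from the question (student A only)
scores = pd.DataFrame({
    "Math":       [4, 4, 5, 5, 5, 5, 5, 5, 1, 2, 5, 2],
    "Literature": [2, 5, 4, 5, 5, 5, 5, 5, 1, 3, 5, 1],
    "Biology":    [5, 4, 5, 5, 5, 5, 5, 5, 2, 3, 5, 3],
    "date": pd.to_datetime([
        "2019-08-25", "2019-08-08", "2019-08-23", "2019-08-15",
        "2019-07-19", "2019-07-15", "2019-07-03", "2019-06-26",
        "2019-06-18", "2019-06-14", "2019-05-01", "2019-04-26",
    ]),
})
subjects = ["Math", "Literature", "Biology"]

# Per-month totals and row counts, accumulated in calendar order.
by_month = scores.groupby(scores["date"].dt.to_period("M"))[subjects]
running_sum = by_month.sum().cumsum()
running_count = by_month.count().cumsum()

# Cumulative average = running total / running number of rows.
cumulative_avg = (running_sum / running_count).T
print(cumulative_avg.round(3))
```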

Sum 5 rows at a time in an ordered SAS table with no unique identifier using proc sql

I'm working with a SAS table where I have ordered data that I need to sum in intervals of 5. I don't have a unique ID I can use for the group by statement and I'm struggling to find a solution.
Say I have this table
Number Name X Y
1 Susan 2 1
2 Susan 3 3
3 Susan 3 3
4 Susan 4 1
5 Susan 1 2
6 Susan 1 1
7 Susan 1 1
8 Susan 2 4
9 Susan 1 5
10 Susan 4 2
1 Steve 2 4
2 Steve 2 3
3 Steve 1 2
4 Steve 3 5
5 Steve 1 1
6 Steve 1 3
7 Steve 2 3
8 Steve 2 4
9 Steve 1 1
10 Steve 1 1
I'd want the output to look like
Number Name X Y
1-5 Susan 13 10
6-10 Susan 9 13
1-5 Steve 9 15
6-10 Steve 7 12
Is there an easy way to get output like this using proc sql? Thanks!

Try this:
proc sql;
select ceil(Number/5) as Grouping, Name, sum(X), sum(Y)
from have
group by Name, Grouping;
quit;
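For comparison, the same ceil(Number/5) grouping idea can be sketched in pandas, which also makes it easy to build the "1-5" style labels from the desired output. This is only an illustration of the logic, not SAS code, and the block and Range names are made up for the example.

```python
import numpy as np
import pandas as pd

# Data copied from the question
have = pd.DataFrame({
    "Number": list(range(1, 11)) * 2,
    "Name": ["Susan"] * 10 + ["Steve"] * 10,
    "X": [2, 3, 3, 4, 1, 1, 1, 2, 1, 4, 2, 2, 1, 3, 1, 1, 2, 2, 1, 1],
    "Y": [1, 3, 3, 1, 2, 1, 1, 4, 5, 2, 4, 3, 2, 5, 1, 3, 3, 4, 1, 1],
})

# Same idea as ceil(Number/5): rows 1-5 fall in block 1, rows 6-10 in block 2.
block = np.ceil(have["Number"] / 5).astype(int)

# Turn the block number into a "first-last" label such as "1-5".
have["Range"] = (block * 5 - 4).astype(str) + "-" + (block * 5).astype(str)

# sort=False keeps the groups in their order of first appearance.
out = have.groupby(["Name", "Range"], sort=False)[["X", "Y"]].sum().reset_index()
print(out)
```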

how to create combinatorial combination of two files

I did some research but I am having difficulty finding an answer.
I am using Python 2.7 and pandas so far, but I am still learning.
I have two CSVs; let's say one holds the alphabet A-Z and the second holds the numbers 0-100.
I want to combine the two files to get A0 to A100, and so on through Z.
For information, the two files contain DNA sequences, so I believe they are strings.
I tried to create arrays with numpy and build a matrix, but to no avail.
Here is a preview of the files:
barcode
0 GGAAGAA
1 CCAAGAA
2 GAGAGAA
3 AGGAGAA
4 TCGAGAA
5 CTGAGAA
6 CACAGAA
7 TGCAGAA
8 ACCAGAA
9 GTCAGAA
10 CGTAGAA
11 GCTAGAA
12 GAAGGAA
13 AGAGGAA
14 TCAGGAA
659
barcode
0 CGGAAGAA
1 GCGAAGAA
2 GGCAAGAA
3 GGAGAGAA
4 CCAGAGAA
5 GAGGAGAA
6 ACGGAGAA
7 CTGGAGAA
8 CACGAGAA
9 AGCGAGAA
10 TCCGAGAA
11 GTCGAGAA
12 CGTGAGAA
13 GCTGAGAA
14 CGACAGAA
1995

Here is the way I found to do it; there might be a neater way:
index = pd.MultiIndex.from_product([df8.barcode, df7.barcode], names = ["df8", "df7"])
df = pd.DataFrame(index = index).reset_index()
def concat_BC(x):  # concatenate the two sequences into one new column
    return str(x["df8"]) + str(x["df7"])
df["BC"] = df.apply(concat_BC, axis=1)
– Stephane Chiron
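Since both columns hold strings, the row-wise apply can probably be replaced by a plain column addition, which pandas performs element-wise. A minimal sketch, using the first few rows of each preview as stand-in data and assuming df7 is the file with 7-letter barcodes and df8 the one with 8-letter barcodes (the question does not say which is which):

```python
import pandas as pd

# Stand-ins for the two files (first rows of each preview in the question).
# Assumption: df7 holds the 7-letter barcodes, df8 the 8-letter ones.
df7 = pd.DataFrame({"barcode": ["GGAAGAA", "CCAAGAA", "GAGAGAA"]})
df8 = pd.DataFrame({"barcode": ["CGGAAGAA", "GCGAAGAA"]})

# Every (df8, df7) pair, one row per combination.
index = pd.MultiIndex.from_product(
    [df8["barcode"], df7["barcode"]], names=["df8", "df7"]
)
pairs = pd.DataFrame(index=index).reset_index()

# Adding two string columns concatenates them row by row.
pairs["BC"] = pairs["df8"] + pairs["df7"]
print(pairs)
```

The result has len(df8) * len(df7) rows, so with the full files it grows quickly.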

Finding the max(latest) date out of a column of dates then grouping them by employee

Importing the data frame
df = pd.read_csv("C:\\Users")
Printing the list of employees usernames
print (df['AssignedTo'])
Returns:
Out[4]:
0 vaughad
1 channln
2 stalasi
3 mitras
4 martil
5 erict
6 erict
7 channln
8 saia
9 channln
10 roedema
11 vaughad
Printing the dates
print (df['FilledEnd'])
Returns:
Out[6]:
0 2015-11-05
1 2016-05-27
2 2016-04-26
3 2016-02-18
4 2016-02-18
5 2015-11-02
6 2016-01-14
7 2015-12-15
8 2015-12-31
9 2015-10-16
10 2016-01-07
11 2015-11-20
Now I need to collect the latest date per employee.
I have tried:
MaxDate = max(df.FilledEnd)
But this just returns one date for all employees.
The data set contains multiple employees with different dates. In a new column named "LatestDate" I need the latest date for each employee: for "vaughad" it would return "2015-11-20" on all of the "vaughad" records, and for "channln" it would return "2016-05-27" on all of the "channln" records.

You need to group your data first, using DataFrame.groupby(), after which you can produce aggregate values, like the maximum date in the FilledEnd series:
df.groupby('AssignedTo')['FilledEnd'].max()
This produces a series, with AssignedTo as the index, and the latest date for each of those employees as the values:
>>> df.groupby('AssignedTo')['FilledEnd'].max()
AssignedTo
channln 2016-05-27
erict 2016-01-14
martil 2016-02-18
mitras 2016-02-18
roedema 2016-01-07
saia 2015-12-31
stalasi 2016-04-26
vaughad 2015-11-20
Name: FilledEnd, dtype: object
If you wanted to add those max dates values back to the dataframe, use groupby(...).transform() with numpy.max instead, so you get a series with the same indices:
df['MaxDate'] = df.groupby('AssignedTo')['FilledEnd'].transform(np.max)
This adds in a MaxDate column:
AssignedTo FilledEnd MaxDate
0 vaughad 2015-11-05 2015-11-20
1 channln 2016-05-27 2016-05-27
2 stalasi 2016-04-26 2016-04-26
3 mitras 2016-02-18 2016-02-18
4 martil 2016-02-18 2016-02-18
5 erict 2015-11-02 2016-01-14
6 erict 2016-01-14 2016-01-14
7 channln 2015-12-15 2016-05-27
8 saia 2015-12-31 2015-12-31
9 channln 2015-10-16 2016-05-27
10 roedema 2016-01-07 2016-01-07
11 vaughad 2015-11-20 2015-11-20
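The dtype: object in the output above suggests the dates are stored as strings. ISO-formatted strings happen to sort correctly as text, but parsing them first makes the maximum robust to other formats. A self-contained sketch using the question's data, with the string "max" passed to transform as an alternative to numpy.max and the LatestDate column name taken from the question:

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    "AssignedTo": ["vaughad", "channln", "stalasi", "mitras", "martil", "erict",
                   "erict", "channln", "saia", "channln", "roedema", "vaughad"],
    "FilledEnd": ["2015-11-05", "2016-05-27", "2016-04-26", "2016-02-18",
                  "2016-02-18", "2015-11-02", "2016-01-14", "2015-12-15",
                  "2015-12-31", "2015-10-16", "2016-01-07", "2015-11-20"],
})

# Parse the strings so comparisons happen on real dates, not text.
df["FilledEnd"] = pd.to_datetime(df["FilledEnd"])

# transform("max") broadcasts each employee's latest date back onto every row.
df["LatestDate"] = df.groupby("AssignedTo")["FilledEnd"].transform("max")
print(df)
```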
