Pandas - Best way to deal with this data - python-2.7

I hopefully have a very simple question. My company uses an employee shift management software known as Humanity, which produces hour reports in a really unusable format.
I need to clean it up so I can use it in the rest of my analysis, but I am at a loss as to the best way to do this. The data starts out looking like this:
Name | Total | Start (Sep 1, 2017) | End (Sep 1, 2017) | Hrs (Sep 1, 2017)
User 1 | 12 | 06:00 | 18:30 | 13
User 2 | 0 | | |
There are obviously many more users and many more dates but it repeats across the columns for additional dates. Here is what I have done to clean it up so far:
import re
import pandas as pd
data = pd.read_csv("TestReport.csv")
del data["Total"]
# Drop the per-day "Hrs (...)" columns
cols = [c for c in data.columns if c.lower()[:3] != 'hrs']
data = data[cols]
# Strip "Start (", "End (" and ")" so each column name is just the date
data.rename(columns=lambda x: re.sub(r'Start \(', '', x), inplace=True)
data.rename(columns=lambda x: re.sub(r'End \(', '', x), inplace=True)
data.rename(columns=lambda x: re.sub(r'\)', '', x), inplace=True)
data.fillna(0, inplace=True)
My end goal is to create datetime fields for the start and finish times for each day for each user. With my column names now being a pure month, day, year, I think the best way would be to iterate over each row, concatenate the column name and the row value, and convert the result to a datetime.
However, I am not sure how best to go about this, or whether this is even the best approach.
The most important thing for me is that each user has a combined start and finish date time to be able to use to further analyze their efficiency during their shift on different records.
Let me know if I can provide any more details,
Thank you!
Andy McMaster
*******************Edited to show example*********************
Ideally the end goal is to create a series of date ranges for each user. I need to be able to compare these series to my dataframe, which holds records of all employees' work, and then assign each record to the user (team lead) who managed that record.
End would ideally be
Name | Total | Start (Sep 1, 2017) | End (Sep 1, 2017) | Hrs (Sep 1, 2017)
User 1 | 12 | 06:00 Sep 1, 2017 | 18:30 Sep 1, 2017 | 13
User 2 | 0 | | |

All - I found what is, at least in my opinion, the best way to solve this issue. I stick with the same data cleaning, then use a small piece of code like this to build a de-duplicated list of date columns and prepend each date to its time values.
month_list = data.columns.tolist()
month_list.remove('Name')
new_list = []
for i in month_list:
    if i not in new_list:
        new_list.append(i)
for i in new_list:
    data[i] = i + " " + data[i].astype(str)
This produces data that looks like:
Name Sep 1, 2017 Sep 1, 2017 Sep 2, 2017 \
0 User 1 Sep 1, 2017 6:00 Sep 1, 2017 18:30 Sep 2, 2017 6:00
1 User 2 Sep 1, 2017 0 Sep 1, 2017 0 Sep 2, 2017 0
2 User 3 Sep 1, 2017 0 Sep 1, 2017 0 Sep 2, 2017 0
3 User 4 Sep 1, 2017 0 Sep 1, 2017 0 Sep 2, 2017 0
4 User 5 Sep 1, 2017 6:00 Sep 1, 2017 12:00 Sep 2, 2017 6:00
Next steps will involve reworking my code to remove all zero times, or creating a filter down the road so that as I go through each user I only use the times they actually worked.
Hopefully this will help someone if they ever have a poorly designed timesheet they need to work with.
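For anyone who wants the combined start and end datetimes directly, here is a minimal sketch of the same idea in plain Python (standard library only, and Python 3 rather than the 2.7 in the question). It assumes the header cells read exactly `Start (<date>)` and `End (<date>)` with dates like `Sep 1, 2017`, and that empty cells mean no shift. The function name `parse_shifts` and the sample CSV are made up for illustration.

```python
import csv
import io
import re
from datetime import datetime

# Header cells look like "Start (Sep 1, 2017)"; capture the kind and the date.
HEADER_RE = re.compile(r"^(Start|End) \((.+)\)$")


def parse_shifts(csv_file):
    """Return one dict per user and day worked, with full start/end datetimes."""
    reader = csv.reader(csv_file)
    header = [cell.strip() for cell in next(reader)]
    name_col = header.index("Name")

    # Map each date string to the column positions of its Start and End cells.
    columns = {}
    for pos, cell in enumerate(header):
        m = HEADER_RE.match(cell)
        if m:
            columns.setdefault(m.group(2), {})[m.group(1)] = pos

    fmt = "%b %d, %Y %H:%M"  # e.g. "Sep 1, 2017 06:00"
    shifts = []
    for row in reader:
        for day, cols in columns.items():
            start = row[cols["Start"]].strip()
            end = row[cols["End"]].strip()
            if not start or not end:
                continue  # no shift recorded for this user on this day
            shifts.append({
                "name": row[name_col].strip(),
                "start": datetime.strptime(day + " " + start, fmt),
                "end": datetime.strptime(day + " " + end, fmt),
            })
    return shifts


# Hypothetical sample in the same shape as the report shown in the question.
sample = io.StringIO(
    'Name,Total,"Start (Sep 1, 2017)","End (Sep 1, 2017)","Hrs (Sep 1, 2017)"\n'
    "User 1,12,06:00,18:30,13\n"
    "User 2,0,,,\n"
)
shifts = parse_shifts(sample)
```

This yields one long record per user and day instead of a wide table, so rows without a shift simply never appear and no zero-filtering is needed afterwards. Shifts that run past midnight are not handled here; the end time would need a day added in that case.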

Related

Can I use a visualization as a filter for another visualization in PowerBI?

I created two Vizs in Power BI.
1st Viz: Med with 0% Scans
Filters = MAR Date (Dec 1, 2021 to Feb 28, 2022)
2nd Viz: Scan % by Med
Filters = MAR Date (Jan 1, 2021 to Feb 28, 2022)
DispalyName = Ceftriaxone
If I select Ceftriaxone on 1st Viz - how do I filter the 2nd Viz for Ceftriaxone without affecting the date? I want to keep the MAR dates different.
When I select Ceftriaxone on 1st Viz - even though the MAR dates are different for the 2nd Viz it keeps defaulting to the 1st Viz's MAR date. Is there a way to prevent 1st Viz MAR date from affecting the other graphs?

Count streaks or consecutive repeat of values DAX

I have a table that is quite simple:
User | Date | Status
John | May 2021 | Success
Doe | May 2021 | Fail
John | Aug 2021 | Fail
Doe | Aug 2021 | Fail
Doe | Sep 2021 | Success
Doe | Oct 2021 | Success
John | Oct 2021 | Fail
I want to count how many times a user fails repeatedly but reset the count when it succeeds.
In the example above I would want to add a column like this :
User | Date | Status | Streak
John | May 2021 | Success | 0
Doe | May 2021 | Fail | 0
John | Aug 2021 | Fail | 0
Doe | Aug 2021 | Fail | 1
Doe | Sep 2021 | Success | 0
Doe | Oct 2021 | Success | 0
John | Oct 2021 | Fail | 1
This streak count has to keep increasing even if the user does not appear in a given month, as in the example shown. I can not use power query, and my main concern is the gaps between dates, since a user's streak can span tests taken months apart.
How about this?
Table.AddColumn(
    #"Previous Step",
    "Streak",
    (r) => if r[Status] = "Fail"
        then List.Count(
            List.LastN(
                Table.SelectRows(
                    #"Previous Step",
                    each [User] = r[User] and [Date] < r[Date]
                )[Status],
                each _ = "Fail"
            )
        )
        else 0,
    Int64.Type
)
If the [Status] is "Fail", then it will take the table from the #"Previous Step" and Table.SelectRows just the rows where the [User] is the same as in the current row and the [Date] is before the date in the current row and return just the [Status] column from that filtered table. Treating this single column as a list, it then takes the List.LastN occurrences of "Fail" from that list and does a List.Count of how many of those there are.
Can you add a conditional column?
Something like: if Status = "Success" then the new column is 0.
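To pin down the rule itself independently of DAX or Power Query, here is a small reference sketch in Python. It assumes the streak for a "Fail" row is the number of consecutive fails by the same user immediately before it, and 0 for any "Success" row, which is what the expected table in the question shows. The function name `add_streaks` and the use of the first of each month as the date are illustrative choices only.

```python
from collections import defaultdict
from datetime import date


def add_streaks(rows):
    """rows: iterable of (user, date, status).

    Returns a list of (user, date, status, streak) ordered by date.
    """
    running = defaultdict(int)  # consecutive fails seen so far, per user
    out = []
    # sorted() is stable, so rows sharing a date keep their input order.
    for user, when, status in sorted(rows, key=lambda r: r[1]):
        if status == "Fail":
            out.append((user, when, status, running[user]))
            running[user] += 1
        else:
            out.append((user, when, status, 0))
            running[user] = 0  # a success resets the streak
    return out


rows = [
    ("John", date(2021, 5, 1), "Success"),
    ("Doe", date(2021, 5, 1), "Fail"),
    ("John", date(2021, 8, 1), "Fail"),
    ("Doe", date(2021, 8, 1), "Fail"),
    ("Doe", date(2021, 9, 1), "Success"),
    ("Doe", date(2021, 10, 1), "Success"),
    ("John", date(2021, 10, 1), "Fail"),
]
streaks = [r[3] for r in add_streaks(rows)]
```

Because the counter is kept per user and only looks at that user's previous rows, months in which a user has no record do not break the streak.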

Regex Pattern for Dates in String

Need help debugging Regex
I have a string column in a pandas data frame that contains dates formatted as follows. There is only one such date in each string.
Semicolons are only used to delimit the dates here and are not present in the actual strings.
04/20/2009; 04/20/09; 4/20/09; 4/3/09; 011/14/83;
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010
My job is to extract these using regex. Here is the pattern I came up with.
my_pattern = r"((?:(\d{0,2}\d)|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*?)?[, -./]{0,2}(?:(\d{1,2})[dhnst]{0,2}|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*?)[, -./]{1,2}(\d{2,4}))|(\d{4})"
sample_series.str.extract(my_pattern, expand=False)
So far, I see it work for every date except for the format "Jan 27, 1983", it matches the month name and the date. But the year isn't matched. I am relatively new to regex and I think my pattern design is quite bad too. I need help figuring out what's wrong with my regex expression and how I could debug or improve it. Thanks.
Here is the sample data to make the problem reproducible.
sample_list = ['.Got back to U.S. Jan 27, 1983.\n',
'.On 21 Oct 1983 patient was discharged from Scroder Hospital after EIGHT DAY ADMISSION\n',
'4-13-89 Communication with referring physician?: Not Done\n',
'7intake for follow up treatment at Anson General Hospital on 10 Feb 1983 # 12 AM\n',
'. Pt diagnosed in Apr 1976 after he presented with 2 month history of headaches and gait instability. MRI demonstrated 4 cm L cereballar mass in the paravermian region. He was admitted to PRM and underwent resection complicated by post-op delirium. Post-op sequelas include left palatal myoclonus and ataxia on the left upper and lower extremities which has progressively improved. Pt has not had any evidence of tumor recurrence.\n',
'1-14-81 Communication with referring physician?: Done\n',
'. Went to Emerson, in Newfane Alaska. Started in 2002 at CNM. Generally likes job, does not have time to do what she needs to do. Feels she is working more than should be.\n',
'09/14/2000 CPT Code: 90792: With medical services\n',
'. Sep 2015- Transferred to Memorial Hospital from above. Discharged to MH Partial Hospital on Zoloft, Trazadone and Neurontin but unclear if she followed up.\n',
'Born and raised in Fowlerville, IN. Parents divorced when she was young, states that it was a "bad" divorce. Received her college degree from Allegheny College in 2003. Past verbal, emotional, physical, sexual abuse: No\n']
sample_series = pd.Series(sample_list)
From your data :
>>> import pandas as pd
>>> sample_list = ['.Got back to U.S. Jan 27, 1983.\n',
'.On 21 Oct 1983 patient was discharged from Scroder Hospital after EIGHT DAY ADMISSION\n',
'4-13-89 Communication with referring physician?: Not Done\n',
'7intake for follow up treatment at Anson General Hospital on 10 Feb 1983 # 12 AM\n',
'. Pt diagnosed in Apr 1976 after he presented with 2 month history of headaches and gait instability. MRI demonstrated 4 cm L cereballar mass in the paravermian region. He was admitted to PRM and underwent resection complicated by post-op delirium. Post-op sequelas include left palatal myoclonus and ataxia on the left upper and lower extremities which has progressively improved. Pt has not had any evidence of tumor recurrence.\n',
'1-14-81 Communication with referring physician?: Done\n',
'. Went to Emerson, in Newfane Alaska. Started in 2002 at CNM. Generally likes job, does not have time to do what she needs to do. Feels she is working more than should be.\n',
'09/14/2000 CPT Code: 90792: With medical services\n',
'. Sep 2015- Transferred to Memorial Hospital from above. Discharged to MH Partial Hospital on Zoloft, Trazadone and Neurontin but unclear if she followed up.\n',
'Born and raised in Fowlerville, IN. Parents divorced when she was young, states that it was a "bad" divorce. Received her college degree from Allegheny College in 2003. Past verbal, emotional, physical, sexual abuse: No\n']
>>> sample_series = pd.Series(sample_list)
>>> df = sample_series.to_frame()
>>> df
0
0 .Got back to U.S. Jan 27, 1983.\n
1 .On 21 Oct 1983 patient was discharged from Sc...
2 4-13-89 Communication with referring physician...
3 7intake for follow up treatment at Anson Gener...
4 . Pt diagnosed in Apr 1976 after he presented...
5 1-14-81 Communication with referring physician...
6 . Went to Emerson, in Newfane Alaska. Started ...
7 09/14/2000 CPT Code: 90792: With medical servi...
8 . Sep 2015- Transferred to Memorial Hospital f...
9 Born and raised in Fowlerville, IN. Parents d...
We can use a tool called datefinder to find the date in each row :
>>> import datefinder
>>> def find_date(df):
...     return [match for match in datefinder.find_dates(df[0])]
>>> df["Vals"] = df.apply(find_date, axis=1)
>>> df
0 Vals
0 .Got back to U.S. Jan 27, 1983.\n [1983-01-27 00:00:00]
1 .On 21 Oct 1983 patient was discharged from Sc... [1983-10-21 00:00:00]
2 4-13-89 Communication with referring physician... [1989-04-13 00:00:00]
3 7intake for follow up treatment at Anson Gener... []
4 . Pt diagnosed in Apr 1976 after he presented... [1976-04-30 00:00:00, 2021-09-02 00:00:00, 202...
5 1-14-81 Communication with referring physician... [1981-01-14 00:00:00]
6 . Went to Emerson, in Newfane Alaska. Started ... [2002-09-30 00:00:00]
7 09/14/2000 CPT Code: 90792: With medical servi... [2000-09-14 00:00:00]
8 . Sep 2015- Transferred to Memorial Hospital f... [2015-09-30 00:00:00]
9 Born and raised in Fowlerville, IN. Parents d... [2003-09-30 00:00:00]
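If a regex is still preferred, here is a minimal sketch that uses only the standard `re` module. One plausible reason the original pattern stops at "Jan 27" is that its optional leading separator class lets a match begin at the ". " before "Jan", so "Jan" is consumed as the second component and "27" as the year. The sketch avoids that by anchoring every alternative with `\b` and listing the alternatives from most to least specific. The names `MONTH`, `DATE_RE` and `extract_date` are made up here, the samples are shortened from the question's data, and the pattern is not claimed to cover every format listed (for example "011/14/83").

```python
import re

# Month abbreviation, optional rest of the name, optional trailing period.
MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?"

# Alternatives are ordered from most to least specific so that a full date
# is preferred over a bare "Mon YYYY" or a lone year at the same position.
DATE_RE = re.compile(
    "|".join([
        r"\b" + MONTH + r"[ -]\d{1,2}(?:st|nd|rd|th)?,?[ -]\d{4}",  # Mar 20th, 2009
        r"\b\d{1,2} " + MONTH + r",? \d{4}",                        # 20 March, 2009
        r"\b" + MONTH + r",? \d{4}",                                # Feb 2009
        r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}",                         # 04/20/09
        r"\b\d{1,2}/\d{4}",                                         # 6/2008
        r"\b(?:19|20)\d{2}\b",                                      # 2009
    ])
)


def extract_date(text):
    """Return the first date-like substring in text, or None."""
    m = DATE_RE.search(text)
    return m.group(0) if m else None


# Shortened versions of some strings from the question.
samples = [
    ".Got back to U.S. Jan 27, 1983.\n",
    ".On 21 Oct 1983 patient was discharged",
    "4-13-89 Communication with referring physician?: Not Done\n",
    ". Pt diagnosed in Apr 1976 after he presented with 2 month history",
    "09/14/2000 CPT Code: 90792: With medical services\n",
    ". Sep 2015- Transferred to Memorial Hospital",
    "Started in 2002 at CNM.",
]
found = [extract_date(s) for s in samples]
```

With pandas, the same compiled pattern could be applied per row, for example with `sample_series.apply(extract_date)`.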

Stata: Aggregating by week

I have a dataset that has a date variable with missing dates.
var1
15sep2014
15sep2014
17sep2014
18sep2014
22sep2014
22sep2014
22sep2014
29sep2014
06oct2014
I aggregated the data using this command.
gen week = week(var1)
and the results look like this
var1 week
15sep2014 37
15sep2014 37
17sep2014 38
18sep2014 38
22sep2014 38
I was wondering whether it would be possible to get the month name and year in the week variable.
In general, week() is part of the solution if and only if you define your weeks according to Stata's rules for weeks. They are
Week 1 of the year starts on January 1, regardless.
Week 2 of the year starts on January 8, regardless.
And so on, except that week 52 of the year includes 8 or 9 days, depending on
whether the year is leap or not.
Do you use these rules? I guess not. Then the simplest practice is to define a week by whichever day starts the week. If your weeks start on Sundays, then use the rule (dailydate - dow(dailydate)). If your weeks start on Mondays, ..., Saturdays, adjust the definition.
. clear
. input str9 svar1
svar1
1. "15sep2014"
2. "15sep2014"
3. "17sep2014"
4. "18sep2014"
5. "22sep2014"
6. "22sep2014"
7. "22sep2014"
8. "29sep2014"
9. "06oct2014"
10. end
. gen var1 = daily(svar1, "DMY")
. gen week = var1 - dow(var1)
. format week var1 %td
. list
+-----------------------------------+
| svar1 var1 week |
|-----------------------------------|
1. | 15sep2014 15sep2014 14sep2014 |
2. | 15sep2014 15sep2014 14sep2014 |
3. | 17sep2014 17sep2014 14sep2014 |
4. | 18sep2014 18sep2014 14sep2014 |
5. | 22sep2014 22sep2014 21sep2014 |
|-----------------------------------|
6. | 22sep2014 22sep2014 21sep2014 |
7. | 22sep2014 22sep2014 21sep2014 |
8. | 29sep2014 29sep2014 28sep2014 |
9. | 06oct2014 06oct2014 05oct2014 |
+-----------------------------------+
Much more discussion here, here and here, although the first should be sufficient.
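For readers who want to check the arithmetic outside Stata, the two definitions above can be sketched in Python (not Stata). `week_start` mirrors the "daily date minus day of week" rule, and `stata_style_week` mirrors the rule that week 1 starts on January 1 and week 52 absorbs the leftover days. Both function names are made up for this illustration.

```python
from datetime import date, timedelta


def week_start(d, first_day=6):
    """Return the date that starts d's week.

    first_day uses Python's weekday numbering (Monday=0 ... Sunday=6),
    so the default of 6 gives Sunday-based weeks.
    """
    offset = (d.weekday() - first_day) % 7
    return d - timedelta(days=offset)


def stata_style_week(d):
    """Week number when week 1 starts on Jan 1 and week 52 takes the extra days."""
    day_of_year = d.timetuple().tm_yday
    return min((day_of_year - 1) // 7 + 1, 52)


d = date(2014, 9, 17)
sunday = week_start(d)            # Sunday that starts the week of 17sep2014
label = sunday.strftime("%b %Y")  # month name and year of that week start
```

Labelling each week by the month and year of its starting day is one way to get a month name and year attached to the week, although a week that straddles two months is then labelled by the earlier month.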
Instead of using the week() function, I would probably use the wofd() function to transform your %td daily date into a %tw weekly date. Then you can just play with the datetime display formats to decide exactly how to format the date. For example:
gen date_weekly = wofd(var1)
format date_weekly %twww:_Mon_ccYY
That code should give you this:
var1 date_weekly
15sep2014 37: Sep 2014
15sep2014 37: Sep 2014
17sep2014 38: Sep 2014
18sep2014 38: Sep 2014
22sep2014 38: Sep 2014
This help file will be useful:
help datetime display formats
And if you want to brush up on the difference between %tw and %td dates, you might refresh yourself here:
help datetime

Find matches by condition between 2 datasets in SAS

I'm trying to improve the processing time used via an already existing for-loop in a *.jsl file my classmates and I are using in our programming course using SAS. My question: is there a PROC or sequence of statements that exist that SAS offers that can replicate a search and match condition? Or a way to go through unsorted files without going line by line looking for matching condition(s)?
Our current script is below:
if( roadNumber_Fuel[n]==roadNumber_TO[m] &
fuelDate[n]>=tripStart[m] & fuelDate[n]<=TripEnd[m],
newtripID[n] = tripID[m];
);
I have 2 sets of data simplified below.
DATA1:
ID1 Date1
1 May 1, 2012
2 Jun 4, 2013
3 Aug 5, 2013
..
.
&
DATA2:
ID2 Date2 Date3 TRIP_ID
1 Jan 1 2012 Feb 1 2012 9876
2 Sep 5 2013 Nov 3 2013 931
1 Dec 1 2012 Dec 3 2012 236
3 Mar 9 2013 May 3 2013 390
2 Jun 1 2013 Jun 9 2013 811
1 Apr 1 2012 May 5 2012 76
...
..
.
I need to check a lot of iterations but my goal is to have the code
check:
Data1.ID1 = Data2.ID2 AND (Date1 >Date2 and Date1 < Date3)
My desired output dataset would be
ID1 Date1 TRIP_ID
1 May 1, 2012 76
2 Jun 4, 2013 811
Thanks for any insight!
You can do range matches in two ways. First off, you can match using PROC SQL if you're familiar with SQL:
proc sql;
create table C as
select * from A
left join B
on A.id=B.id and A.date > B.date1 and A.date < B.date2
;
quit;
Second, you can create a format. This is usually the faster option if it's possible to do this. This is tricky when you have IDs, but you can do it.
First, create a new variable, ID+date. Dates are numbers around 18,000-20,000, so multiply your ID by 100,000 and you're safe.
Second, create a dataset from the range dataset where START=lower date plus id*100,000, END=higher date + id*100,000, FMTNAME=some string that will become the format name (must start with A-Z or _ and have A-Z, _, digits only). LABEL is the value you want to retrieve (Trip_ID in the above example).
data b_fmts;
set b;
start=id*100000+date1;
end =id*100000+date2;
label=value_you_want_out;
fmtname='MYDATEF';
run;
Then use PROC FORMAT with the CNTLIN= option to import the formats.
proc format cntlin=b_fmts;
quit;
Make sure your date ranges don't overlap - if they do this will fail.
Then you can use it easily:
data a_match;
set a;
trip_id=put(id*100000+date,MYDATEF.);
run;
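The idea behind both approaches, looking up the one range that contains a date instead of looping over every pair, can be sketched in Python (not SAS) to check the expected output. This is only an illustration, assuming ranges for the same ID do not overlap, as noted above, and using the strict inequalities from the question. `build_index` and `find_trip` are hypothetical names, and the rows are the sample data from the question.

```python
from bisect import bisect_left
from collections import defaultdict
from datetime import date


def build_index(trips):
    """Group (id, start, end, trip_id) rows by id, sorted by start date."""
    grouped = defaultdict(list)
    for id_, start, end, trip_id in trips:
        grouped[id_].append((start, end, trip_id))
    index = {}
    for id_, spans in grouped.items():
        spans.sort()
        index[id_] = ([s[0] for s in spans], spans)
    return index


def find_trip(index, id_, when):
    """Return the trip_id whose range strictly contains `when`, else None."""
    starts, spans = index.get(id_, ([], []))
    pos = bisect_left(starts, when) - 1  # last range starting before `when`
    if pos >= 0 and when < spans[pos][1]:
        return spans[pos][2]
    return None


trips = [
    (1, date(2012, 1, 1), date(2012, 2, 1), 9876),
    (2, date(2013, 9, 5), date(2013, 11, 3), 931),
    (1, date(2012, 12, 1), date(2012, 12, 3), 236),
    (3, date(2013, 3, 9), date(2013, 5, 3), 390),
    (2, date(2013, 6, 1), date(2013, 6, 9), 811),
    (1, date(2012, 4, 1), date(2012, 5, 5), 76),
]
fuel = [(1, date(2012, 5, 1)), (2, date(2013, 6, 4)), (3, date(2013, 8, 5))]

index = build_index(trips)
matches = [(id_, d, find_trip(index, id_, d)) for id_, d in fuel]
```

Each lookup is a binary search within one ID's ranges rather than a scan of the whole second dataset, which is the same saving the format approach is after.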