Filter a dataframe - python-2.7

I'm trying to filter a dataframe for a certain date in a column.
The column entries are timestamps and I try to construct a boolean vector from those,
checking for a certain date.
I tried:
filterfr = df[(df.expiration.month==6) & (df.expiration.day==22) & (df.expiration.year==2002)]
It doesn't work, because 'Series' object has no attribute 'month'.
How can this be done?

When you do df.expiration, you get back a Series where the items are the expiration datetimes.
Try comparing to an actual datetime.datetime object:
filterfr = df[df['expiration'] == datetime.datetime(2002, 6, 22)]
You may want to look into using a DatetimeIndex, depending on your dataset. This lets you use the convenient syntax
df['2002-06-22']
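For example, a minimal sketch of that approach (assuming 'expiration' holds datetime64 values; setting it as the index is an assumption about your workflow):
df = df.set_index(pd.DatetimeIndex(df['expiration']))
filterfr = df.loc['2002-06-22']  # partial string indexing selects that whole day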

To get access to the DatetimeIndex methods, you currently* have to wrap the column in a DatetimeIndex.
The fastest way is to access the day, month and year attributes (just like you attempted):
expir = pd.DatetimeIndex(df['expiration'])
(expir.day == 22) & (expir.month == 6) & (expir.year == 2002)
Alternative, but slower, ways are to use the normalize method (to bring the timestamps to the start of the day) or the date attribute:
pd.DatetimeIndex(df['expiration']).normalize() == datetime.datetime(2002, 6, 22)
pd.DatetimeIndex(df['expiration']).date == datetime.date(2002, 6, 22)
*In 0.15 there will be a dt attribute so that you can access these as:
expir = df['expiration']
expir.dt.day ...
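For instance, with the dt accessor the filter could be written roughly like this (a sketch assuming pandas >= 0.15 and a datetime64 'expiration' column):
expir = df['expiration']
mask = (expir.dt.year == 2002) & (expir.dt.month == 6) & (expir.dt.day == 22)
filterfr = df[mask]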

This
filterfr = df[df['expiration'] == datetime.datetime(2002, 6, 22)]
worked fine.
However, after doing some filtering, I got an error,
when trying to do filterfr.expiration[0]
or filterfr['expiration'][0]
to get the first element in the series.
KeyError: 0L is raised, although there are elements in the series.
The series looks like this:
Name: expiration, Length: 534668, dtype: datetime64[ns]
Shouldn't this actually always work?
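Not necessarily: filtering keeps the original index labels, so filterfr['expiration'][0] only works if some row still carries the label 0. A minimal sketch of positional access instead (illustrative, not from the original thread):
filterfr['expiration'].iloc[0]                     # first element by position
filterfr.reset_index(drop=True)['expiration'][0]   # or reset the index first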

Related

Convert a number column into a time format in Power BI

I'm looking for a way to convert a decimal number into a valid HH:mm:ss format.
I'm importing data from an SQL database.
One of the columns in my database is labelled Actual Start Time.
The values in my database are stored in the following decimal format:
73758 // which translates to 07:37:58
114436 // which translates to 11:44:36
I cannot simply convert this Actual Start Time column into a Time format in my Power BI import, as it returns errors for some values, saying it doesn't recognise 73758 as a valid 'time'; it needs to have a leading zero for cases such as 73758.
To combat this, I created a new Text column with the following code to append a leading zero:
Column = FORMAT([Actual Start Time], "000000")
This returns the following results:
073758
114436
-- which is perfect. Exactly what I needed.
I now want to convert these values into a Time.
Simply changing the data type field to Time doesn't do anything, returning:
Cannot convert value '073758' of type Text to type Date.
So I created another column with the following code:
Column 2 = FORMAT(TIME(LEFT([Column], 2), MID([Column], 3, 2), RIGHT([Column], 2)), "HH:mm:ss")
To pass the values 07, 37 and 58 into a TIME format.
This returns the following:
| Actual Start Time | Column | Column 2 |
|-------------------|--------|----------|
| 73758             | 073758 | 07:37:58 |
| 114436            | 114436 | 11:44:36 |
Which is what I wanted but is there any other way of doing this? I want to ideally do it in one step without creating additional columns.
You could use a variable as suggested by Aldert, or you can replace the reference to [Column] with the FORMAT function inline:
Time Format = FORMAT(
TIME(
LEFT(FORMAT([Actual Start Time],"000000"),2),
MID(FORMAT([Actual Start Time],"000000"),3,2),
RIGHT([Actual Start Time],2)),
"hh:mm:ss")
Edit:
If you want to do this in Power Query, you can create a custom column with the following calculation:
Time.FromText(
if Text.Length([Actual Start Time])=5 then Text.PadStart( [Actual Start Time],6,"0")
else [Actual Start Time])
Once this column is created you can drop the old column, so that you only have one time column in the data. Hope this helps.
I show the concept of variables on purpose, so you can use it in the future with more complex queries.
TimeC =
var timeStr = FORMAT([Actual Start Time], "000000")
return FORMAT(TIME(LEFT(timeStr, 2), MID(timeStr, 3, 2), RIGHT(timeStr, 2)), "HH:mm:ss")

Pandas: SettingWithCopyWarning, trying to understand how to write the code better, not just whether to ignore the warning

I am trying to change all date values in a spreadsheet's Date column where the year is earlier than 1900, to today's date, so I have a slice.
EDIT: previous lines of code:
df=pd.read_excel(filename)#,usecols=['NAME','DATE','EMAIL']
#regex to remove weird characters
df['DATE'] = df['DATE'].str.replace(r'[^a-zA-Z0-9\._/-]', '')
df['DATE'] = pd.to_datetime(df['DATE'])
sample row in dataframe: name, date, email
[u'Public, Jane Q.\xa0' u'01/01/2016\xa0' u'jqpublic@email.com\xa0']
This line of code works.
df["DATE"][df["DATE"].dt.year < 1900] = dt.datetime.today()
Then, all date values are formatted:
df["DATE"] = df["DATE"].map(lambda x: x.strftime("%m/%d/%y"))
But I get an error:
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
I have read the documentation and other posts, where using .loc is suggested
The following is the recommended solution:
df.loc[row_indexer,col_indexer] = value
but df["DATE"].loc[df["DATE"].dt.year < 1900] = dt.datetime.today() gives me the same error, except that the line number is actually the line number after the last line in the script.
I just don't understand what the documentation is trying to tell me as it relates to my example.
I started messing around with pulling out the slice and assigning to a separate dataframe, but then I'm going to have to bring them together again.
You are producing a view when you take df["DATE"] and then apply the selector [df["DATE"].dt.year < 1900] and try to assign to it.
df["DATE"][df["DATE"].dt.year < 1900] is the chained indexing that pandas is complaining about.
Fix it with loc like this:
df.loc[df.DATE.dt.year < 1900, "DATE"] = pd.datetime.today()
My thought would be that you could do
df.loc[df.DATE.dt.year < 1900, "DATE"] = dt.datetime.today()
df.loc[:, "DATE"] = df.DATE.map(lambda x: x.strftime("%m/%d/%y")
Not at a computer so I can't test but I think that should do it.
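A minimal, self-contained sketch of the .loc approach on toy data (the column name comes from the question; the sample values are illustrative):
import datetime as dt
import pandas as pd

df = pd.DataFrame({"DATE": pd.to_datetime(["01/01/1850", "01/01/2016"])})
df.loc[df["DATE"].dt.year < 1900, "DATE"] = dt.datetime.today()  # one .loc call, no chained indexing
df["DATE"] = df["DATE"].dt.strftime("%m/%d/%y")                  # format the whole column afterwards
print(df)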

Grab rows between two Datetime and avoid iterating

I use Pandas to retrieve a lot of Data via an SQL query (from Hive). I have a big DataFrame now:
market_pings = pandas.read_sql_query(query, engine)
market_pings['event_time'] = pandas.to_datetime(market_pings['event_time'])
I have calculated time delta periods: if something interesting happens within the timeline of these events in the market_pings DataFrame, I want the logs of that time interval only.
To grab DataFrame rows where a column has certain values there is a cool trick:
valuelist = ['value1', 'value2', 'value3']
df = df[df.column.isin(valuelist)]
Does anyone have an idea how to do this for time periods, so that I get the events of certain times from the market_pings DataFrame without direct Iteration (row by row)?
I can build a list of periods (1s accuracy) like:
2015-08-03 19:19:47
2015-08-03 19:20:00
But this means my valuelist becomes a tuple and I somehow have to compare dates.
You can create a list of timestamps as your value_list and do the operation you intend to.
time_list = [pd.Timestamp('2015-08-03 19:19:47'),pd.Timestamp('2015-08-03 19:20:00') ]
One thing about using between_time() is that the index has to be a datetime index;
if it is not, you can set one with set_index().
idx = pd.to_datetime(['2015-08-03 19:19:47', '2015-08-03 19:20:00',
                      '2015-08-03 19:19:48', '2015-08-03 21:20:00'])
mydf = pd.Series(np.random.randn(4), index=idx)
mydf
Out[123]:
2015-08-03 19:19:47 0.632509
2015-08-03 19:20:00 -0.234267
2015-08-03 19:19:48 0.159056
2015-08-03 21:20:00 -0.842017
dtype: float64
mydf.between_time(start_time=pd.Timestamp('2015-08-03 19:19:47'),
end_time=pd.Timestamp('2015-08-03 19:20:00'),include_end=False)
Out[124]:
2015-08-03 19:19:47 0.632509
2015-08-03 19:19:48 0.159056
dtype: float64
mydf.between_time(start_time=pd.Timestamp('2015-08-03 19:19:47'),
end_time=pd.Timestamp('2015-08-03 19:20:00'),
include_end=False,include_start=False)
Out[125]:
2015-08-03 19:19:48 0.159056
dtype: float64
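If the timestamps stay in the event_time column rather than in the index, a plain boolean mask over the full timestamps (date and time, not just the time of day) also avoids row-by-row iteration; a minimal sketch using the names from the question (assumes import pandas as pd):
start = pd.Timestamp('2015-08-03 19:19:47')
end = pd.Timestamp('2015-08-03 19:20:00')
interval_pings = market_pings[(market_pings['event_time'] >= start) & (market_pings['event_time'] < end)]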

What is the best way to populate a load file for a date lookup dimension table?

Informix 11.70.TC4:
I have an SQL dimension table which is used for looking up a date (pk_date) and returning another date (plus1, plus2 or plus3_months) to the client, depending on whether the user selects a "1", "2" or a "3".
The table schema is as follows:
TABLE date_lookup
(
pk_date DATE,
plus1_months DATE,
plus2_months DATE,
plus3_months DATE
);
UNIQUE INDEX on date_lookup(pk_date);
I have a load file (pipe delimited) containing dates from 01-28-2012 to 03-31-2014.
The following is an example of the load file:
01-28-2012|02-28-2012|03-28-2012|04-28-2012|
01-29-2012|02-29-2012|03-29-2012|04-29-2012|
01-30-2012|02-29-2012|03-30-2012|04-30-2012|
01-31-2012|02-29-2012|03-31-2012|04-30-2012|
...
03-31-2014|04-30-2014|05-31-2014|06-30-2014|
........................................................................................
EDIT: Sir Jonathan's SQL statement using DATE(pk_date + n UNITS MONTH) on 11.70.TC5 worked!
I generated a load file with pk_date's from 01-28-2012 to 12-31-2020, and plus1, plus2 & plus3_months NULL. Loaded this into date_lookup table, then executed the update statement below:
UPDATE date_lookup
SET plus1_months = DATE(pk_date + 1 UNITS MONTH),
plus2_months = DATE(pk_date + 2 UNITS MONTH),
plus3_months = DATE(pk_date + 3 UNITS MONTH);
Apparently, DATE() was able to convert pk_date to DATETIME, do the math with TC5's new algorithm, and return the result in DATE format!
.........................................................................................
The rules for this dimension table are:
If pk_date has 31 days in its month and plus1, plus2 or plus3_months only have 28, 29, or 30 days, then let plus1, plus2 or plus3 equal the last day of that month.
If pk_date has 30 days in its month and plus1, plus2 or plus3 has 28 or 29 days in its month, let them equal the last valid date of those months, and so on.
All other dates fall on the same day of the following month.
My question is: What is the best way to automatically generate pk_dates past 03-31-2014 following the above rules? Can I accomplish this with an SQL script, "sed", C program?
EDIT: I mentioned sed because I already have more than two years worth of data and
could perhaps model the rest after this data, or perhaps a tool like awk is better?
The best technique would be to upgrade to 11.70.TC5 (on 32-bit Windows; generally to 11.70.xC5 or later) and use an expression such as:
SELECT DATE(given_date + n UNITS MONTH)
FROM Wherever
...
The DATETIME code was modified between 11.70.xC4 and 11.70.xC5 to generate dates according to the rules you outline when the dates are as described and you use the + n UNITS MONTH or equivalent notation.
This obviates the need for a table at all. Clearly, though, all your clients would also have to be on 11.70.xC5 too.
Maybe you can update your development machine to 11.70.xC5 and then use this property to generate the data for the table on your development machine, and distribute the data to your clients.
If upgrading at least someone to 11.70.xC5 is not an option, then consider the Perl script suggestion.
Can it be done with SQL? Probably, but it would be excruciating. Ditto for C, and I think 'no' is the answer for sed.
However, a couple of dozen lines of perl seems to produce what you need:
#!/usr/bin/perl
use strict;
use warnings;
use DateTime;
my @dates;
# parse arguments
while (my $datep = shift){
my ($m,$d,$y) = split('-', $datep);
push(@dates, DateTime->new(year => $y, month => $m, day => $d))
|| die "Cannot parse date $!\n";
}
open(STDOUT, ">", "output.unl") || die "Unable to create output file.";
my ($date, $end) = @dates;
while( $date < $end ){
my @row = ($date->mdy('-')); # start with pk_date
for my $mth ( qw[ 1 2 3 ] ){
my $fut_d = $date->clone->add(months => $mth);
until (
($fut_d->month == $date->month + $mth
&& $fut_d->year == $date->year) ||
($fut_d->month == $date->month + $mth - 12
&& $fut_d->year > $date->year)
){
$fut_d->subtract(days => 1); # step back until criteria met
}
push(@row, $fut_d->mdy('-'));
}
print STDOUT join("|", #row, "\n");
$date->add(days => 1);
}
Save that as futuredates.pl, chmod +x it and execute like this:
$ ./futuredates.pl 04-01-2014 12-31-2020
That seems to do the trick for me.
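For comparison, the end-of-month clamping the rules describe is also short in Python using only the standard library; a minimal sketch (the helper name is illustrative, not part of the Perl answer above):
import calendar
import datetime

def add_months_clamped(d, n):
    # move n months forward, clamping the day to the length of the target month
    month_index = d.month - 1 + n
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return datetime.date(year, month, day)

print(add_months_clamped(datetime.date(2012, 1, 31), 1))  # 2012-02-29
print(add_months_clamped(datetime.date(2012, 1, 31), 3))  # 2012-04-30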

How to get Date between two Dates Django

I need to check whether any object exists for a given time interval. How can I do that? How can I translate this MySQL into Django:
SELECT *
FROM `event_event`
WHERE (startDate BETWEEN "2010-10-1" AND "2010-10-5")
OR (endDate BETWEEN "2010-10-1" AND "2010-10-5")
I am currently using
Event.objects.filter(Q(startDate__range=(datetime(2010,10,1),datetime(2010,10,5)))|Q(endDate__range=(datetime(2010,10,1),datetime(2010,10,5))))
But I am not getting any objects when I use the Django filter. Please suggest where I am wrong.
Do
print Event.objects.filter(Q(startDate__range=(datetime(2010,10,1),datetime(2010,10,5)))|Q(endDate__range=(datetime(2010,10,1),datetime(2010,10,5)))).query
And see what SQL it produces; it'll help you spot the differences.
Try this:
Event.objects.filter(Q(startDate__gte=datetime(2010, 10, 1), startDate__lte=datetime(2010, 10, 5)) | Q(endDate__gte=datetime(2010, 10, 1), endDate__lte=datetime(2010, 10, 5)))
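Putting it together, a minimal sketch of the range-based query with the needed imports (the Event model and field names are taken from the question):
from datetime import datetime
from django.db.models import Q

start, end = datetime(2010, 10, 1), datetime(2010, 10, 5)
events = Event.objects.filter(
    Q(startDate__range=(start, end)) | Q(endDate__range=(start, end))
)
print(events.query)  # inspect the generated SQL and compare it with the raw MySQL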