Create 8760 dataframe with hourly resolution timestamps - python-2.7

A bit of a naive question. I want to create a dataframe that spans a full calendar year in hourly resolution (8760 values). How can I manipulate the following lines from a tutorial to pull data from the previous year?
start = pd.Timestamp(datetime.date.today(), tz=time_zone) #used for testing
end = start + pd.Timedelta(days=365) #to get all day values
Essentially I want to replace today() with 1/1/2016, and then pull historical forecasted values for my analysis.

You can build start by subtracting a year from whatever your end date is:
date_str = '1/1/2016'
start = pd.to_datetime(date_str) - pd.Timedelta(days=365)
hourly_periods = 8760
drange = pd.date_range(start, periods=hourly_periods, freq='H')
Then when you're ready to make a data frame, set index=drange, e.g.:
# toy example data
data = list(range(len(drange)))
# create data frame with drange index
df = pd.DataFrame(data, index=drange)
See Pandas docs for date_range and Timedeltas for more.
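If you would rather anchor the range at 1/1/2016 directly (as the question asks) instead of subtracting from an end date, a minimal sketch looks like this; the time zone is an assumption, and note that 2016 is a leap year (8784 hours), so periods=8760 stops one day short of the year's end:
import datetime
import pandas as pd
time_zone = 'US/Eastern'  # assumed; use whatever tz the original start used
start = pd.Timestamp(datetime.date(2016, 1, 1), tz=time_zone)
# 8760 hourly timestamps starting at midnight on 1/1/2016
drange = pd.date_range(start, periods=8760, freq='H')
df = pd.DataFrame({'value': range(len(drange))}, index=drange)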

Related

Creating a normalized x axis in Power BI?

I am trying to create a plot which normalizes the x axis based on the date of an event listed in a different query. I have the production data sitting in one query, which contains the value per day. I have a list of events in a second query, which have specific dates at which they occurred. My goal is to create some measure or function that will plot each event on the axis in a normalized fashion, where the production from -1y to 1y around the event is shown on the x axis. This will allow a comparison of the last year of production before the event versus the next year, which can gauge the success of the event. I don't know how to do this successfully, and would appreciate any insight.
Current plot with event dates in table on the right
A person who did something similar merged the two queries and created a column that finds the date of the event and determines how much time has passed since it (30 days after the event is given 30). However, because I have about 70 data points which each have 4-5 of their own events, this merged query duplicated the production for each event, so there are 4-5 copies of the production data per data point, which has become hard to manage as I try to do something similar and understand what they're doing. I believe there's a better way to do this but I can't figure out how to connect the two queries in a more efficient way.
Normalized production from day 0 of each event that I'm trying to create
My code below is my attempt at making a measure that will create this normalized axis for me, but I am getting a circular dependency error.
Normalized Producing Days From Workover =
// These will reference the date values in the measures made of the workover start and end dates
VAR startofworkover = DATEVALUE ( [Measure: WKO Start Date] )
VAR endofworkover = DATEVALUE ( [Measure: WKO End Date] )
// This will filter to dates after the workover
VAR afterworkover =
    FILTER (
        'Production Data',
        'Production Data'[Date] >= endofworkover
            && 'Production Data'[Well Pair] = EARLIER ( 'Production Data'[Well Pair] )
            && [Allocated Oil Production (m3/d)] > 0
    )
// This will filter to dates before the workover
VAR beforeworkover =
    FILTER (
        'Production Data',
        'Production Data'[Date] < startofworkover
            && 'Production Data'[Well Pair] = EARLIER ( 'Production Data'[Well Pair] )
            && [Allocated Oil Production (m3/d)] > 0
    )
VAR result1 =
    SWITCH (
        TRUE (),
        'Production Data'[Date] = startofworkover, 0,
        'Production Data'[Date] >= endofworkover, RANKX ( afterworkover, 'Production Data'[Date],, ASC, Dense ),
        'Production Data'[Date] < startofworkover, RANKX ( beforeworkover, 'Production Data'[Date],, DESC, Dense ) * -1
    )
RETURN
    result1

Efficiently aggregate (filter/select) a large dataframe in a loop and create a new dataframe

I have one large dataframe that is created by importing a CSV file (Spark CSV). This dataframe has many rows of daily data. The data is identified by date, region, service_offered and count.
I filter for region and service_offered and aggregate the count (sum) and roll that up to month. Each time the filter is run in the loop, it selects a region, then a service_offered, and aggregates that.
If I append that to the df over and over, the big-O cost kicks in and it becomes very slow. There are 360 offices and about 5-10 services per office. How do I save a select/filter to a list first and append those before making the final dataframe?
I saw this post Using pandas .append within for loop but it only shows for list.a and list.b. What about 360 lists?
Here is my code that loops/aggregates the data
#imports needed for the snippet below
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
#spark session
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
#spark schema
schema = StructType([
    StructField('office', StringType(), True),
    StructField('service', StringType(), True),
    StructField('date', StringType(), True),
    StructField('count', IntegerType(), True)
])
#empty dataframe
office_summary = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)
office_summary.printSchema()
x = 1
try:
    for office in office_lookup:
        office = office[0]
        print(office_locations_count - x, " office(s) left")
        x = x + 1
        for transaction in service_lookup:
            transaction = transaction[0]
            #filter to one office/service pair, then aggregate the counts
            monthly_counts = source_data.filter((col("office").rlike(office)) & (col("service").rlike(transaction))).groupby("office", "service", "date").sum()
            #office_summary = office_summary.unionAll(monthly_counts)
except Exception as e:
    print(e)
I know there are issues with rlike returning more results than expected, but that is not a problem with the current data. The first 30% of the process is very quick and then it starts to slow down as expected.
How do I save a filter result to a list, append or join that over and over and finally create the final dataframe? This code does finish, but it should not take 30 minutes to run.
Thanks!
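For what it's worth, one common pattern for this is to collect each aggregated result in a plain Python list and do a single union at the end, rather than growing the dataframe inside the loop. A rough sketch reusing the names from the question (office_lookup, service_lookup and source_data are assumed to be defined as above):
from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql.functions import col
results = []  # appending to a Python list is cheap
for office in office_lookup:
    office = office[0]
    for transaction in service_lookup:
        transaction = transaction[0]
        # same filter/aggregate as before, but stored instead of unioned immediately
        monthly_counts = source_data.filter((col("office").rlike(office)) & (col("service").rlike(transaction))).groupby("office", "service", "date").sum()
        results.append(monthly_counts)
# one union over all the pieces instead of hundreds of incremental unionAll calls
office_summary = reduce(DataFrame.unionAll, results)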
With help from @Andrew, I was able to continue to use PySpark DataFrames to accomplish this. The command, scrubbed, looks like this:
from pyspark.sql.functions import sum  # Spark's sum, not Python's built-in
df2 = df1.groupby("Column1", "Column2", "Column3").agg(sum('COUNT'))
This allowed me to create a new dataframe based on df1 where grouping and aggregation are satisfied in one line. This command takes about 0.5 seconds to execute, versus the approach in the initial post.
Constantly creating a new dataframe while retaining the old ones in memory was the problem. This is the most efficient way to do it.
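For reference, applied to the names from the question, that single-pass pattern could look roughly like this; the month roll-up via to_date/date_format assumes the date strings are in yyyy-MM-dd format, which is not stated in the post:
from pyspark.sql.functions import col, to_date, date_format, sum as sum_
# aggregate every office and service in one pass; no Python loop needed
office_summary = (
    source_data
    .withColumn("month", date_format(to_date(col("date"), "yyyy-MM-dd"), "yyyy-MM"))  # assumed date format
    .groupby("office", "service", "month")
    .agg(sum_("count").alias("monthly_count"))
)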

How to select the last value of the day with DAX in Power BI

I am new to Power BI and stuck with an issue. I have my model as follows:
Date Dimension
Measurement Fact
The date column in Date Dimension is linked to measuredate in Measurement Fact.
Below is some sample data:
NB: In edit query, I have changed the type of measuredate to Date only.
I have tried the measure below but it doesn't work the way I want. It will sum all the values of the day but what I want is the last value of the day:
day_fuel_consumption =
CALCULATE (
    SUM ( measurement[measurementvalue] ),
    FILTER (
        measurement,
        measurement[metername] = "C-FUEL"
            && measurement[measuredate] = MAX ( measurement[measuredate] )
    )
)
My goal is to get 29242, i.e. the last value of the day. Remember that measuredate is a Date field and not a Datetime (I changed it to a Date field so that my Year and Month filters work correctly). I have changed the type in Edit Queries.
Changing your measure to use a variable could be the solution:
DFC =
VAR maxDate = MAX ( measurement[measuredate] )
RETURN
    CALCULATE (
        SUM ( measurement[measurementvalue] ),
        measurement[measuredate] = maxDate
    )
However, you should keep the datetime format for measuredate. If you don't want to see the time stamp, just change the format in Power BI. Otherwise Power BI will see two values with the max date and sum them, instead of taking the last one.
Well, if you want to avoid creating a measure, you could drag the fields you are filtering over to the visual filters pane. Click your visual, scroll a tiny bit, and you will see the section I am referring to. From there, just drag the field you are trying to filter; in this case, your value. Then select "Top N". It will allow you to select a top (number) or bottom (number) based on another field. Strangely enough, it does allow you to do top value by top value. It doesn't make sense when you say it out loud, but it works all the same.
This will show you the top values for whatever value field you are trying to use. As an added bonus, you can show how little or how many you want, on the fly.
As far as DAX goes, I'm afraid I am a little DAX illiterate compared to some other folks that may be able to help you.
I had to create two separate measures as shown below for this to work as I wanted:
max_measurement_id_cf = CALCULATE(MAX(measurement[measurementid]), FILTER(measurement, measurement[metername] = "C-FUEL"))
DFC =
var max_id_cf = [max_measurement_id_cf]
return
CALCULATE(SUM(measurement[measurementvalue]), measurement[measurementid] = max_id_cf)

Trying to get the time delta between two date columns in a dataframe, where one column can possibly be "NaT"

I am trying to get the difference in days between two dates pulled from a SQL database. One is the start date, the other is a completed date. The completed date can be, and in this case is, a NaT value. Essentially I'd like to iterate through each row and take the difference; if the completed date is NaT, I'd like to skip it or assign a NaN value, in a completely new delta column. The code below is giving me this error: 'member_descriptor' object is not callable
for n in df.FINAL_DATE:
    df.DELTA = [0]
    if n is None:
        df.DELTA = None
        break
    else:
        df.DELTA = datetime.timedelta.days(df['FINAL_DATE'], df['START_DATE'])
You don't actually need to loop or pre-check for NaT. Subtracting the two datetime columns gives a timedelta column, and the .dt.days accessor extracts the day counts; rows where FINAL_DATE is NaT simply come out as NaN:
df['DELTA'] = (df['FINAL_DATE'] - df['START_DATE']).dt.days
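A minimal sketch with made-up dates (the column names match the question; the data is invented purely for illustration):
import pandas as pd
df = pd.DataFrame({
    'START_DATE': pd.to_datetime(['2021-01-01', '2021-02-01', '2021-03-01']),
    'FINAL_DATE': pd.to_datetime(['2021-01-15', None, '2021-03-20']),  # None becomes NaT
})
# NaT rows propagate to NaN, so no per-row check or loop is needed
df['DELTA'] = (df['FINAL_DATE'] - df['START_DATE']).dt.days
# DELTA is now 14.0, NaN, 19.0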

Creation of formatted DataFrame and then adding data line by line

I have a continuous stream of data coming in, so I want to define the DataFrame beforehand so that I don't have to tell pandas to format the data or set the index.
So I want to create a DataFrame like
df = pd.DataFrame(columns=["timestamp","stockname","price","volume"])
but I want to tell pandas that the index of the data should be timestamp and that its format would be
"%Y-%m-%d %H:%M:%S:%f"
Once this is set, I would read through the file and pass the data to the initialized DataFrame.
I get data in variables like these, populated on each pass of the loop:
for line in filehandle:
    timestamp, stockname, price, volume = fetch(line)
Here I want to update df. This update would go on while I keep using a copy of df, say tempdf, to do re-sampling or any other task at any given point in time, because the original dataframe df is getting updated continuously.
import numpy as np
import pandas as pd
import datetime as dt
import time
# create df with timestamp as index
df = pd.DataFrame(columns=["timestamp", "stockname", "price", "volume"], dtype=float)
df['timestamp'] = pd.to_datetime(df['timestamp'], format="%Y-%m-%d %H:%M:%S:%f")  # assign back, or the conversion is discarded
df.set_index('timestamp', inplace=True)
for i in range(10):  # for the purposes of functioning demo code
    i += 1  # counter
    time.sleep(0.01)  # give jupyter notebook a moment
    timestamp = dt.datetime.now()  # to be used as index
    df.loc[timestamp] = ['AAPL', np.random.randint(1000), np.random.randint(10)]  # replace with your database read
tempdf = df.copy()  # snapshot to use for resampling or other analysis while df keeps growing
If you are reading a file or database continuously, you can replace the for loop with what you described in your question. @MattR's questions should also be addressed; if you need to continuously log or update data, I am not sure pandas is the best solution.
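As a quick illustration of using the snapshot, tempdf can be resampled as usual because its index is a DatetimeIndex; the one-second bucket here is just an example:
# resample the snapshot without touching the continuously updated df
per_second = tempdf['price'].astype(float).resample('1S').mean()  # average price per second
print(per_second.head())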