Since I couldn't find this topic anywhere else, I will ask it here. I am reading data from a CSV file that has datetimes in one of its columns. I load that column with pandas, and then I need to count occurrences in specific time slots and plot the counts with matplotlib. Below you can see an example of the column.
Time and Date
0 2015-08-21 10:51:06.398000
1 2015-08-21 10:51:00.017000
2 2015-08-21 10:52:06.402000
3 2015-08-21 10:54:06.407000
...
I know how I can split time like so:
pd.date_range("10:50", "12:30", freq="1min").time
But how can I assign occurrences of my read values from CSV and then plot it? Any advice or direction would help.
It's hard to tell exactly what you want since you haven't posted the desired output, but if I understand correctly, you want to count the number of rows in time intervals of a certain length. You can do this with resample. To use resample, first set the index to 'Time and Date':
df = df.set_index('Time and Date', drop=False)
Note that drop=False is only needed if you still want 'Time and Date' available as a regular column; otherwise you can omit it.
Then to get the number of rows in each 1-minute interval, do
counts = df.resample('1min').size()
If there are multiple dates and you want to sum the counts for each time slot across dates, do
counts.groupby(lambda ts: ts.time()).sum()
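Putting it together, here is a minimal runnable sketch using the sample timestamps from the question (the plot styling choices are just an illustration):
import pandas as pd
import matplotlib.pyplot as plt

# Sample data matching the question's column
df = pd.DataFrame({
    'Time and Date': pd.to_datetime([
        '2015-08-21 10:51:06.398000',
        '2015-08-21 10:51:00.017000',
        '2015-08-21 10:52:06.402000',
        '2015-08-21 10:54:06.407000',
    ])
})

# Index by the datetime column so resample can bin rows by time
df = df.set_index('Time and Date', drop=False)

# Count rows per 1-minute slot and plot the counts
counts = df.resample('1min').size()
counts.plot(kind='bar')
plt.ylabel('Occurrences')
plt.show()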
I use Oracle APEX (v22.1) and created a (line) chart on a page, but I have the following problem with the visualization:
On the y-axis it is not possible to show the values in the format 'hh:mi', and I need help with this.
Details for the axis:
x-axis: A date column represented as a string: to_char(time2, 'YYYY-MM')
y-axis: Two date columns and the average of the difference will be calculated: AVG(time2 - time1); the date time2 is the same as the date in the x-axis.
So I have the following SQL query for the visualization of the series:
SELECT to_char(time2, 'YYYY-MM') AS YEAR_MONTH, --x-axis
       AVG(time2 - time1) AS AVERAGE_VALUE      --y-axis
FROM users
GROUP BY to_char(time2, 'YYYY-MM')
ORDER BY to_char(time2, 'YYYY-MM')
One more note in case this has to be solved a different way: I am not familiar with JavaScript, should that be the only route. I am new to APEX, but I have seen in various tutorials that you can use JS. So if JS is the only solution, I would be happy to get a short description of what I must do on the page.
(I don't know if this point is important for this case: The values time1 and time2 are updated daily.)
In the chart attributes I enabled 'Time Axis Type' under Settings.
On the y-axis I changed the format to "Time - Short" and tried different patterns like ##:##, but then every value, including the y-axis labels, showed '01:00', even though the line itself was drawn correctly. When I change the format to Decimal, the values are shown correctly and match the line chart.
I also tried the EXTRACT function for the value, like 'EXTRACT(HOUR FROM AVG(time2 - time1)) || ':' || EXTRACT(MINUTE FROM AVG(time2 - time1))', but in this case I get an error message.
So where is my mistake, or is this more difficult to solve?
ROUND(TRUNC(avg(time2 - time1)/60) + mod(avg(time2 - time1),60)/100, 2) AS Y
will get you close to what you want. You can set the Y-axis minimum to 0 and maximum to 24;
then 12.23 means 12 hours and 23 minutes.
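To see how the encoding works, suppose the average difference comes out to 743 minutes (assuming, as the formula does, that time2 - time1 is expressed in minutes): TRUNC(743/60) = 12 hours and MOD(743, 60) = 23 minutes, so the expression returns 12 + 23/100 = 12.23, i.e. 12 hours and 23 minutes.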
I have a pandas dataframe of timeseries weights for over 100 scales, identified by "short_id". I am having trouble figuring out the best way to apply a moving filter to each scale's weight data to remove outliers.
Here is a sample of the data:
Out[159]:
published_at short_id weight
0 2017-11-08 16:03:36 INT16 50.35
1 2017-11-08 16:02:43 INT1 45.71
2 2017-11-08 16:02:10 NOT11 35.52
3 2017-11-08 16:01:07 INT7 50.03
4 2017-11-08 16:00:23 INT3 47.04
Converting the dataframe into a dictionary keyed by "short_id" and applying the moving filter per dict item did not work out, nor did converting the data from "long" to "wide" format (using pandas.pivot_table).
It seems like it should be possible in one line by using groupby and then applying the rolling function:
df['MovingFilt'] = df.groupby('short_id')['weight'].apply(pd.rolling(6).median())
but I receive an error: TypeError: incompatible index of inserted column with frame index. This is because sometimes there is weight data at the same timestamp for certain scales, though not usually.
Is this the best way to approach the problem? Creating a new dataframe per 'short_id' and then using the following seems not Pythonic enough, although it runs fine:
INT16['MovingFilt'] = pd.Series.rolling(INT16['weight'], window=6, center=True).median()
The error occurs because the groupby is written incorrectly; apply is not needed here:
df['MovingFilt'] = df.groupby('short_id')['weight'].rolling(6).median().values
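One caveat: .values pastes the result back purely by position, which is only safe if the frame is already sorted by short_id. Here is a minimal index-aligned sketch (the toy weights are made up; the question's data would use window=6 and center=True as in the per-scale version above):
import pandas as pd

# Toy frame in the shape of the question's data (values are made up)
df = pd.DataFrame({
    'short_id': ['INT16', 'INT16', 'INT16', 'INT1', 'INT1', 'INT1'],
    'weight':   [50.35, 51.10, 49.80, 45.71, 46.02, 45.50],
})

# reset_index(level=0, drop=True) removes the short_id level so the result
# aligns with df's original index instead of relying on row order
df['MovingFilt'] = (
    df.groupby('short_id')['weight']
      .rolling(3, center=True)   # window=3 just so this toy data shows values
      .median()
      .reset_index(level=0, drop=True)
)
print(df)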
I have the below-mentioned dataset.
https://docs.google.com/spreadsheets/d/13GCAXHp5BU4vYU6PdX40wM-Jhp--LeRd9C5oUurbVY4/edit#gid=0
I want to find the cumulative sales values for the different stores in one column. For example, for store 2106 the cumulative sales figure should be 176,849.
I'm using the following:
df = df.groupby('storenumber')['sales'].cumsum()
but I am not getting the correct result.
Can someone help?
Here's what I did to solve this problem.
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv') # get data frame from csv file
You won't be able to run numerical operations on your data as-is, because the Sale (Dollars) column in df is not a numerical type. The following code strips the dollar signs and separating commas from the Sale (Dollars) and Suggested answer columns and converts them to float:
df[df.columns[2:]] = df[df.columns[2:]].replace(r'[\$,]', '', regex=True).astype(float)
Then, I used the following bit of code to get the cumulative value for each unique Store Number.
cum_sales_by_store_number = df.groupby('Store Number')['Sale (Dollars)'].agg(np.sum)
cum_sales_by_store_number = pd.DataFrame(cum_sales_by_store_number)
Output for cum_sales_by_store_number:
Sale (Dollars)
Store Number
2106 176849.97
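As a side note on the original attempt: df.groupby('storenumber')['sales'].cumsum() does produce a per-store running total, but assigning it back to df replaces the whole frame with a single Series. If a row-by-row cumulative column is what you want, here is a minimal sketch using the cleaned-up frame from above (the new column name is just a suggestion):
# Keep the frame intact and store the running total in a new column
df['Cumulative Sales'] = df.groupby('Store Number')['Sale (Dollars)'].cumsum()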
I hope this answers your question. Happy coding!
I am working on a real estate cash-flow simulation.
What I want in the end is a time series where, every day, I report whether the property is vacant or leased and whether I collected rent.
In my present code, I first create a profit array with values of "Leased", "Vacant", or "Today you collected rent of $1000", and I use it to create my time series:
rng=pd.date_range('6/1/2016', periods=len(profit), freq='D')
ts=pd.Series(profit, index=rng)
To simplify, I assumed I collected rent every 30 days. Now I want to be more specific and collect it every 5th day of the month (for example) and be flexible on the day the next tenant will move in.
Do you know commands or a good source where I can learn how to iterate from month to month?
Any help would be appreciated
You can build a sequence of dates using date_range and .shift() (freq='M' gives month-end dates; the old pd.datetools.day spelling has been removed from pandas, so pass freq='D' to shift instead) like so:
date_sequence = pd.date_range(start, end, freq='M').shift(num_of_days, freq='D')
and then use this sequence to select dates from the DateTimeIndex using
df.loc[date_sequence, 'column_name'] = value
Alternatively, you can use pd.DateOffset() like so:
ts = pd.date_range(start='2015-06-01', end='2015-12-01', freq='MS')
DatetimeIndex(['2015-06-01', '2015-07-01', '2015-08-01', '2015-09-01',
'2015-10-01', '2015-11-01', '2015-12-01'],
dtype='datetime64[ns]', freq='MS')
Now add 5 days:
ts + pd.DateOffset(days=5)
to get:
DatetimeIndex(['2015-06-06', '2015-07-06', '2015-08-06', '2015-09-06',
'2015-10-06', '2015-11-06', '2015-12-06'],
dtype='datetime64[ns]', freq=None)
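To tie this back to the simulation, here is a minimal sketch that marks rent collection on the 5th of each month within a daily series (the series values and the one-year horizon are illustrative assumptions):
import pandas as pd

# Daily simulation range, as in the question
rng = pd.date_range('6/1/2016', periods=365, freq='D')

# Rent days: month starts shifted by 4 days land on the 5th of each month
rent_days = pd.date_range(rng[0], rng[-1], freq='MS') + pd.DateOffset(days=4)

# Default state everywhere, then overwrite the collection days
ts = pd.Series('Leased', index=rng)
ts[ts.index.isin(rent_days)] = 'Today you collected rent of $1000'
print(ts['2016-07-01':'2016-07-06'])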
I am very new to programming and am working with Python. For a work project I am trying to read several .csv files, convert them to data frames, concatenate some of the fields into one string for a column header, and then append all of the dataframes into one big DataFrame. I have searched extensively on Stack Overflow as well as in other resources, but I have not been able to find an answer. Here is the code I have thus far, along with some abbreviated output:
import pandas as pd
import glob
# Read a directory of files to a list
csvlist = []
for f in glob.glob("AssayCerts/*"):
    csvlist.append(f)
csvlist
['AssayCerts/CH09051590.csv', 'AssayCerts/CH09051591.csv', 'AssayCerts/CH14158806.csv', 'AssayCerts/CH14162453.csv', 'AssayCerts/CH14186004.csv']
# Read .csv files and convert to DataFrames
dflist = []
for csv in csvlist:
    df = pd.read_csv(filename, header = None, skiprows = 7)
    dflist.append(df)
dflist
[ 0 1 2 3 4 5 \
0 NaN Au-AA23 ME-ICP41 ME-ICP41 ME-ICP41 ME-ICP41
1 SAMPLE Au Ag Al As B
2 DESCRIPTION ppm ppm % ppm ppm
# Concatenates the cells in the first three rows of the last dataframe; need to apply this to all of the dataframes.
for df in dflist:
    column_names = df.apply(lambda x: str(x[1]) + '-'+str(x[2])+' - '+str(x[0]),axis=0)
column_names
0 SAMPLE-DESCRIPTION - nan
1 Au-ppm - Au-AA23
2 Ag-ppm - ME-ICP41
3 Al-% - ME-ICP41
I am unable to apply the last operation across all of the DataFrames. It seems I can only get it to apply to the last DataFrame in my list. Once I get past this point I will have to append all of the DataFrames to form one large DataFrame.
As Andy Hayden mentions in his comment, the reason your loop only appears to work on the last DataFrame is that you just keep assigning the result of df.apply( ... ) to column_names, which gets written over each time. So at the end of the loop, column_names always contains the results from the last DataFrame in the list.
But you also have some other problems in your code. In the loop that begins for csv in csvlist:, you never actually reference csv - you just reference filename, which doesn't appear to be defined. And dflist just appears to have one DataFrame in it anyway.
As written in your problem, the code doesn't appear to work. I'd advise posting the real code that you're using, and only what's relevant to your problem (i.e. if building csvlist is working for you, then you don't need to show it to us).
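For reference, here is a minimal sketch of the corrected loop, assuming the file layout shown above (building each frame's column names from its own first three rows, then concatenating; dropping those three rows afterwards is one reasonable choice, not something the question specifies):
import glob
import pandas as pd

dflist = []
for csv in glob.glob("AssayCerts/*"):
    # Reference the loop variable, not the undefined 'filename'
    df = pd.read_csv(csv, header=None, skiprows=7)
    # Build column names from the first three rows of *this* frame
    df.columns = df.apply(
        lambda x: str(x[1]) + '-' + str(x[2]) + ' - ' + str(x[0]), axis=0)
    # Drop the three header rows now encoded in the column names
    df = df.iloc[3:]
    dflist.append(df)

# Append all of the DataFrames into one big DataFrame
big_df = pd.concat(dflist, ignore_index=True)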