I am trying to grab the current date, formatted, and create a variable from it using the command:
gen %tdCY-N-D final_dayinpt = date(c(current_date), "DMY")
However, I am getting an error
%tdCY invalid name
r(198);
If I display this at the Stata command line it works:
. display %tdCY-N-D date(c(current_date), "DMY")
2020-10-27
How can I create this formatted variable?
Solution:
set obs 10 // so the example works
generate final_dayinpt = date(c(current_date), "DMY")
format final_dayinpt %tdCY-N-D
The syntax you're trying is tempting because the format you use for display seems analogous to things like generate byte bytevar = 1, but, as you found, the analogy doesn't hold here.
Note that you are passing format information where type is expected based on the generate syntax (help generate):
generate [type] newvar[:lblname] =exp [if] [in] [, before(varname) | after(varname)]
While consulting help generate and help display are helpful, help datetime is also very useful here (and was never obvious to me).
See here for a more thorough treatment of working with dates in Stata.
Edit:
An alternative suggested (and made possible) by Nick Cox:
ssc install numdate // install the package which Nick wrote
generate current_date = c(current_date) // numdate takes a varlist
numdate daily final_dayinpt = current_date, pattern(DMY) format(%tdCY-N-D)
I have the below-mentioned dataset.
https://docs.google.com/spreadsheets/d/13GCAXHp5BU4vYU6PdX40wM-Jhp--LeRd9C5oUurbVY4/edit#gid=0
I want to find the cumulative sales values for the different stores in one column. For example, for store 2106 the cumulative sales figure should be 176,849.
I'm using the following function
df = df.groupby('storenumber')['sales'].cumsum() but I am not getting the correct result.
Can someone help?
Here's what I did to solve this problem.
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv') # get data frame from csv file
You won't be able to run numerical operations on your data as it is, because the Sale (Dollars) column in df is not stored as a numerical type. The following piece of code removes the dollar signs and thousands separators from the Sale (Dollars) and Suggested answer columns and converts them to type float.
df[df.columns[2:]] = df[df.columns[2:]].replace(r'[\$,]', '', regex=True).astype(float)
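As a small self-contained sketch of the same cleaning step, the snippet below builds a made-up frame (the rows are hypothetical and only stand in for data.csv) and strips the currency formatting from one column before converting it to float.

```python
import pandas as pd

# Hypothetical rows standing in for data.csv; only the column names follow the question.
df = pd.DataFrame({
    "Store Number": [2106, 2106, 2113],
    "Sale (Dollars)": ["$1,234.50", "$20.00", "$7.25"],
})

# Remove "$" and "," from each string, then convert the column to float.
df["Sale (Dollars)"] = (
    df["Sale (Dollars)"].str.replace(r"[\$,]", "", regex=True).astype(float)
)

print(df.dtypes)
```

Passing regex=True explicitly avoids depending on the default, which has differed between pandas versions.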
Then, I used the following bit of code to get the cumulative value for each unique Store Number.
cum_sales_by_store_number = df.groupby('Store Number')['Sale (Dollars)'].sum()
cum_sales_by_store_number = pd.DataFrame(cum_sales_by_store_number)
Output for cum_sales_by_store_number:
Sale (Dollars)
Store Number
2106 176849.97
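It may help to see the difference between a per-store total and a running total. The sketch below uses made-up numbers (not the spreadsheet from the question): sum() gives one row per store, while cumsum() gives one value per original row, which is why assigning its result back to df replaces the whole frame with a single Series.

```python
import pandas as pd

# Hypothetical data; the column names follow the answer above.
df = pd.DataFrame({
    "Store Number": [2106, 2106, 2113, 2113],
    "Sale (Dollars)": [100.0, 50.0, 10.0, 5.0],
})

# One total per store (indexed by Store Number).
totals = df.groupby("Store Number")["Sale (Dollars)"].sum()

# Running total within each store, aligned with the original rows.
df["Running Total"] = df.groupby("Store Number")["Sale (Dollars)"].cumsum()

print(totals)
print(df)
```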
I hope this answers your question. Happy coding!
I have a script that processes an Excel file. The department that sends it has a system that generated it, and my script stopped working.
I suddenly got the error Can only use .str accessor with string values, which use np.object_ dtype in pandas for the following line of code:
df['DATE'] = df['Date'].str.replace(r'[^a-zA-Z0-9\._/-]', '')
I checked the type of the date columns in the file from the old system (dtype: object) vs the file from the new system (dtype: datetime64[ns]).
How do I change the date format to something my script will understand?
I saw this answer but my knowledge about date formats isn't this granular.
You can use apply function on the dataframe column to convert the necessary column to String. For example:
df['DATE'] = df['Date'].apply(lambda x: x.strftime('%Y-%m-%d'))
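No separate import of the datetime module is needed for this: each value in a datetime64 column is a pandas Timestamp, which has its own strftime method.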
apply() will take each cell at a time for evaluation and apply the formatting as specified in the lambda function.
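A vectorised variant of the same idea, shown here as a sketch on a made-up two-row frame, is the .dt.strftime accessor, which formats the whole column without a Python-level lambda.

```python
import pandas as pd

# Hypothetical frame mimicking the new system's output (datetime64[ns] column).
df = pd.DataFrame({"Date": pd.to_datetime(["2020-10-27", "2020-11-03"])})

# Format every value in the column as a 'YYYY-MM-DD' string.
df["DATE"] = df["Date"].dt.strftime("%Y-%m-%d")

print(df["DATE"].tolist())
```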
pd.to_datetime returns a Series of datetime64 dtype, as described here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html
df['DATE'] = df['Date'].dt.date
or this:
df['Date'].map(datetime.datetime.date)
You can use pd.to_datetime
df['DATE'] = pd.to_datetime(df['DATE'])
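If the script might still receive files from either system, one option is to check the dtype first and only format when the column is already datetime. This is a sketch; the helper name date_as_text is made up for illustration.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def date_as_text(col: pd.Series) -> pd.Series:
    """Return datetime columns as 'YYYY-MM-DD' strings; cast anything else to str."""
    if is_datetime64_any_dtype(col):
        return col.dt.strftime("%Y-%m-%d")
    return col.astype(str)


# Hypothetical examples of the two file variants described in the question.
old_style = pd.Series(["2020-10-27"])                  # dtype: object
new_style = pd.to_datetime(pd.Series(["2020-10-27"]))  # dtype: datetime64[ns]

print(date_as_text(old_style).tolist())
print(date_as_text(new_style).tolist())
```

After this step the column holds strings in both cases, so a later .str call no longer raises the accessor error.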
I wonder if you could help me figure out a fairly simple question: how can I save a statistic extracted from a regression in a separate dataset (or file), and later add statistics from other regressions to it?
For example, the statistic from one regression can be extracted as e(f) and from another one is also as e(f).
Roger Newson's parmest is great for dealing with "resultssets," which are Stata datasets created from the output of a Stata command. The help file has a nice example of combining three regressions into a single file that I modified here to include R^2 [stored in e(r2)]:
sysuse auto, clear
tempfile tf1 tf2 tf3
parmby "reg price weight", lab saving(`"`tf1'"', replace) idnum(1) idstr(M1) escal(r2)
parmby "reg price foreign", lab saving(`"`tf2'"', replace) idnum(2) idstr(M2) escal(r2)
parmby "reg price weight foreign", lab saving(`"`tf3'"', replace) idnum(3) idstr(M3) escal(r2)
drop _all
append using `"`tf1'"' `"`tf2'"' `"`tf3'"'
list idnum idstr es_1, noobs nodis
Hi I'm trying to create a pie chart that has a lot of slices. For some reason I get an error when running this code.
My code
graph pie ccounter if year==1900 & ccounter>100 & labforce==2, over(occ1950)
and I get this error
(note: areastyle p193pie not found in scheme, default attributes used)
(note: areastyle p194pie not found in scheme, default attributes used)
(note: areastyle p195pie not found in scheme, default attributes used)
(note: areastyle p196pie not found in scheme, default attributes used)
option min() incorrectly specified
Note that the variable occ1950 has more than 100 values. I don't know whether this is what is causing the problem.
Extra Information
I use this code to create the variable ccounter
bys mcdstr year occ: gen counter=_n
bys mcdstr year occ: egen ccounter=max(counter)
I used this to calculate the number of people working in each industry by year and location.
The problem is that the variable occ1950 has too many unique values. Let us examine the problem using a CSV dataset of only 40 countries.
country,fdi
Afghanistan,141.391
Algeria,541.478
Angola,238.637
Antigua and Barbuda,1.653
Argentina,205.691
Bahamas,21.927
Bahrain,1.317
Bangladesh,50.298
Barbados,2.816
"Bolivia, Plurinational State of",41.572
Botswana,87.649
Brazil,455.5649999999999
British Virgin Islands,12387.568
Brunei Darussalam,21.02
Cambodia,672.6800000000001
Cameroon,30.159
Cape Verde,3.783
Cayman Islands,15323.116
Chile,53.149
Colombia,49.047
Congo,112.104
"Congo, Democratic Rep. of",302.505
Costa Rica,.826
Côte d' Ivoire,27.099
Dominican Republic,.112
Ecuador,93.673
Egypt,191.59
Equatorial Guinea,80.21700000000001
Eritrea,16.269
Ethiopia,205.824
Fiji,38.742
Gabon,76.66500000000001
Ghana,129.068
Guinea,97.413
Guyana,79.899
Honduras,2.6
"Hong Kong, China",124987.422
India,291.567
Indonesia,850.0709999999999
"Iran, Islamic Republic of",480.71
After loading this into Stata 14, we can observe that
graph pie fdi, over(country)
produces the error: option min() incorrectly specified.
If we now reduce the dataset to only 30 countries with drop if _n > 30, we are able to get a pie chart.
This suggests that you should collapse your data, take the n categories with the largest ccounter and then classify the other classes as "other".
The magic number is 36. So you can have at most 36 unique categories in your pie chart.
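The "keep the largest categories and lump the rest" step is not specific to Stata. As a language-neutral illustration, here is a minimal Python sketch of that logic; the function name and the cut-off of 3 are made up for the example, and the values are five rows taken (rounded) from the sample data above.

```python
def lump_small_categories(values, max_slices, other_label="Other"):
    """Keep the (max_slices - 1) largest categories and sum the rest under other_label."""
    ranked = sorted(values.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) <= max_slices:
        return dict(ranked)  # already few enough categories
    keep = ranked[: max_slices - 1]
    rest = ranked[max_slices - 1:]
    lumped = dict(keep)
    lumped[other_label] = sum(v for _, v in rest)
    return lumped


# A few rows from the sample data above (values rounded).
fdi = {
    "Hong Kong, China": 124987.422,
    "Cayman Islands": 15323.116,
    "British Virgin Islands": 12387.568,
    "Indonesia": 850.071,
    "Cambodia": 672.68,
}

top = lump_small_categories(fdi, max_slices=3)
print(top)
```

The result has at most max_slices entries, so the same idea with a suitable cut-off keeps the chart within the category limit.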
I have a data set that looks as follows in a CSV file:
Date Sample
01-AUG-09 Sample 1
02-Aug-09 Sample 2
etc...
When I use Pandas, I read in the file with the following code:
in_file = pd.read_csv('File Name.csv', parse_dates = True)
However, it is not recognizing the date column properly. Does anybody know if the Pandas date parser can recognize dates that are in DD-MMM-YY format?
The following worked for me
I suspect yours is probably much simpler to parse because it may be tab separated? (I did an exact-width parse, which is not trivial.)
In [41]: df = pd.read_fwf(StringIO(data),widths=[9,13],parse_dates=True,index_col=0,names=['sample'],header=None,skiprows=1)
In [42]: df
Out[42]:
sample
2009-08-01 Sample 1
2009-08-02 Sample 2
Tab separated is much simpler
In [43]: data2 = """Data\tSample\n01-AUG-09\tSample 1\n02-Aug-09\tSample 2\n"""
In [44]: read_csv(StringIO(data2),sep='\t',parse_dates=True,index_col=0)
Out[44]:
Sample
Data
2009-08-01 Sample 1
2009-08-02 Sample 2
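If automatic parsing does not pick the column up, another option is to read it as text and convert it with an explicit format string. The sketch below assumes a comma-separated file with a header row (the question does not say which delimiter is used) and an English locale for the month abbreviations.

```python
from io import StringIO

import pandas as pd

# Hypothetical comma-separated version of the sample in the question.
data = "Date,Sample\n01-AUG-09,Sample 1\n02-Aug-09,Sample 2\n"

df = pd.read_csv(StringIO(data))

# %d = day, %b = abbreviated month name, %y = two-digit year.
df["Date"] = pd.to_datetime(df["Date"], format="%d-%b-%y")

print(df)
```

Giving the format explicitly removes any day/month guessing and fails loudly if a row does not match.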