I have a csv file with two identical columns:
X,X
0,0
1,1
2,2
I would like to import this into Stata 13, but it does not like importing the second X (since the names are the same):
. import delimited "filename.csv"
X already defined
Error creating variables
r(109);
Is there a simple way to force the import?
I do not want to specify the columns to import. The actual dataset has 100+ variables, and the duplicated variables are distributed throughout.
Similarly, I do not want to manually rename the variables.
I am fine if Stata wants to either drop or rename the second X.
As background, this csv file is being generated by some sloppy SQL code. The duplicated variables are precisely the variables I use for the joins. I could clean up the SQL code or pre-clean (with e.g. Python), but I would ideally like to have Stata force the import.
Try insheet.
With this example data in a .csv file:
x,x,y,y
238965,586,127,192864
238965,586,127,192864
1074,198264,5186,2947
1074,198264,5186,2947
All variables are imported and the resulting names in Stata are:
x
v2
y
v4
The command would be:
insheet using "~/some/file.csv"
(I'm on Stata 12.1; according to the Stata 13 [U] manual, p. 21, insheet is superseded by import delimited.)
import delimited was patched for this particular problem in the 07oct2013 update. To update Stata 13 type...
. update all
in the Stata Command window.
I imported an SPSS (.sav) file into SAS. Several of the variables are not showing up because they are named things like 'variable___1.1'. When I try to KEEP certain variables in a DATA step, I get an error because SAS misinterprets the '.' in the name.
Has anyone encountered this before or know a way around it?
I can see the problem variables and their values in the .sas7bdat file, so the data did import; I just need to find a way to change the variable name so I can include it in the report.
You use name literal notation, 'variable___1.1'n, in your code, e.g.
rename 'variable___1.1'n = variable1_1;
Or set this option and reimport your data so that you get better names.
options validvarname=v7;
That will tell SAS to import the data with simpler, valid variable names. Note that I'm not sure whether that's two, three, or four underscores in the variable name; I'm guessing three.
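For reference, here is a minimal sketch of the rename applied in a DATA step; the dataset names are placeholders and the number of underscores is still a guess:

options validvarname=any;  /* allow the nonstandard name to be referenced via a name literal */

data work.clean;
    set work.imported;  /* the dataset created by the SPSS import */
    rename 'variable___1.1'n = variable1_1;  /* name literal on the left, valid name on the right */
run;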
What is the difference between the INFILE statement and the PROC IMPORT statement, and which one is better to use?
PROC IMPORT is a procedure. Like PROC MEANS or PROC PRINT.
INFILE is a data step statement. Like INPUT, PUT, IF, etc.
You can use PROC IMPORT to convert data in various forms into SAS datasets. It can read from database files, spreadsheets, and text files. It is really only for text files that a comparison with writing your own DATA step (with an INFILE statement as one of the many statements in that step) makes sense.
If you use PROC IMPORT to read a delimited file then the procedure will make decisions about how to interpret the file. It will guess how many variables there are and what type to use for them. It will guess if any informats need to be used to properly read the text into data. It will convert the header line to variable names. Depending on the data it can sometimes work well and sometimes do really dumb things.
If you write your own DATA step then you have full control over what variables are created and how to read them. This might take a little more code or effort than letting a procedure guess for you. But if you have to read many similar files then using a data step will let you have the control needed to make sure that the individual datasets are created consistently.
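As an illustration, here is a minimal sketch of both approaches reading the same hypothetical CSV; the file path, variable names, and lengths below are assumptions, not something from the question:

/* Let PROC IMPORT guess the variable names, types, and lengths */
proc import datafile="/tmp/people.csv"
    out=work.people_guessed
    dbms=csv
    replace;
    getnames=yes;
run;

/* A DATA step with INFILE: every attribute is stated explicitly */
data work.people;
    infile "/tmp/people.csv" dsd firstobs=2 truncover;
    length name $40 city $30;
    input name $ age city $ income;
run;

The DATA step version takes more typing, but it produces the same variable definitions every run, which is the consistency point made above.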
Tom has covered the difference between PROC IMPORT and INFILE, so I will only add: INFILE supports a libref, but PROC IMPORT does not, and PROC IMPORT does not support statements for naming the variables the way a DATA step does; the names, lengths, types, and other attributes are assigned automatically.
I have a macro that I use to import Excel files from a Windows directory to SAS (version 9.3) on a Linux server. In general the macro has worked fine, but now I'm trying to import an Excel file with a column that contains mostly numeric data with some character records thrown in.
The variable looks something like this:
Var2
1111111
2222222
3333333
4444444
Multiple
5555555
H6666-01
The variable is getting read in as numeric so I'm losing the data in the fifth and seventh records. I've tried a few of the suggestions listed in this answer, but nothing seems to change the variable type.
Here's a portion of the macro I have:
proc import replace
out=&d_set
dbms=excelcs
file="\\path\to\file\&xlsx_nm";
sheet="&sheet_nm";
server="Server";
port=0000;
serveruser="&sysget_USER";
serverpassword="&pw";
range="&rng";
DBDSOPTS = "DBTYPE=(Var2='CHAR(8)')";
run;
I just added the statement DBDSOPTS = "DBTYPE=(Var2='CHAR(8)')"; based on the suggestion on the link above, but the output in the log did not change.
I have also tried padding the original Excel file with a "dummy" record (which I'd like to avoid) with character data in the column that I'm having issues with, but this also did not work.
I'd like to solve this in the import procedure but I'm open to other suggestions.
In SAS, when creating a SAS data set from a raw data file (CSV), we can use either a DATA step with the INFILE statement or the PROC IMPORT procedure.
What are the advantages and disadvantages of each over the other?
PROC IMPORT makes assumptions about the lengths of character variables and the types of variables based on scanning a number of rows of the CSV, which is controlled by an option (GUESSINGROWS=). If you issue a RECALL command in interactive SAS after running PROC IMPORT, you get the DATA step code that PROC IMPORT generated to do the actual work. It generates FORMAT and INFORMAT statements that may or may not be exactly what you want.
I often use proc import as a data step code generator, recall the code, and then modify it to suit what I want.
You can also add other processing logic to extend the functionality of the step beyond simply reading the source data into a data set. Creating new variables as transformations of one or more of the columns in the CSV springs to mind.
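A minimal sketch of that generate-then-modify workflow; the file path, the GUESSINGROWS value, and the edited attributes are assumptions, and the recalled code will look different for your data:

/* Step 1: run PROC IMPORT, scanning more rows before it guesses types and lengths */
proc import datafile="/tmp/sales.csv"
    out=work.sales
    dbms=csv
    replace;
    getnames=yes;
    guessingrows=1000;
run;

/* Step 2: recall the generated DATA step and edit it, e.g. widen a character
   variable and derive a new column while reading */
data work.sales;
    infile "/tmp/sales.csv" dsd firstobs=2 truncover;
    length region $30;
    input region $ units revenue;
    if units > 0 then revenue_per_unit = revenue / units;  /* extra logic beyond the plain import */
run;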
I generally agree this is too broad a question. That said:
PROC IMPORT is slower than a DATA STEP. This is because PROC IMPORT looks at the file and then writes and executes a DATA STEP.
A DATA STEP requires you to know the name, position, and attributes (type, length, etc) for each variable.
If I need to read a file once, I just use PROC IMPORT.
If I need to read a file multiple times, I don't care about speed, and the file format might change, then I use PROC IMPORT.
If I am in a production system where speed matters and I want an ERROR if the format changes, then I use PROC IMPORT. But I take the DATA STEP it writes for me and put that into my code.
If PROC IMPORT fails to guess my columns correctly, I use PROC IMPORT, modify the DATA STEP it produces, and then use that.
I have no working knowledge of SAS, but I have an Excel file that I need to import and work with. In the Excel file there are about 100 rows (observations) and 7 columns (quantities). In some cases, a particular observation may not have any data in one column. I need to completely ignore that observation when reading my data into SAS. I'm wondering what the commands for this would be.
An obvious cheap solution would be to delete the rows with missing data in the Excel file, but I want to do this with SAS commands, because I want to learn some SAS.
Thanks!
Import the data however you want, for example with the IMPORT procedure, as Stig Eide mentioned.
proc import
datafile = 'C:\...\file.xlsx'
dbms = xlsx
out = xldata
replace;
mixed = YES;
getnames = YES;
run;
Explanation:
The DBMS= option specifies how SAS will try to read the data. If your file is an Excel 2007+ file, i.e. xlsx, then you can use DBMS=XLSX as shown here. If your file is older, e.g. xls rather than xlsx, try DBMS=EXCEL.
The OUT= option names the output dataset.
If a single level name is specified, the dataset is written to the WORK library. That's the temporary library that's unique to each SAS session. It gets deleted when the session ends.
To create a permanent dataset, specify a two level name, like mylib.xldata, where mylib refers to a SAS library reference (libref) created with a LIBNAME statement.
REPLACE overwrites the output dataset if it already exists, for example from a previous run of this step.
MIXED=YES tells SAS that the data may be of mixed types.
GETNAMES=YES will name your SAS dataset variables based on the column names in Excel.
If I understand you correctly, you want to remove every observation in the dataset that has a missing value in any of the seven columns. There are fancier ways to do this, but I recommend a simple approach like this:
data xldata;
set xldata;
where cmiss(col1, col2, ..., col7) = 0;
run;
The CMISS function counts the number of missing values in the variables you specify at each observation, regardless of the data type. Since we're using WHERE CMISS()=0, the resulting dataset will contain only the records with no missing data for any of the seven columns.
When in doubt, try browsing the SAS online documentation. It's very thorough.
If you have "SAS/ACCESS Interface to PC Files" licensed (hint: proc setinit) you can import the Excel file with this code. The where option lets you select which rows you want to keep, in this example you will keep the rows where the column "name" is not blank:
proc import
DATAFILE="your file.xlsx"
DBMS=XLSX
OUT=resulttabel(where=(name ne ""))
REPLACE;
MIXED=YES;
RUN;