You’re attempting to read a raw data file and you see the following messages displayed in the SAS Log:
NOTE: Invalid data for Salary in line 4 15-23.
RULE: ----|----10---|----20---|----30---|----40---|----50-
4 120104 F 46#30 11MAY1954 33
Employee_Id=120104 employee_gender=F Salary=. birth_date=-2061 _ERROR_=1 _N_=4
NOTE: 20 records were read from the infile ‘c:employees.dat’.
The minimum record length was 33.
The maximum record length was 33.
NOTE: The data set WORK.EMPLOYEES has 20 observations and 4 variables
What does it mean?
A. A compiler error, triggered by an invalid character for the variable Salary.
B. An execution error, triggered by an invalid character for the variable Salary.
C. The 1st of potentially many errors, this one occurring on the 4th observation.
D. An error on the INPUT statement specification for reading the variable Salary.
Looking at the problem:
NOTE: Invalid data for Salary in line 4 15-23.
That is the note you get when you have an input statement from a file or datalines, and you are expecting a numeric value but encounter a nonnumeric value that cannot be read into that field (or otherwise something that doesn't match the expected informat).
D. An error on the INPUT statement specification for reading the variable Salary.
That seems like the best answer to me, depending on how you parse the answer text.
(A) refers to a compile-time error, which would occur before any data is read in. It's certainly not that, since the data is where the problem is.
(B) is the other possible answer; it is execution time certainly, and it is indeed caused by an invalid character in the data, but I don't like how that answer is worded and think it's not clear.
(C) is wrong because this is the only error you see...
(D) is the most accurate, I believe, if you assume your data is right anyway. It's possible though that the Input statement is right and your data is bad; in that case it would point to (B) being the right answer.
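To make the execution-time behaviour concrete, here is a rough Python analogy (not SAS) of what the log shows: the bad field becomes a missing value, an error flag is raised for that record, and reading carries on. Only the value 46#30 comes from the log above; the other salaries and the helper name read_numeric are invented for illustration.

```python
def read_numeric(field):
    """Return (value, error_flag) for one raw field, loosely like a numeric informat."""
    try:
        return float(field), 0
    except ValueError:
        # Invalid data: the value becomes missing and the error flag is set,
        # but processing continues with the next record.
        return None, 1


# Hypothetical raw Salary fields; only "46#30" is taken from the log.
records = ["45000", "52000", "38500", "46#30", "61000"]

salaries = []
for n, field in enumerate(records, start=1):
    value, error = read_numeric(field)
    if error:
        print(f"NOTE: Invalid data for Salary in line {n}.")
    salaries.append(value)

print(salaries)
```

Every record still yields an observation, which matches the log reporting 20 records read and 20 observations written.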
I'm trying to import an arff file into weka and I am continually getting the following error
Unable to determine structure as arff (Reason: java.io.IOException: } expected at end of enumeration, read Token[EOL], line 20
The closing bracket } is present and I can't find any other errors with line 20. In fact the error reappears after I've deleted line 20. I've attached a link to the arff with a couple lines of data: Link
I see three issues here. FIRST: In the file you linked to, the last attribute is los_category, which is NUMERIC:
@ATTRIBUTE los_category NUMERIC
But the last variable in your data line is clearly not numeric (Low).
21,22,165315,4/9/2196__12:26:00_PM,4/10/2196__3:54:00_PM,?,EMERGENCY,EMERGENCY_ROOM_ADMIT,DISC-TRAN_CANCER/CHLDRN_H,Private,?,UNOBTAINABLE,MARRIED,WHITE,4/9/2196__10:06:00_AM,4/9/2196__1:24:00_PM,BENZODIAZEPINE_OVERDOSE,0,1,1.144444444,Low
You've defined 20 variables with @ATTRIBUTE statements (lines 3-22), but in fact your data lines have 21 variables.
SECOND, you have time variables (e.g. admittime) as numeric, but they clearly have non-numeric characters. I know there's a specific format that ARFF files want date/time in, but I'm not an expert in that and can't be definitive about a fix. This is definitely a problem, though. When I create a file with just your first 3 variables, it loads fine. When I add the fourth (@ATTRIBUTE admittime NUMERIC) I get the same error as you report.
THIRD, that line 19 (@ATTRIBUTE diagnosis) is hundreds of characters long. You might want to treat that as a STRING variable type for now, just to be sure you aren't overloading the read buffer with that huge line.
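The first issue (more fields per data line than declared attributes) is easy to check mechanically. The following is a minimal Python sketch, not part of Weka: check_arff is a hypothetical helper, the sample text is made up, and the comma split is naive (it assumes no quoted commas inside values).

```python
def check_arff(text):
    """Return (declared attribute count, [(line_no, field_count), ...] for mismatched data lines)."""
    declared = 0
    in_data = False
    mismatches = []
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and ARFF comments
            continue
        if in_data:
            n_fields = len(line.split(","))  # naive split: ignores quoted commas
            if n_fields != declared:
                mismatches.append((line_no, n_fields))
        elif line.lower().startswith("@attribute"):
            declared += 1
        elif line.lower().startswith("@data"):
            in_data = True
    return declared, mismatches


# Made-up miniature file: two declared attributes, second data line has three fields.
sample = """@RELATION demo
@ATTRIBUTE id NUMERIC
@ATTRIBUTE los NUMERIC
@DATA
21,1.14
22,1.5,Low
"""

declared, mismatches = check_arff(sample)
print(declared, mismatches)
```

Running something like this on the real file should flag every data line if the declarations are one short.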
The local system datetime is 10:34 PM 1/8/2021.
In Stata I write
local datestamp: di %tdCCYY-NN-DD daily("S_DATE","DMY")
display `datestamp'
and the output is 2012
If I write
di %tdCCYY-NN-DD daily("S_DATE","DMY")
I get 2021-01-08
Why the discrepancy? This is puzzling to me. I clearly assigned datestamp yet when I display it obviously something is wrong.
Executive summary: display saw 2021-01-08 and evaluated it as an expression in numbers. 2021 - 1 - 8 = 2012, so 2012 was what you saw.
This is a subtle question, but the answer will show Stata's perfect logic, by its own rules.
The code as posted in the question omits the crucial $ sign before S_DATE, which indicates a global macro, specifically a system macro containing the current daily date, obtained from your operating system.
It is now 9 January 2021 in my time zone, but my example will work as well as yours to show what is going on. You defined a local macro, and then you included a reference to that local macro in a call to display. The display command has a designed inclination to calculate the result of any expression it sees before it displays the result of that calculation.
Taking this more slowly: There are two quite distinct steps to the interpretation of your display command. First, as a matter of interpreting any Stata command line, all references to local and global macros are replaced with the contents of those macros (if they exist; it is not an error to refer to a macro that does not exist, but that is not an issue here). Second, display evaluates any expression it sees and then displays the result of that expression. Despite its name, display is not designed to show you directly any macro that exists, although that is what happens if the result of evaluating it leaves it the same as when it was presented. Thus if a local macro contains the string foo, that is what display will show you -- unless foo is the name of a scalar or variable, in which case the name won't be shown, just the values of that scalar or that variable (in the first observation, in the latter case).
The command to see exactly what is inside a macro, without interpretation or calculation, is macro list.
To the point, consider the different results here. In the first display command, the quotation marks " " are functional, not ornamental, and instruct display to treat its input as a string. Without the quotation marks, display is inclined to treat what it sees as numeric, and here it sees an expression, 2021 MINUS 1 MINUS 9, which evaluates to 2011. The leading zeros are ignored. In your case your date was 2021-01-08 and the result was 2012, as you reported.
. local datestamp: di %tdCCYY-NN-DD daily("$S_DATE","DMY")
. di "`datestamp'"
2021-01-09
. di `datestamp'
2011
You get the right answer with the last statement in your question. You fed display a number but instructed it to use a daily date display format to interpret that number, and you got exactly what you asked for and expected. 22288 is, or was, 8 January 2021 on a scale with origin 0 at 1 January 1960.
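Both pieces of arithmetic can be checked outside Stata. This is a small Python sketch, offered only as a cross-check: the first part mimics display treating the unquoted macro contents as a subtraction, and the second confirms the day count from 1 January 1960.

```python
from datetime import date, timedelta

stamp = "2021-01-08"  # the contents of the local macro in the question

# Unquoted, the contents are read as arithmetic on three numbers
# (int() drops the leading zeros, as Stata does here).
year, month, day = (int(part) for part in stamp.split("-"))
as_expression = year - month - day

# Stata daily dates count days from 1 January 1960.
origin = date(1960, 1, 1)
days_since_origin = (date(2021, 1, 8) - origin).days
round_trip = origin + timedelta(days=days_since_origin)

print(as_expression)           # 2012
print(days_since_origin)       # 22288
print(round_trip.isoformat())  # 2021-01-08
```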
I googled about "type mismatch", and it seems the errors mostly come from "replace"
Indeed I am doing some replacing but I can't see where that error comes from.
generate price=0.0
replace price=105.17 if year==2014
gen crisis=1 if year==2008 | year==2009
replace crisis=0 if year<2008 | year>2009
gen postcrisis=1 if year>2008
replace postcrisis=0 if year<=2008
Also, Stata isn't displaying at which line the error happened. This is very bad for debugging. How can I make it?
======================================
The error was coming from
generate realsales=sales/price
To see what is going wrong, I did the following.
. describe sales price
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------
sales           str8    %9s
price           float   %9.0g
And destring didn't work.
. destring sales, replace
sales contains nonnumeric characters; no replace
Also, dataex didn't work.
. dataex
input statement exceeds linesize limit. Try specifying fewer variables
And still, when Stata stops with an error, it never tells me which line is causing the error. It simply shows me the following lines.
112.
. }
(146 vars, 10748 obs)
type mismatch
r(109);
end of do-file
r(109);
This is very inconvenient for debugging. Is it really like this? Is there any way to make Stata display the error line?
As you tell us nothing about your variables, this isn't a reproducible example.
A type mismatch means that you are trying to do something numeric to strings, or vice versa. In your examples, possibly year is a string variable somehow. If so,
destring year, replace
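When destring refuses with "contains nonnumeric characters", as it did for sales in the question, the next step is to find the offending values. The idea can be sketched in Python (an illustration, not Stata): try to convert each value and collect the ones that fail. The sample values and the helper name nonnumeric_values are invented; note also that Python's float() accepts a few strings, such as "nan" and "inf", that you may not want to count as numeric.

```python
def nonnumeric_values(values):
    """Return the values that cannot be converted to a number."""
    bad = []
    for v in values:
        try:
            float(v)
        except ValueError:
            bad.append(v)
    return bad


# Hypothetical contents of a string variable that should have been numeric.
sales = ["1200", "1,350", "980", "n/a"]

bad = nonnumeric_values(sales)
print(bad)
```

Thousands separators and placeholder text such as "n/a" are typical culprits.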
On debugging: Stata will stop with an error message as soon as it hits a problem. For more detail, see help trace to find out about program tracing.
Your example statements could all be condensed. In the last example, if crisis years are 2008 and 2009, you don't mean what you say.
generate price = cond(year == 2014, 105.17, 0)
gen crisis = year==2008 | year==2009
gen postcrisis = year > 2009
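The point about the last example can be seen by tabulating both versions over a few years. This is a Python sketch with an arbitrary range of years chosen for illustration: under the definitions in the question, 2009 counts as both crisis and postcrisis, while the condensed versions keep the two indicators disjoint.

```python
years = range(2006, 2012)  # arbitrary illustrative range

# Definitions as written in the question.
crisis_q = {y: int(y == 2008 or y == 2009) for y in years}
postcrisis_q = {y: int(y > 2008) for y in years}

# Condensed definitions from the answer.
crisis_a = {y: int(y == 2008 or y == 2009) for y in years}
postcrisis_a = {y: int(y > 2009) for y in years}

overlap_question = [y for y in years if crisis_q[y] and postcrisis_q[y]]
overlap_answer = [y for y in years if crisis_a[y] and postcrisis_a[y]]

print(overlap_question)  # [2009]
print(overlap_answer)    # []
```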
I've been given the challenge to port a Fortran 77 program into C#.
I've found out that read(5,*) reads from the standard input, i.e. the keyboard.
Now I'm trying to understand how the following works:
1. When I run the program, I have to run it as cheeseCalc<blue.dat>output.txt, which reads the blue.dat file and produces an output.txt file. How does read work in this case?
2. In the same program, there is READ(5,* )IDUM and later it also has read(5,*)idum,idum,tinit. What is happening in this case?
The blue.dat file has the following lines:
HEAD make new cake
INPUT VARIABLES
MFED MASS-FEED 30 ;1001 1 100 PEOPLE TO FEED
TOVE TEMP-IN-OVEN 150.0 ;1001 20 100 TEMPERATURE OF OVEN, C
UPDATED: Just for context, the initial lines of code in the program are:
program cheeseCalc
CHARACTER*76 IDENT
CHARACTER*1 IDUM
READ(5,104)IDENT
104 FORMAT(4X,A)
READ(5,*)IDUM
c write start record
write(6,102)IDENT
102 format('**START',/,4X,A,/)
read(5,*)idum,idum,frate
110 format(f10.0)
frate2=frate/3.6
read(5,*)idum,idum,tempo
* Do calculation *
write(6,*)frate2,tempo
end
Any help will be appreciated!! Thanks!
The full detail of the general read statement is documented elsewhere, but there is an idiom here which is perhaps worth elaborating on.
The statement read(5,*) ... is list-directed input from the external unit number 5. Let's assume (it's not guaranteed, but it's likely and you seem happy with that for your setup) that this external unit is standard input.
The idiomatic part is the repeated use of a single variable in an input list such as
read(5,*) idum, idum, ...
This (and the fact that idum is an (awfully named) length-1 character variable) signifies that the user doesn't care about the input in the first two fields. The first string, delimited by blanks, is read, then its first character is assigned to idum. Then idum is immediately set to the first character of the next string.
The purpose of this is to set the place in the record to the third field, which is read into the (real) variable frate (in the first case).
Equally
read(5,*) idum
is just skipping the second line (strictly, reading the first character, but that's not used anywhere before the next read into idum): the first blank-delimited field is read but the next read moves on to the next line rather than continuing with that one.
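Since the goal is a port to another language, it may help to see the same reads emulated outside Fortran. The following is a Python sketch under simplifying assumptions: each read consumes exactly one record, fields are separated by blanks or commas, and the blank padding of the CHARACTER*76 variable is ignored. Real list-directed input has more rules (for example, it continues onto the next record if a record runs out of fields), and the helper names list_directed and run are invented here.

```python
import io

BLUE_DAT = """HEAD make new cake
INPUT VARIABLES
MFED MASS-FEED 30 ;1001 1 100 PEOPLE TO FEED
TOVE TEMP-IN-OVEN 150.0 ;1001 20 100 TEMPERATURE OF OVEN, C
"""


def list_directed(record, count):
    """Return the first `count` blank- or comma-delimited fields of one record."""
    return record.replace(",", " ").split()[:count]


def run(stream):
    # READ(5,104) IDENT with FORMAT(4X,A): skip 4 characters, keep the rest.
    ident = stream.readline().rstrip("\n")[4:]
    # READ(5,*) IDUM: first character of the first field; rest of the record is dropped.
    idum = list_directed(stream.readline(), 1)[0][:1]
    # read(5,*) idum,idum,frate: two throwaway fields, then a real number.
    fields = list_directed(stream.readline(), 3)
    idum, frate = fields[1][:1], float(fields[2])
    frate2 = frate / 3.6
    # read(5,*) idum,idum,tempo: same pattern on the next record.
    fields = list_directed(stream.readline(), 3)
    idum, tempo = fields[1][:1], float(fields[2])
    return ident, frate2, tempo


ident, frate2, tempo = run(io.StringIO(BLUE_DAT))
print(repr(ident), frate2, tempo)
```

With the redirection cheeseCalc<blue.dat>output.txt, the shell connects the file to standard input, so the program reads it exactly as it would read typed lines; the StringIO object plays that role in the sketch.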
I came across a little puzzle with Stata's locals, display, and quotes.
Consider this example:
generate var1 = 54321 in 1
local test: di %10.0gc var1[1]
Why is the call:
di "`test'"
returning
54,321
Whereas the call:
di `test'
shows
54 321
What is causing such behaviour?
Complete the sequence with
(1)
. di 54,321
54 321
(2)
. di "54,321"
54,321
display interprets (1) as an instruction to display two arguments, one by one. You get the same result with your last line as (first) the local macro test was evaluated and (second) display saw the result of the evaluation.
The difference when quotation marks are supplied is that thereby you insist that the argument is a literal string. You get the same result with your first display command for the same reasons as just given.
In short, the use of local macros here is quite incidental to the differences in results. display never sees the local macro as such; it just sees its contents after evaluation. So, what you are seeing pivots entirely on nuances in what is presented to display.
Note further that while you can use a display format in defining the contents of a local macro, that ends that story. A local does not have an attached format that sticks with it. It's just a string (which naturally may mean a string with numeric characters).
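The same distinction exists in other languages, which may make it feel less Stata-specific. As a loose analogy in Python (not a model of Stata's macro processor), a quoted "54,321" is one string argument, while an unquoted 54,321 is two numeric arguments that are shown one after the other. The helper shown is invented so that the printed text can be captured and compared.

```python
import io
from contextlib import redirect_stdout


def shown(*args):
    """Return what print(*args) would write, without the trailing newline."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        print(*args)
    return buffer.getvalue().rstrip("\n")


as_string = shown("54,321")    # like: di "`test'"
as_two_args = shown(54, 321)   # like: di `test'  (after macro substitution)

print(as_string)
print(as_two_args)
```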