Why is the length of the SAS character field 32,767?

According to The Little SAS Book, SAS character data can be up to 2^15 - 1 bytes in length.
Where does that one character go? In floating-point arithmetic we usually reserve one bit for the sign of the number. Is something similar happening for SAS character data?

I don't have a definitive answer, but I have a supposition.
I think that the length of 32,767 is not related to the field itself: SAS stores all of its rows (in an uncompressed file) in identically sized blocks, so there is no need for a field-length indicator or a null terminator. That is, for the following data step:
data want;
    length name $8;
    input recnum name $ age;
    datalines;
01 Johnny 13
02 Nancy 12
03 Rachel 14
04 Madison 12
05 Dennis 15
;;;;
run;
you'd get a file laid out roughly like this (the headers are of course not literally written out as text; they are just packed sequences of bytes):
<dataset header>
Dataset name: Want
Dataset record size: 24 bytes
... etc. ...
<subheaders>
Name character type length=8
Recnum numeric type length=8
Age numeric type length=8
... etc. ...
<first row of data follows>
4A6F686E6E792020000000010000000D
4E616E6379202020000000020000000C
52616368656C2020000000030000000E
4D616469736F6E20000000040000000C
44656E6E69732020000000050000000F
<end of data>
The variables run directly into each other, and SAS knows where one starts and another stops from the information in the subheaders. (This is just what a PUT statement would show, of course, and the numerics are shown here as 4-byte integers for compactness; real SAS numerics are 8 bytes, and if I remember correctly the actual file stores the numerics first. But the idea is the same.)
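To make that concrete, here is a minimal C++ sketch of my own (not actual SAS internals; the file name and layout are invented to mirror the example above). With fixed-size records, a reader needs only each field's offset and length from the metadata, so no terminators or per-field length bytes ever appear in the data, and any row can be found by simple arithmetic:
#include <cstring>
#include <fstream>
#include <iostream>
#include <string>

// Fixed record layout mirroring the subheaders above:
// name at offset 0 (8 bytes, blank-padded), recnum at 8, age at 16.
const int kRecordSize = 24;

int main() {
    {
        std::ofstream out("want.dat", std::ios::binary);
        const char* names[] = {"Johnny", "Nancy", "Rachel"};
        double ages[] = {13, 12, 14};
        for (int i = 0; i < 3; ++i) {
            char rec[kRecordSize];
            std::memset(rec, ' ', 8);                  // blank padding, no terminator
            std::memcpy(rec, names[i], std::strlen(names[i]));
            double recnum = i + 1;
            std::memcpy(rec + 8, &recnum, 8);
            std::memcpy(rec + 16, &ages[i], 8);
            out.write(rec, kRecordSize);
        }
    }
    // Random access: row n starts at byte n * kRecordSize, fields at fixed offsets.
    std::ifstream in("want.dat", std::ios::binary);
    char rec[kRecordSize];
    in.seekg(2 * kRecordSize);                         // jump straight to the third row
    in.read(rec, kRecordSize);
    double age;
    std::memcpy(&age, rec + 16, 8);
    std::cout << std::string(rec, 8) << " age=" << age << "\n";  // "Rachel   age=14"
}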
Technically the .sas7bdat format is not a publicly disclosed specification, but several people have worked out most of how it works. Some R programmers have written a specification which, while a bit challenging to read, does give some information.
It indicates that 4 bytes are used to specify the field length, which is more than enough for 32,767 (it's enough for 2 billion), so this isn't the definitive answer; I suppose it may originally have been 2 bytes and changed to 4 at some later point in the development of SAS, though .sas7bdat was a totally new file type created relatively recently (version 7, hence sas7bdat; we're on 9 now).
Another possibility, and perhaps the more likely one, is that before 1999 the ANSI C standard only required C compilers to support objects of at least 32,767 bytes - meaning a compiler didn't have to support arrays larger than that. While many compilers did support much larger arrays/objects, it's possible that SAS kept to the guaranteed minimum to avoid issues across different OS and hardware implementations. See this discussion of the ANSI C standards for some background. It's also possible that a similar limitation in one of the other languages SAS uses is at fault here. [Credit to FriedEgg for the beginning of this idea (offline).]

Related

Maximum number of arguments

I came across some Fortran 90 code where 68 arguments are passed to a function.
Upon searching the web I only found something about a limit of passing 256 bytes for some CUDA Fortran related stuff (http://www.pgroup.com/userforum/viewtopic.php?t=2235&sid=f241ca3fd406ef89d0ba08a361acd962).
So I wonder: is there a limit to the number of arguments that may be passed to a function for Intel/Visual/GNU fortran compilers?
I came across this discussion of the Fortran 90 standards:
http://www.nag.co.uk/sc22wg5/Guidelines_for_Bindings-b.html
The relevant section is quoted below:
3.4 Statements
Constraints on the length of a Fortran statement impose an upper limit on the length of a procedure call. Although it is unlikely to be a serious imposition, a single Fortran statement in free-form format is subject to an upper limit of 5241 characters, including a possible label (F90 sections 3.3.1, 3.3.1.4); in fixed-form format the upper limit is 1320 characters plus a possible label (F90 section 3.3.2). These limits are subject to the characters being of "default kind", that is in practice single-byte characters; for double-byte characters, used for example for ideographic languages, the limits are processor-dependent but are likely to be of the same order of size.
There is no limit in the Fortran Standard on the maximum number of arguments to a procedure.
I have a security tool that I am developing in my research that analyzes binaries. I knew that the C language has limits on the number of arguments (31 in the C90 standard, 127 in the C99 standard), so I thought that I could dimension a vector to hold 128 items pertaining to incoming arguments. Then I encountered a Fortran-derived binary with 290 arguments passed, which led me to this discussion. The binary is from the SPEC CPU2006 benchmark suite, benchmark 481.wrf, where I see a procedure named (in the binary) "solve_interface_" which sets up 290 arguments on the stack and then calls "solve_em_", which actually processes those arguments. You can no doubt find the Fortran source code for these procedures online. The binary was produced by the GNU compiler tools for an x86/Linux/ELF system.
I don't believe that the Fortran standards explicitly impose such a limit. However, they do place a limit on the length of a line of code (132 characters in free form) and on the number of continuation lines that can together form a single statement (39 in Fortran 90, raised to 255 in Fortran 2003). I'll leave it to you to figure out how many arguments you could use in a single call to a routine.
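As a very rough back-of-the-envelope figure (my own estimate, not something from the standard): a free-form statement may be 5241 characters, the text call s( plus the closing ) costs 8 of them, and since an actual argument may be repeated, each argument can cost as little as two characters (a,). That caps a single standard-conforming free-form call at roughly (5241 - 8) / 2, i.e. around 2600 arguments; the fixed-form limit of 1320 characters gives roughly 650.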
Many compilers on the market have more relaxed limits on both statement length and the number of continuation lines that can be used. However, it would not surprise me if some compiler imposed a maximum number of arguments for a routine, though the number is probably higher than any realistic need requires.

Types bit length and architecture specific implementations

I'm doing stuff in C++, but lately I've found that there are slight differences between platforms regarding how much data a type can accommodate, and byte order is an issue as well.
Suppose I have a binary file where I've encoded shorts that are 2 bytes in size. The file looks like:
FA C8 - data segment 1
BA 32 - data segment 2
53 56 - data segment 3
Now all is well up to this point. Now I want to read this data back, and there are 2 problems:
1. What data type should I choose to store these values?
2. How do I deal with the endianness of the target architecture?
The first problem is actually related to the second, because I may have to do bit shifts in order to swap the byte order.
I know that I could read the file byte by byte and combine every two bytes. But is there an approach that could ease that pain?
I'm sorry if I'm being ambiguous; the problem is hard to explain. Hope you get a glimpse of what I'm talking about. I just want to store this data internally.
So I would appreciate some advice, or if you could share some of your experience on this topic.
If you use big endian in the file that stores the data, then you can just rely on htons(), htonl(), ntohs(), and ntohl() to convert the integers to the right endianness before saving or after reading.
There is no easy way to do this.
Rather than doing that yourself, you might want to look into serialization libraries (for example Protobuf or boost serialization), they'll take care of a lot of that for you.
If you want to do it yourself, use fixed-width types (uint32_t and the like from <cstdint>), and endian conversion functions as appropriate. Either have a "prefix" in your file that determines what endianness it contains (a BOM/Byte Order Mark), or always store in either big or little endian, and systematically convert.
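For example, here is a small sketch (my own; it assumes the file stores 16-bit values big-endian, as in the hex dump in the question, and the file name is made up). Assembling each value from individual bytes sidesteps the host's endianness entirely, so no swapping is ever needed:
#include <cstdint>
#include <fstream>
#include <iostream>
#include <vector>

// Read one 16-bit big-endian value. Building the value from bytes
// gives the same result on any host, so no byte-swapping is needed.
bool read_u16_be(std::istream& in, std::uint16_t& value) {
    unsigned char bytes[2];
    if (!in.read(reinterpret_cast<char*>(bytes), 2)) return false;
    value = static_cast<std::uint16_t>((bytes[0] << 8) | bytes[1]);
    return true;
}

int main() {
    std::ifstream in("data.bin", std::ios::binary);
    std::vector<std::uint16_t> values;
    std::uint16_t v;
    while (read_u16_be(in, v)) values.push_back(v);  // 0xFAC8, 0xBA32, 0x5356, ...
    for (std::uint16_t x : values) std::cout << std::hex << x << "\n";
}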
Be extra careful if you need to serialize strings, they have encoding problems of their own too.

Unexpected "padding" in a Fortran unformatted file

I don't understand the format of unformatted files in Fortran.
For example:
open (3,file=filename,form="unformatted",access="sequential")
write(3) matrix(i,:)
outputs a column of a matrix into a file. I've discovered that it pads the data with 4 bytes on either end; however, I don't really understand why, or how to control this behavior. Is there a way to remove the padding?
For unformatted IO, Fortran compilers typically write the length of the record at the beginning and end of the record. Most but not all compilers use four bytes. This aids in reading records; e.g., the length at the end assists with a BACKSPACE operation. You can suppress these markers with the new stream IO mode of Fortran 2003, which was added for compatibility with other languages. Use access='stream' in your open statement.
I never use sequential access with unformatted output for this exact reason. However, it depends on the application, and sometimes it is convenient to have a record-length indicator (especially for unstructured data). As suggested by steabert in Looking at binary output from fortran on gnuplot, you can avoid this by using the keyword argument ACCESS='DIRECT', in which case you need to specify the record length. This method is convenient for efficient storage of large multi-dimensional structured data (constant record length). The following example writes an unformatted file whose size equals the size of the array:
REAL(KIND=4), DIMENSION(10) :: a = 3.141
INTEGER :: reclen

! Ask the processor for the record length needed to hold "a".
INQUIRE(iolength=reclen) a
OPEN(UNIT=10, FILE='direct.out', FORM='UNFORMATTED', &
     ACCESS='DIRECT', RECL=reclen)
WRITE(UNIT=10, REC=1) a
CLOSE(UNIT=10)
END
Note that this is not the ideal approach in the sense of portability. In an unformatted file written with direct access, there is no information about the size of each element. A readme text file that describes the data size does the job fine for me, and I prefer this method over padding in sequential mode.
Fortran IO is record based, not stream based. Every time you write something through write() you are writing not only the data but also begin and end markers for that record. Both record markers hold the size of that record. This is why writing a bunch of reals in a single write (one record: one begin marker, the bunch of reals, one end marker) produces a different file size from writing each real in a separate write (multiple records, each with a begin marker, one real, and an end marker). This is extremely important if you are writing large matrices, as you can balloon the file size if they are written improperly.
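To make the layout concrete, here is a small C++ sketch of my own that walks such a file record by record (it assumes the common case of 4-byte record markers in host byte order, as gfortran writes by default, records under 2 GB, and a made-up file name):
#include <cstdint>
#include <fstream>
#include <iostream>
#include <vector>

int main() {
    // Each sequential unformatted record is laid out as:
    // [int32 length] [payload bytes] [int32 length]
    std::ifstream in("matrix.dat", std::ios::binary);
    std::int32_t head, tail;
    while (in.read(reinterpret_cast<char*>(&head), 4)) {
        std::vector<char> payload(head);
        in.read(payload.data(), head);
        in.read(reinterpret_cast<char*>(&tail), 4);
        if (head != tail) { std::cerr << "corrupt record\n"; return 1; }
        // A record written as write(3) matrix(i,:) holds one row of default reals.
        std::cout << "record of " << head << " bytes ("
                  << head / 4 << " default reals)\n";
    }
}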
I am quite familiar with differing unformatted output from the Intel and Gnu Fortran compilers. Fortunately my vast experience dating back to 1970s IBM systems allowed me to decode things. Gnu pads records with 4-byte integer counters giving the record length. Intel uses a 1-byte counter and a number of embedded coding values to signify a continuation record or the end of a count. One can still have very long record lengths even though only 1 byte is used.
I have software compiled by the Gnu compiler that I had to modify so it could read an unformatted file generated by either compiler, so it has to detect which format it finds. Reading an unformatted file generated by the Intel compiler (which follows the 'old' IBM convention) takes "forever" using Gnu's fgetc or opening the file in stream mode. Converting the file to what Gnu expects makes it up to 100 times faster. It depends on your file size whether you want to bother with detection and conversion or not. I reduced my program's startup time (it opens a large unformatted file) from 5 minutes down to 10 seconds. I had to add options to convert back again if the user wants to take the file back to an Intel-compiled program. It's all a pain, but there you go.

Write a program that takes text as input and produces a program that reproduces that text

Recently I came across a nice problem that turned out to be as simple to understand as it is hard to solve. The problem is:
Write a program that reads a text from input and prints some other
program on output. If we compile and run the printed program, it must
output the original text.
The input text is supposed to be rather large (more than 10000 characters).
The only (and very strong) requirement is that the size of the archive (i.e. the program printed) must be strictly less than the size of the original text. This rules out obvious solutions like
std::string s;
/* read the text into s */
std::cout << "#include<iostream> int main () { std::cout<<\"" << s << "\"; }";
I believe some archiving techniques are to be used here.
Unfortunately, such a program does not exist.
To see why this is so, we need to do a bit of math. First, let's count up how many binary strings there are of length n. Each of the bits can be either a 0 or 1, which gives us one of two choices for each of those bits. Since there are two choices per bit and n bits, there are thus a total of 2^n binary strings of length n.
Now, let's suppose that we want to build a compression algorithm that always compresses a bitstring of length n into a bitstring of length less than n. In order for this to work, we need to count up how many different strings of length less than n there are. Well, this is given by the number of bitstrings of length 0, plus the number of bitstrings of length 1, plus the number of bitstrings of length 2, etc., all the way up to n - 1. This total is
2^0 + 2^1 + 2^2 + ... + 2^(n-1)
This is a geometric series: doubling the sum and subtracting the original leaves 2^n - 2^0, so this number is equal to 2^n - 1. In other words, the total number of bitstrings of length less than n is one smaller than the number of bitstrings of length n.
But this is a problem. In order for us to have a lossless compression algorithm that always maps a string of length n to a string of length at most n - 1, we would have to have some way of associating every bitstring of length n with some shorter bitstring, such that no two bitstrings of length n are associated with the same shorter bitstring. This way, we can compress the string by just mapping it to the associated shorter string, and we can decompress it by reversing the mapping. The restriction that no two bitstrings of length n map to the same shorter string is what makes this lossless - if two length-n bitstrings were to map to the same shorter bitstring, then when it came time to decompress the string, there wouldn't be a way to know which of the two original bitstrings we had compressed.
This is where we reach a problem. Since there are 2^n different bitstrings of length n and only 2^n - 1 shorter bitstrings, there is no possible way we can pair up each bitstring of length n with some shorter bitstring without assigning at least two length-n bitstrings to the same shorter string. This means that no matter how hard we try, no matter how clever we are, and no matter how creative we get with our compression algorithm, there is a hard mathematical limit that says that we can't always make the text shorter.
So how does this map to your original problem? Well, if we get a string of text of length at least 10000 and need to output a shorter program that prints it, then we would have to have some way of mapping each of the 2^10000 strings of length 10000 onto the 2^10000 - 1 strings of length less than 10000. That mapping has some other properties, namely that we always have to produce a valid program, but that's irrelevant here - there simply aren't enough shorter strings to go around. As a result, the problem you want to solve is impossible.
That said, we might be able to get a program that can compress all but one of the strings of length 10000 to a shorter string. In fact, we might find a compression algorithm that does this, meaning that with probability 1 - 2^(-10000) any string of length 10000 could be compressed. This is such a high probability that if we kept picking strings for the lifetime of the universe, we'd almost certainly never guess the One Bad String.
For further reading, there is a concept from information theory called Kolmogorov complexity, which is the length of the smallest program necessary to produce a given string. Some strings are easily compressed (for example, abababababababab), while others are not (for example, sdkjhdbvljkhwqe0235089). There exist strings that are called incompressible strings, for which the string cannot possibly be compressed into any smaller space. This means that any program that would print that string would have to be at least as long as the given string. For a good introduction to Kolmogorov Complexity, you may want to look at Chapter 6 of "Introduction to the Theory of Computation, Second Edition" by Michael Sipser, which has an excellent overview of some of the cooler results. For a more rigorous and in-depth look, consider reading "Elements of Information Theory," chapter 14.
Hope this helps!
If we are talking about ASCII text...
I think this actually could be done, and I think the restriction that the text will be larger than 10000 chars is there for a reason (to give you coding room).
People here are saying that the string cannot be compressed, yet it can.
Why?
Requirement: OUTPUT THE ORIGINAL TEXT
Text is not data. When you read input text you read ASCII chars (bytes), which have both printable and non-printable values inside.
Take this for example:
ASCII values characters
0x00 .. 0x08 NUL, (other control codes)
0x09 .. 0x0D (white-space control codes: '\t','\f','\v','\n','\r')
0x0E .. 0x1F (other control codes)
... rest of printable characters
Since you have to print text as output, you are not interested in the range (0x00-0x08,0x0E-0x1F).
You can compress the input bytes by using a different storing and retrieving mechanism (binary patterns), since you don't have to give back the original data, only the original text. You can recalculate what the stored values mean and readjust them to bytes to print. You would effectively lose only data that was not text data anyway, and is therefore not printable or inputtable. If WinZip did that it would be a big fail, but for your stated requirements it simply does not matter.
Since the requirement states that the text is 10000 chars and you can drop 27 of the 256 byte values, if your packing had no loss you would effectively be saving around 10% of the space, which means that if you can code the 'decompression' in 1000 (10% of 10000) characters, you can achieve it. You would have to treat groups of 11 chars as 10 bytes, reconstructing the 11th by some packing method over your range of 229 values. If that can be done then the problem is solvable.
Nevertheless it requires clever thinking, and coding skills that can actually do that in 1 kilobyte.
Of course this is just a conceptual answer, not a functional one.
I don't know if I could ever achieve this.
But I had the urge to give my 2 cents on this, since everybody here seemed so sure it cannot be done.
The real problem in your problem is understanding the problem and the requirements.
What you are describing is essentially a program for creating self-extracting zip archives, with the small difference that a regular self-extracting zip archive would write the original data to a file rather than to stdout. If you want to make such a program yourself, there are plenty of implementations of compression algorithms, or you could implement e.g. DEFLATE (the algorithm used by gzip) yourself. The "outer" program must compress the input data and output the code for the decompression, and embed the compressed data into that code.
Pseudocode:
string originalData;
cin >> originalData;
char * compressedData = compress(originalData);
cout << "#include<...> string decompress(char * compressedData) { ... }" << endl;
cout << "int main() { char compressedData[] = {";
(output the int values of the elements of the compressedData array)
cout << "}; cout << decompress(compressedData) << endl; return 0; }" << endl;
Assuming "character" means "byte" and assuming the input text may contains at least as many valid characters as the programming language, its impossible to do this for all inputs, since as templatetypedef explained, for any given length of input text all "strictly smaller" programs are themselves possible inputs with smaller length, which means there are more possible inputs than there can ever be outputs. (It's possible to arrange for the output to be at most one bit longer than the input by using an encoding scheme that starts with a "if this is 1, the following is just the unencoded input because it couldn't be compressed further" bit)
Assuming it's sufficient to have this work for most inputs (e.g. inputs that consist mainly of ASCII characters and not the full range of possible byte values), then the answer readily exists: use gzip. That's what it's good at. Nothing is going to be much better. You can either create self-extracting archives, or treat the gzip format as the output "language". In some circumstances you may be more efficient by having a complete programming language or executable as your output, but often, reducing the overhead by using a format designed for this exact problem, i.e. gzip, will be more efficient.
It's called a file archiver producing self-extracting archives.

Importing data from file to array

I have a 2-dimensional table in a file, which looks like this:
11, 12, 13, 14, 15
21, 22, 23, 24, 25
I want to import it into a 2-dimensional array. I wrote this code:
INTEGER :: SMALL(10)
DO I = 1, 3
READ(UNIT=10, FMT='(5I4)') SMALL
WRITE(UNIT=*, FMT='(6X,5I4)') SMALL
ENDDO
But it imports everything into a one-dimensional array.
EDIT:
I've updated code:
program filet
integer :: reason
integer, dimension(2,5) :: small
open(10, file='boundary.inp', access='sequential', status='old', FORM='FORMATTED')
rewind(10)
DO
READ(UNIT=10, FMT='(5I4)', iostat=reason) SMALL
if (reason /= 0) exit
WRITE(UNIT=*, FMT='(6X,5I4)') SMALL
ENDDO
write (*,*) small(2,1)
end program
Here is output:
11 12 13 14 15
21 22 23 24 25
12
Well, you have defined SMALL to be a 1-D array, and Fortran is just trying to be helpful. You should perhaps have defined SMALL like this:
integer, dimension(2,5) :: small
What happened when the read statement was executed was that the system ran out of edit descriptors (you specified 5 integers) before either SMALL was full or the end of the file was encountered. If I remember rightly, Fortran will re-use the edit descriptor until either SMALL is full or the end-of-file is encountered. But this behaviour has changed over the years across Fortran standards, and various compilers have implemented various non-standard features in this part of the language, so you may need to check your compiler's documentation or do some more experiments to figure out exactly what happens.
I think your code is also a bit peculiar in that you read into SMALL 3 times. Why?
EDIT: OK, we're getting there. You have just discovered that Fortran stores arrays in column-major order. I believe that most other programming languages store them in row-major order. In other words, the first element of your array is small(1,1), the second (in memory) is small(2,1), the third is small(1,2) and so forth. I think that your read (and write) statements are not standard but widely implemented (which is not unusual in Fortran compilers). I may be wrong, it may be standard. Either way, the read statement is being interpreted to read the elements of small in column-major order. The first number read is put in small(1,1), the second in small(2,1), the third in small(1,2) and so on.
Your write statement makes use of the same feature; you might have discovered this for yourself if you had written out the elements in loops with the indices printed too.
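For contrast, here is a quick C++ illustration of row-major order (my own, just to show the difference). In C++ the last index varies fastest in memory, which is exactly the opposite of Fortran:
#include <iostream>

int main() {
    // C++ stores this 2x5 array row by row: the LAST index varies fastest.
    int small[2][5] = {{11, 12, 13, 14, 15},
                       {21, 22, 23, 24, 25}};
    const int* flat = &small[0][0];
    for (int k = 0; k < 10; ++k)
        std::cout << flat[k] << " ";  // prints 11 12 13 14 15 21 22 23 24 25
    std::cout << "\n";
    // A Fortran small(2,5) holding the same table would sit in memory as
    // 11 21 12 22 13 23 ... because there the FIRST index varies fastest.
}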
The idiomatic Fortran way of reading an array and controlling the order in which elements are placed into the array, is to include an implied-do loop in the read statement, like this:
READ(UNIT=10, FMT='(5I4)', iostat=reason) ((SMALL(row,col), col = 1,numCol), row=1,numRow)
You can also use this approach in write statements.
You should also study your compiler documentation carefully and determine how to switch on warnings for all non-standard features.
Adding to what High Performance Mark wrote...
If you want to use commas to separate the numbers, then you should use list-directed IO rather than formatted IO. (Sometimes this is called format-free IO, but that non-standard term is easy to confuse with binary IO). This is easier to use since you don't have to arrange the numbers precisely in columns and can separate them with spaces or commas. The read is simply "read (10, *) variables"
But sticking to formatted IO, here is some sample code:
program demo1
   implicit none
   integer, dimension (2,5) :: small
   integer :: irow, jcol
   open (unit=10, file='boundary.txt', access='sequential', form='formatted')
   do irow = 1, ubound (small, 1)
      read (10, '(5I4)') (small (irow, jcol), jcol = 1, ubound (small, 2))
   end do
   write (*, '( / "small (1,2) =", I2, " and small (2,1) =", I2 )' ) small (1,2), small (2,1)
end program demo1
Using the I4 formatted read, the data need to be in columns:
12341234123412341234
11 12 13 14 15
21 22 23 24 25
The data file shouldn't contain the first row "1234..." -- that is in the example to make the alignment required for the format 5I4 clear.
With my example program, there is an outer do loop for irow and an "implied do loop" as part of the read statement. You could also eliminate the outer do loop and use two implied do loops on the read statement, as High Performance Mark showed. In this case, if you kept the format specification (5I4), it would get reused to read the second line -- this is called format reversion. (On a more complicated format, one needs to read the rules to understand which part of the format is reused in format reversion.) This is standard, and has been so at least since FORTRAN 77 and probably FORTRAN IV. (Of course, the declarations and style of my example are Fortran 90).
I used "ubound" so that you neither have to carry around variables storing the dimensions of the array, nor use specific numeric values. The later method can cause problems if you later decide to change the dimension of the array -- then you have to hunt down all of the specific values (here 2 and 5) and change them.
There is no need for a rewind after an open statement.