Is the md5 function safe to use for merging datasets? - sas

We are about to promote a piece of code which uses the SAS md5() hash function to efficiently track changes in a large dataset.
format md5 $hex32.;
md5=md5(cats(of _all_));
As per the documentation:
The MD5 function converts a string, based on the MD5 algorithm, into a 128-bit hash value. This hash value is referred to as a message digest (digital signature), which is nearly unique for each string that is passed to the function.
At approximately what stage does 'nearly unique' begin to pose a data integrity risk (if at all)?

I have seen an example where the md5 comparison goes wrong.
If you have the values "AB" and "CD" in the (two columns of the) first row and "A" and "BCD" in the second row, they get the same md5 value. See this example:
data md5;
attrib a b length=$3 informat=$3.;
infile datalines;
input a b;
format md5 $hex32.;
md5=md5(cats(of _all_));
datalines;
AB CD
A BCD
;run;
This is, of course, because CATS(of _all_) concatenates and strips the variables (converting numbers to strings using the BEST. format) without a delimiter. If you use CAT instead, this will not happen, because the leading and trailing blanks are not removed. This error is not very far-fetched: if you have missing values, it can occur more often, and if, for example, you have a lot of binary values in text variables, some of which are missing, it can occur very often.
One could do this manually, adding a delimiter between the values. Of course, you would still have a collision when you have ("AB!" and "CD") versus ("AB" and "!CD") and you use "!" as the delimiter...

MD5 has 2^128 distinct values, and from what I've read, at around 2^64 different messages (that's roughly 2x10^19) you begin to have a high likelihood of finding a collision.
However, because of how MD5 is constructed, there is some additional collision risk from very similar preimages that differ in as little as two bytes. As such, it's hard to say how risky this would be for your particular process. A collision could in principle occur with as few as two messages; it's just not likely. Does saving [some] computing time benefit you enough to outweigh a small risk?

Related

SAS - How to determine the number of variables in a used range?

I imagine what I'm asking is pretty basic, but I'm not entirely certain how to do it in SAS.
Let's say that I have a range of variables, or an array, x1-xn. I want to be able to run a program that uses the number of variables within that range as part of its calculation. But I want to write it in such a way that, if I add variables to that range, it will still function.
Essentially, I want to create a variable whose value is '6' if I have x1-x6, but '7' if I have x1-x7.
I know that :
var1=n(of x1-x6)
will return the number of non-missing numeric variables... but I want this to work even if there are missing values.
I hope I explained that clearly and that it makes sense.
Couple of things.
First off, when you put a range like you did:
x1-x7
That will always evaluate to seven items, whether or not those variables exist. That simply evaluates to
x1 x2 x3 x4 x5 x6 x7
So it's not very interesting to ask how many items are in that, unless you're generating that through a macro (and if you are, you probably can have that macro indicate how many items are in it).
But the ranges x1--x7 or x: are more interesting problems, so we'll continue.
The easiest way to do this, if the variables are all of a single type (even if unknown), is to create an array and then use the DIM function.
data _null_;
x3='ABC';
array _temp x1-x7;
count = dim(_temp);
put count=;
run;
That doesn't work, though, if there are multiple types (numeric and character) at hand. If there are, then you need to do something more complex.
The next easiest solution is to combine nmiss and n. This works if they're all numeric, or if you're tolerant of the log messages this will create.
data _null_;
x3='ABC';
count = nmiss(of x1-x7) + n(of x1-x7);
put count=;
run;
NMISS counts the missing values and N counts the nonmissing numeric values, so together they cover everything. Here x3 is counted in the NMISS group.
Unfortunately, there is not a c version of n, or we'd have an easier time with this (combining c and cmiss). You could potentially do this in a macro function, but that would get a bit messy.
Fortunately, there is a third option that is tolerant of character variables: combining countw with catx. Then:
data _null_;
x3='ABC';
x4=' ';
count = countw(catq('dm','|',of x1-x7),'|','q');
put count=;
run;
This will count all variables, numeric or character, with no conversion notes.
What you're doing here is concatenating all of the variables together with a delimiter between them, so [x1]|[x2]|[x3]..., and then counting the number of "words" in that string, defining a word as anything delimited by "|". Even missing values produce something, so .|.|ABC|.|.|.|. has 7 "words".
The 'm' argument to CATQ tells it to even include missing values (spaces) in the concatenation. The 'q' argument to COUNTW tells it to ignore delimiters inside quotes (which CATQ adds by default).
If you are on a version before CATQ was available (it was added in 9.2, I believe), you can use CATX instead, but you lose the modifiers, meaning you have more trouble with empty strings and embedded delimiters.

String storage optimization

I'm looking for some C++ library that would help to optimize memory usage by storing similar (not exact) strings in memory only once. It is not FlyWeight or string interning which is capable to store exact objects/strings only once. The library should be able to analyze and understand that, for example, two particular strings of different length have identical first 100 characters, this substring should be stored only once.
Example 1:
std::string str1 = "http://www.contoso.com/some/path/app.aspx?ch=test1";
std::string str2 = "http://www.contoso.com/some/path/app.aspx?ch=test2";
in this case it is obvious that the only difference in these two strings is the last character, so it would be a great saving in memory if we hold only one copy of "http://www.contoso.com/some/path/app.aspx?ch=test" and then two additional strings "1" and "2"
Example 2:
std::string str1 = "http://www.contoso.com/some/path/app.aspx?ch1=test1";
std::string str2 = "http://www.contoso.com/some/path/app.aspx?ch2=test2";
this is more complicated case when there are multiple identical substrings : one copy of "http://www.contoso.com/some/path/app.aspx?ch", then two strings "1" and "2", one copy of "=test" and since we already have strings "1" and "2" stored we don't need any additional strings.
So, is there such a library? Is there something that can help to develop such a library relatively fast? The strings are immutable, so there is no need to worry about updating indexes or locking for thread safety.
If the strings share common prefixes, one solution is a radix tree (a space-compressed trie, http://en.wikipedia.org/wiki/Radix_tree) for string representation. You then store only a pointer to a tree leaf and recover the whole string by walking up to the root.
hello world   [0]
hello winter  [1]
hell          [2]

        [2]
       /
h-e-l-l-o-' '-w-o-r-l-d-[0]
               \
                i-n-t-e-r-[1]
Here is one more solution: http://en.wikipedia.org/wiki/Rope_(data_structure)
libstdc++ implementation: https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USERS-4.3/a00223.html
SGI documentation: http://www.sgi.com/tech/stl/Rope.html
But I think you need to construct your strings carefully for a rope to work well. One approach: find the longest common prefix and suffix of every new string against the previous string, and then express the new string as a concatenation of the previous string's prefix, the unique middle part, and the previous string's suffix.
For example 1, what I can come up with is a radix tree, a space-optimized version of a trie. A quick Google search turns up quite a few implementations in C++.
For example 2, I am also curious about the answer!
First of all, note that std::string is not immutable and you have to make sure that none of these strings are accidentally modified.
This depends on the pattern of the strings. I suggest using hash tables (std::unordered_map in C++11). The exact details depend on how you are going to access these strings.
The two strings you have provided differ only after the "?ch" part. If you expect that many strings will have long common prefixes of roughly the same size, you can do the following:
Let's say the size of the prefix is 43 chars. Let s be a string. Then we can use s[0..42] as a key into the hash table and the rest of the string as the value.
For example, given "http://www.contoso.com/some/path/app.aspx?ch=test1", the key would be "http://www.contoso.com/some/path/app.aspx?" and "ch=test1" would be the value. If the key already exists in the hash table, you can just add the value to the collection of values associated with that key. Otherwise, add the key/value pair.
This is just an example, what the key is and what the value is depend on how you are going to access these strings.
Also, if all strings have "=test" in them, then you don't have to store it with every value. You can store it once and insert it when retrieving a string. So given the value "ch1=test1", what gets stored is just "ch11". This depends on the pattern of the strings.

QString::compare() vs converting QString to numbers and then comparing

Is it faster to compare 2 QStrings containing numbers, or to convert those QStrings to numbers and then compare the numbers?
so which is faster?
QString str1,str2;
if(str1.compare(str2)==0)
OR
QString str1,str2;
if(str1.toInt()==str2.toInt())
The reason I'm asking is because I have to fill a QMap with error codes and error messages corresponding to those error codes. I'll be reading the error code / error message from an ini file, so I'm wondering whether it's better to convert the error codes to integers and have QMap<int,QString> or to just keep them as QStrings and have QMap<QString,QString>. Which approach will give me the most optimal code?
Where the QMap contains <error code, error message>
String comparison is likely to end in trouble: "1.00" != "1.0" != "1" != "0001"
Always use numeric types for comparing numbers, and don't worry about imagined performance issues of such a minuscule piece of any whole.
For a one-time use, just comparing the strings will (probably) be faster than converting them to numbers and comparing the numbers.
If you need the result as a number for other steps then convert them to numbers at the start and store numbers.
If your error codes are contiguous then you would typically put them into a vector indexed by [error_code - first_error_code].
BUT before doing any optimisation: 1. measure, 2. decide if you care.
In the case of the code you've written, doing two conversions and comparing the results is going to be slower than comparing the strings directly.
The thing is, to do a string comparison, you must at worst visit each character of each string. In the != case, you may visit fewer characters before finding the diff and exiting the compare (I assume a compare routine that exits early on fail). In the convert and compare case, you MUST visit all characters of both strings, every time. So the direct compare case will be faster.
In the case of the maps, you'll want to use QString because you'll do the conversion once and do the compare many, many times. That means that the cost of the conversion will be swamped by the savings from the comparisons and you'll win in the end.
With QString keys, the map performs string comparisons on every insertion, deletion and lookup. Since those comparisons are done repeatedly, it's cheaper to convert the string to an integer before using it as a map key. Such a conversion is then done only once per item, and perhaps once per lookup if the key you are looking up arrives in QString form as well.

Best R data structure to return table value counts

The following function returns a data.frame with two columns:
fetch_count_by_day=function(con){
q="SELECT t,count(*) AS count FROM data GROUP BY t"
dbGetQuery(con,q) #Returns a data frame
}
t is a DATE column, so output looks like:
t count(*)
1 2011-09-22 1438
...
All I'm really interested in is if any records for a given date already exist; but I will also use the count as a sanity check.
In C++ I'd return a std::map<std::string,int> or std::unordered_map<std::string,int> (*).
In PHP I'd use an associative array with the date as the key.
What is the best data structure in R? Is it a 2-column data.frame? My first thought was to turn the t column into rownames:
...
d=dbGetQuery(con,q)
rownames(d)=d[,1]
d$t=NULL
But data.frame rownames are meant to be unique row labels rather than a lookup key, so conceptually it does not quite fit. I'm also not sure it makes using it any quicker.
(Any and all definitions of "best": quickest, least memory, code clarity, least surprise for experienced R developers, etc. Maybe there is one solution for all; if not then I'd like to understand the trade-offs and when to choose each alternative.)
*: (for C++) If benchmarking showed this was a bottleneck, I might convert the datestamp to a YYYYMMDD integer and use std::unordered_map<int,int>; knowing the data only covers a few years I might even use a block of memory with one int per day between min(t) and max(t) (wrapping all that in a class).
Contingency tables in R are actually arrays (or matrices) and can be created very easily. The dimnames hold the values, and the array/matrix at its core holds the count data. The table and tapply functions are the natural creators. You access the counts with "[", and dimnames() followed by "[" gets you the row and column names. I would also say it is wiser to use the Date class for dates than to store them in character vectors.

Is this an acceptable use of "ASCII arithmetic"?

I've got a string value of the form 10123X123456 where 10 is the year, 123 is the day number within the year, and the rest is unique system-generated stuff. Under certain circumstances, I need to add 400 to the day number, so that the number above, for example, would become 10523X123456.
My first idea was to substring those three characters, convert them to an integer, add 400 to it, convert them back to a string and then call replace on the original string. That works.
But then it occurred to me that the only character I actually need to change is the third one, and that the original value would always be 0-3, so there would never be any "carrying" problems. It further occurred to me that the ASCII code points for the numbers are consecutive, so adding the number 4 to the character "0", for example, would result in "4", and so forth. So that's what I ended up doing.
My question is, is there any reason that won't always work? I generally avoid "ASCII arithmetic" on the grounds that it's not cross-platform or internationalization friendly. But it seems reasonable to assume that the code points for numbers will always be sequential, i.e., "4" will always be 1 more than "3". Anybody see any problem with this reasoning?
Here's the code.
string input = "10123X123456";
input[2] += 4;
//Output should be 10523X123456
From the C++ standard ([lex.charset]):
In both the source and execution basic character sets, the value of each character after 0 in the
above list of decimal digits shall be one greater than the value of the previous.
So yes, if you're guaranteed to never need a carry, you're good to go.
The C++ language definition requires that the code-point values of the decimal digits be consecutive, so this arithmetic is perfectly acceptable.
Always keep in mind that if this value is generated by something you do not entirely control (such as users or a third-party system), something can and will go wrong with it. (Check out Murphy's laws.)
So I think you should at least put on some validations before doing so.
It sounds like altering the string as you describe is easier than parsing the number out in the first place. So if your algorithm works (and it certainly does what you describe), I wouldn't consider it premature optimization.
Of course, after you add 400, it's no longer a day number, so you couldn't apply this process recursively.
And, <obligatory Year 2100 warning>.
A very long time ago I saw some x86 processor instructions for ASCII and BCD arithmetic:
AAA (ASCII Adjust After Addition), AAS (subtraction), AAM (multiplication), AAD (division).
But even if you are not sure about the target platform, you can refer to the specification of the character set you are using, and I expect you'll find that the first 128 ASCII characters carry the same meaning in virtually every character set (in Unicode they form the Basic Latin block).