String.blit vs String.sub in OCaml

Is it better to use String.blit or String.sub in OCaml? By better I mean more time- or memory-efficient or even just more idiomatic.
I.e. is it "better" to do:
let new_string = String.sub old_string 0 4;;
Or
String.blit old_string 0 new_string 0 4;; (* I guess new_string is a byte seq here. *)

As those two functions have different semantics, it depends on what you want to do.
String.blit could indeed be used to copy part of a string into a fresh buffer, but I think you should not use it in place of sub. First, that approach breaks once you switch to OCaml 4.02, where strings become immutable and the destination of blit must be a Bytes buffer. It also makes your code much less clear (and you would have to add an explicit buffer allocation).
Also, note that blit is an imperative feature, whereas sub is purely functional, so it mainly depends on your personal programming style. In terms of performance they are quite comparable. That may change if the dev team decides to use a different representation for constant strings (no hurry though; it won't happen in the next few years).

Both of them call the same function unsafe_blit internally. I would use whichever makes your code the clearest.

String.blit allows reusing the same buffer or working with parts of buffers. If you don't need those things, just use sub.
In terms of complexity, both operations are linear in the size of the substring/interval.
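For reference, a minimal sketch contrasting the two on current OCaml, where plain strings are immutable and blit's destination must therefore be a bytes buffer:
let () =
  let old_string = "Hello, world" in
  (* Functional style: sub allocates and returns a fresh substring. *)
  let new_string = String.sub old_string 0 4 in
  (* Imperative style: blit copies into a buffer you allocate yourself;
     String.blit : string -> int -> bytes -> int -> int -> unit *)
  let buf = Bytes.create 4 in
  String.blit old_string 0 buf 0 4;
  assert (new_string = Bytes.to_string buf);
  print_endline new_string  (* prints "Hell" *)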

Related

Should function composition and piping be tested?

In F# (and most functional languages) some code is extremely short, as follows:
let f = getNames
        >> Observable.flatmap ObservableJson.jsonArrayToObservableObjects<string>
or:
let jsonArrayToObservableObjects<'t> =
    JsonConvert.DeserializeObject<'t[]>
    >> Observable.ToObservable
And the simplest property-based test I ended up with for the latter function is:
testList "ObservableJson" [
testProperty "Should convert an Observable of `json` array to Observable of single F# objects" <| fun _ ->
//--Arrange--
let (array , json) = createAJsonArrayOfString stringArray
//--Act--
let actual = jsonArray
|> ObservableJson.jsonArrayToObservableObjects<string>
|> Observable.ToArray
|> Observable.Wait
//--Assert--
Expect.sequenceEqual actual sArray
]
Regardless of the arrange part, the test is longer than the function under test, which makes it harder to read than the function itself!
What would be the value of testing when it's harder to read than the production code?
On the other hand:
I wonder whether functions that are just a composition of multiple functions are safe to leave untested.
Should they be tested at integration and acceptance level?
And what if they are short but do complex operations?
It depends upon your definition of what 'functional programming' is. Or, more precisely, upon how close you want to stay to the origin of functional programming: math, in both its broad and narrow meanings.
Let's take something relevant to programming. Say, the theory of mappings. Your question could be translated this way: having a bijection from A to B, and a bijection from B to C, should I prove that the composition of those two is a bijection as well? The answer is twofold: you definitely should, and you do it only once: your proof is generic enough to cover all possible cases.
Falling back to programming, it means that pipelining has to be tested (proved) only once - and I guess it was, before being deployed to production. From then on, it is your job as a programmer to create functions (mappings) of such quality that, when composed with a pipeline operator or whatever else, the desired properties are preserved. Once again, it's better to stick with generic arguments than to write tons of similar tests.
So, finally, we come down to a much more valuable question: how can one guarantee that some operation preserves some property? It turns out that the easiest way to acknowledge such a fact is to deal with types like Monoid from the great Haskell: for example, Monoid is there to represent any associative binary operation A -> A -> A together with some identity element of type A. Having such generic containers is extremely profitable, and the best-known way of being explicit about what exactly your code is designed to do.
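To make the "prove it once, generically" idea concrete, here is a minimal property-based sketch in the question's own Expecto/FsCheck style: the composition law is stated once, for arbitrary generated functions and inputs, instead of being re-tested for every concrete pipeline.
open Expecto

// One generic property: composing with >> behaves exactly like
// applying f first and then g, for any f, g and x FsCheck generates.
let compositionLaw =
    testProperty "(f >> g) x = g (f x)" <| fun (f: int -> int) (g: int -> int) (x: int) ->
        (f >> g) x = g (f x)
Once that law is trusted, only the individual building blocks (getNames, jsonArrayToObservableObjects, and so on) need tests of their own.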
Personally I would NOT test it.
In fact, having less need for testing, and instead relying more on stricter compiler rules, side-effect-free functions, immutability, etc., is one major reason why I prefer F# over C#.
Of course I continue (unit) testing "custom logic", e.g. algorithmic code.

Why would I ever want to use Maybe instead of a List?

Seeing as the Maybe type is isomorphic to the set of empty and singleton lists, why would anyone ever want to use the Maybe type when I could just use lists to accommodate absence?
Because if you match a list against the patterns [] and [x] that's not an exhaustive match and you'll get a warning about that, forcing you to either add another case that'll never get called or to ignore the warning.
Matching a Maybe against Nothing and Just x however is exhaustive. So you'll only get a warning if you fail to match one of those cases.
If you choose your types such that they can only represent values that you may actually produce, you can rely on non-exhaustiveness warnings to tell you about bugs in your code where you forget to check for a given a case. If you choose more "permissive" types, you'll always have to think about whether a warning represents an actual bug or just an impossible case.
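A minimal illustration (compile with GHC's -Wall to see the difference):
-- Exhaustive: Nothing and Just cover every Maybe.
describe :: Maybe Int -> String
describe Nothing  = "no value"
describe (Just x) = "value: " ++ show x

-- Not exhaustive: GHC warns that (_:_:_) is unmatched, even if we
-- "know" the list never has two or more elements.
describeList :: [Int] -> String
describeList []  = "no value"
describeList [x] = "value: " ++ show x

main :: IO ()
main = putStrLn (describe (Just 1)) >> putStrLn (describeList [1])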
You should strive to have accurate types. Maybe expresses that there is exactly one value or that there is none. Many imperative languages represent the "none" case by the value null.
If you chose a list instead of Maybe, all your functions would be faced with the possibility that they get a list with more than one member. Probably many of them would only be defined for one value, and would have to fail on a pattern match. By using Maybe, you avoid a class of runtime errors entirely.
Building on existing (and correct) answers, I'll mention a typeclass-based one.
Different types convey different intentions - returning Maybe a represents a computation with the possibility of failing, while [a] could represent non-determinism (or, in simpler terms, multiple possible return values).
This plays into the fact that different types have different instances for typeclasses - and these instances cater to the underlying essence the type conveys. Take Alternative and its operator (<|>), which represents what it means to combine (or choose between) the arguments given:
Maybe a - combining computations that can fail just means taking the first result that is not Nothing.
[a] - combining two computations that each had multiple return values just means concatenating together all possible values.
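A small sketch of the difference:
import Control.Applicative ((<|>))

main :: IO ()
main = do
  print (Nothing <|> Just 2)     -- Just 2: first non-Nothing result wins
  print (Just 1  <|> Just 2)     -- Just 1
  print ([1, 2]  <|> [3 :: Int]) -- [1,2,3]: lists concatenate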
Then, depending on which types your functions use, (<|>) would behave differently. Of course, you could argue that you don't need (<|>) or anything like that, but then you are missing out on one of Haskell's main strengths: its many high-level combinator libraries.
As a general rule, we like our types to be as snug fitting and intuitive as possible. That way, we are not fighting the standard libraries and our code is more readable.
Lisp, Scheme, Python, Ruby, JavaScript, etc., manage to get along with just one type each, which you could represent in Haskell with a big sum type. Every function handling a JavaScript (or whatever) value must be prepared to receive a number, a string, a function, a piece of the document object model, etc., and throw an exception if it gets something unexpected. People who program in typed languages like Haskell prefer to limit the number of unexpected things that can occur. They also like to express ideas using types, making types useful (and machine-checked) documentation. The closer the types come to representing the intended meaning, the more useful they are.
Because there are infinitely many possible lists, while Maybe has exactly the two shapes you need: it perfectly represents one thing or the absence of something, without any other possibility.
Several answers have mentioned exhaustiveness as a factor here. I think it is a factor, but not the biggest one, because there is a way to consistently treat lists as if they were Maybes, which the listToMaybe function illustrates:
listToMaybe :: [a] -> Maybe a
listToMaybe [] = Nothing
listToMaybe (a:_) = Just a
That's an exhaustive pattern match, which rules out any straightforward errors.
The factor I'd highlight as bigger is that by using the type that more precisely models the behavior of your code, you eliminate potential behaviors that would be possible if you used a more general alternative. Say, for example, you have some context in your code where you use a type of the form a -> [b], though the only correct alternatives (given your program's specification) are empty or singleton lists. Try as hard as you may to enforce the convention that this context should obey that rule; it's still possible that you'll mess up and:
Somehow a function used in that context will produce a list of two or more items;
And somehow a function that uses the results produced in that context will observe whether the lists have two or more items, and behave incorrectly in that case.
Example: some code that expects there to be no more than one value will blindly print the contents of the list and thus print multiple items when only one was supposed to be.
But if you use Maybe, then there really must be either one value or none, and the compiler enforces this.
Even though the types are isomorphic, tools like QuickCheck will run slower on the list encoding because of the increase in search space.

How do I handle combinations of behaviours?

I am considering the problem of validating real numbers of various formats, because this is very similar to a problem I am facing in design.
Real numbers may come in different combinations of formats, for example:
1. with/without sign at the front
2. with/without a decimal point (if no decimal point, then perhaps number of decimals can be agreed beforehand)
3. base 10 or base 16
We need to allow for each combination, so there are 2x2x2=8 combinations. You can see that the complexity increases exponentially with each new condition imposed.
In OO design, you would normally allocate a class for each number format (e.g. in this case, we have 8 classes), and each class would have a separate validation function. However, with each new condition, you have to double the number of classes required and it soon becomes a nightmare.
In procedural programming, you use 3 flags (i.e. has_sign, has_decimal_point and number_base) to identify the property of the real number you are validating. You have a single function for validation. In there, you would use the flags to control its behaviour.
// This is part of the validation function
if (has_sign)
    check_sign();
for (int i = 0; i < len; i++)
{
    if (has_decimal_point) {
        // Check if number[i] is '.' and do something if it is. If not, continue.
    }
    if (number_base == BASE10) {
        // number[i] must be between 0-9
    } else if (number_base == BASE16) {
        // number[i] must be between 0-9 or A-F
    }
}
Again, the complexity soon gets out of hand as the function becomes cluttered with if statements and flags.
I am sure that you have come across design problems of this nature before - a number of independent differences that result in differences in behaviour. I would be very interested to hear how you have been able to implement a solution without making the code completely unmaintainable.
Would something like the bridge pattern have helped?
"In OO design, you would normally allocate a class for each number format (e.g. in this case, we have 8 classes), and each class would have a separate validation function."
No no no no no. At most, you'd have a type representing Numeric Input (in case a plain String doesn't suffice); another one for Real Number (in most languages you'd pick a built-in type, but anyway); and a Parser class, which has the knowledge to take a Numeric Input and transform it into a Real Number.
To be more general, one difference in behaviour does not, in and of itself, automatically map to one class. It can just be a property inside a class. Most importantly, behaviours should be treated orthogonally.
If (imagining that you write your own parser) you may have a sign or not, a decimal point or not, and hex or not, you have three independent sources of complexity and it would be ok to find three pieces of code, somewhere, that treat one of these issues each; but it would not be ok to find, anywhere, 2^3 = 8 different pieces of code that treat the different combinations in an explicit way.
Imagine that you add a new choice: suddenly, you remember that numbers might have an exponent (such as 2.34e10) and want to be able to support that. With the orthogonal strategy, you'll have one more independent source of complexity, the fourth one. With your strategy, the 8 cases would suddenly become 16! Clearly a no-no.
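A minimal C sketch of the orthogonal strategy (all names are hypothetical): each source of complexity is handled by exactly one small piece of code, and the validator merely composes them, so supporting the exponent would mean adding one more helper rather than doubling the cases.
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>

/* Each concern appears exactly once, independent of the others. */
static const char *skip_sign(const char *s, bool allow_sign)
{
    if (allow_sign && (*s == '+' || *s == '-'))
        s++;
    return s;
}

static bool digit_ok(char c, int base)
{
    return base == 16 ? isxdigit((unsigned char)c)
                      : isdigit((unsigned char)c);
}

static bool valid_number(const char *s, bool allow_sign,
                         bool allow_point, int base)
{
    bool seen_point = false, seen_digit = false;
    s = skip_sign(s, allow_sign);
    for (; *s; s++) {
        if (*s == '.' && allow_point && !seen_point)
            seen_point = true;               /* at most one '.' */
        else if (digit_ok(*s, base))
            seen_digit = true;
        else
            return false;
    }
    return seen_digit;                       /* empty or sign-only fails */
}

int main(void)
{
    printf("%d\n", valid_number("-1A.F", true, true, 16)); /* prints 1 */
    return 0;
}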
I don't know why you think that the OO solution would involve a class for each number pattern. My OO solution would be to use a regular expression class. And if I were being procedural, I would probably use the standard library strtod() function.
You're asking for a parser, use one:
http://www.pcre.org/
http://www.complang.org/ragel/
sscanf
boost::lexical_cast
and plenty of other alternatives...
Also: http://en.wikipedia.org/wiki/Parser_generator
Now, how do I handle complexity for this kind of problem? Well, if I can, I reformulate.
In your case, using a parser generator (or a regular expression) is using a DSL (Domain-Specific Language), that is, a language better suited to the problem you're dealing with.
Design patterns and OOP are useful, but definitely not the best solution to each and every problem.
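For instance, a single POSIX extended regular expression can cover all eight combinations at once (a sketch - the exact pattern depends on the format rules you settle on):
#include <regex.h>
#include <stdio.h>

int main(void)
{
    /* Optional sign; base-10 or 0x-prefixed base-16 digits; optional
       single decimal point: one alternation, not eight code paths. */
    const char *pat =
        "^[+-]?([0-9]+(\\.[0-9]+)?|0[xX][0-9a-fA-F]+(\\.[0-9a-fA-F]+)?)$";
    regex_t re;
    if (regcomp(&re, pat, REG_EXTENDED) != 0)
        return 1;
    const char *samples[] = { "-12.5", "0x1F", "+0xA.B", "12..3" };
    for (int i = 0; i < 4; i++)
        printf("%-7s %s\n", samples[i],
               regexec(&re, samples[i], 0, NULL, 0) == 0 ? "valid" : "invalid");
    regfree(&re);
    return 0;
}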
Sorry, but since I use VB, what I do is a base function combined with an evaluator function. So I'll fake-code it out the way I have done it:
function getrealnumber(number as int) { return getrealnumber(number.tostring) }
function getrealnumber(number as float) { return getrealnumber(number.tostring) }
function getrealnumber(number as double) { return getrealnumber(number.tostring) }
function getrealnumber(number as string) {
    if ishex() { return evaluation() }
    if issigned() { return evaluation() }
    if isdecimal() { return evaluation() }
}
And so forth; it's up to you to figure out how to do binary and octal.
You don't kill a fly with a hammer.
I really feel that using an object-oriented solution for your problem is EXTREME overkill. Just because you can design an object-oriented solution doesn't mean you have to force one onto every problem you have.
From my experience, almost every time there is difficulty in finding an OOD solution to a problem, it probably means that OOD is not appropriate. OOD is just a tool, not a god itself. It should be used to solve large-scale problems, not problems such as the one you presented.
So, to give you an actual answer (as someone mentioned above): use a regular expression; every solution beyond that is just overkill.
If you insist on using an OOD solution... Well, since all the formats you presented are orthogonal to each other, I don't see any need to create a class for every possible combination. I would create a class for each format and pass my input through each; in that case the complexity grows linearly.

Why is strncpy insecure?

I am looking to find out why strncpy is considered insecure. Does anybody have any sort of documentation on this or examples of an exploit using it?
Take a look at this site; it's a fairly detailed explanation. Basically, strncpy() doesn't guarantee NUL termination, and is therefore susceptible to a variety of exploits.
The original problem is obviously that strcpy(3) was not a memory-safe operation, so an attacker could supply a string longer than the buffer which would overwrite code on the stack, and if carefully arranged, could execute arbitrary code from the attacker.
But strncpy(3) has another problem: it doesn't NUL-terminate the destination in every case. (Imagine a source string as long as or longer than the destination buffer.) Later operations may expect conforming NUL-terminated C strings between equally sized buffers and malfunction downstream when the result is copied to yet a third buffer.
Using strncpy(3) is better than strcpy(3) but things like strlcpy(3) are better still.
To safely use strncpy, one must either (1) manually stick a NUL character into the result buffer, (2) know that the buffer ends with a NUL beforehand and pass (length - 1) to strncpy, or (3) know that the buffer will never be copied using any method that won't bound its length to the buffer length.
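A minimal sketch of option (1), the most common of the three:
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *src = "a string that may be longer than the buffer";
    char dst[16];
    /* Copy at most sizeof dst - 1 bytes, then terminate explicitly:
       strncpy alone leaves no '\0' when src fills the whole buffer. */
    strncpy(dst, src, sizeof dst - 1);
    dst[sizeof dst - 1] = '\0';
    printf("%s\n", dst); /* prints the truncated "a string that m" */
    return 0;
}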
It's important to note that strncpy will zero-fill everything in the buffer past the copied string, while other length-limited strcpy variants will not. This may in some cases be a performance drain, but in other cases be a security advantage. For example, if one used strlcpy to copy "supercalifragilisticexpalidocious" into a buffer and then to copy "it", the buffer would hold "it^ercalifragilisticexpalidocious^" (using "^" to represent a zero byte). If the buffer gets copied to a fixed-size format, the extra data might tag along with it.
The question is based on a "loaded" premise, which makes the question itself invalid.
The bottom line here is that strncpy is not considered insecure and has never been considered insecure. The only claims of "insecurity" that can be attached to that function are the broad claims of general insecurity of the C memory model and the C language itself. (But that is obviously a completely different topic.)
Within the realm of the C language, the misguided belief in some kind of "insecurity" inherent in strncpy derives from the widespread dubious pattern of using strncpy for "safe string copying", i.e. something this function does not do and was never intended for. Such usage is indeed highly error-prone. But even if you put an equals sign between "highly error-prone" and "insecure", it is still a usage problem (i.e. a lack-of-education problem), not a strncpy problem.
Basically, one can say that the only problem with strncpy is its unfortunate name, which makes newbie programmers assume that they understand what this function does instead of actually reading the specification. Looking at the function name, an incompetent programmer assumes that strncpy is a "safe version" of strcpy, while in reality these two functions are completely unrelated.
Exactly the same claim can be made against the division operator, for one example. As most of you know, one of the most frequently asked questions about the C language goes like this: "I assumed that 1/2 would evaluate to 0.5, but I got 0 instead. Why?" Yet we don't claim that the division operator is insecure just because language beginners tend to misinterpret its behavior.
For another example, we don't call pseudo-random number generator functions "insecure" just because incompetent programmers are often unpleasantly surprised by the fact that their output is not truly random.
That is exactly how it is with the strncpy function. Just as it takes time for beginner programmers to learn what pseudo-random number generators actually do, it takes them time to learn what strncpy actually does. It takes time to learn that strncpy is a conversion function, intended for converting NUL-terminated strings to fixed-width strings. It takes time to learn that strncpy has absolutely nothing to do with "safe string copying" and can't be meaningfully used for that purpose.
Granted, it usually takes much longer for a language student to learn the purpose of strncpy than to sort things out with the division operator. But that learning curve is the entire basis for the "insecurity" claims against strncpy.
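To illustrate that conversion purpose, a small sketch (the struct is hypothetical, but the pattern matches the fixed-width fields strncpy was designed for, such as old Unix directory entries):
#include <stdio.h>
#include <string.h>

struct record {
    char name[8]; /* fixed-width field, not necessarily NUL-terminated */
};

int main(void)
{
    struct record r;
    /* Short names are padded with '\0' up to the field width;
       long names are truncated with no terminator at all. */
    strncpy(r.name, "abcdefghij", sizeof r.name); /* holds "abcdefgh" */
    printf("%.8s\n", r.name); /* bounded print; no terminator needed */
    return 0;
}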
P.S. The CERT document linked in the accepted answer is dedicated to exactly that: demonstrating the insecurities of the typical incompetent abuse of the strncpy function as a "safe" version of strcpy. It is not in any way intended to claim that strncpy itself is somehow insecure.
A patch in Git 2.19 (Q3 2018) finds that it is too easy to misuse system API functions such as strcat() and strncpy(), and forbids those functions in that codebase.
See commit e488b7a, commit cc8fdae, commit 1b11b64 (24 Jul 2018), and commit c8af66a (26 Jul 2018) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit e28daf2, 15 Aug 2018)
banned.h: mark strcat() as banned
The strcat() function has all of the same overflow problems as strcpy().
And as a bonus, it's easy to end up accidentally quadratic, as each subsequent call has to walk through the existing string.
The last strcat() call went away in f063d38 (daemon: use cld->env_array when re-spawning, 2015-09-24, Git 2.7.0).
In general, strcat() can be replaced either with a dynamic string (strbuf or xstrfmt), or with xsnprintf if you know the length is bounded.

Is there a 'catch' with FastFormat?

I just read about the FastFormat C++ I/O formatting library, and it seems too good to be true: faster even than printf, type-safe, and with what I consider a pleasing interface:
// prints: "This formats the remaining arguments based on their order - in this case we put 1 before zero, followed by 1 again"
fastformat::fmt(std::cout, "This formats the remaining arguments based on their order - in this case we put {1} before {0}, followed by {1} again", "zero", 1);
// prints: "This writes each argument in the order, so first zero followed by 1"
fastformat::write(std::cout, "This writes each argument in the order, so first ", "zero", " followed by ", 1);
This looks almost too good to be true. Is there a catch? Have you had good, bad or indifferent experiences with it?
Is there a 'catch' with FastFormat?
Last time I checked, there was one annoying catch:
You can only use either the narrow-string version or the wide-string version of this library. (The functions for wchar_t and char are the same -- which type is used is a compile-time switch.)
With iostreams, stdio or Boost.Format you can use both.
Found one "catch", though for most people it will never manifest. From the project page:
Atomic operation. It doesn't write out statement elements one at a time, like the IOStreams, so has no atomicity issues
The only way I can see this happening is if it buffers the whole write() call's output itself, then writes it out to the ostream in one step. This means it needs to allocate memory, and if an object passed into the write() call produces a lot of output (several megabytes or more), it can consume up to twice that much memory in internal buffers (assuming it uses the grow-a-buffer-by-doubling-its-size-each-time trick).
If you're just using it for logging, and not, say, dumping huge amounts of XML, you'll never see this problem.
The only other "catch" I'm seeing is:
Highly portable. It will work with all good modern C++ compilers; it even works with Visual C++ 6!
So it won't work with an old C++ compiler, like cfront, whereas iostreams are backward compatible to the late '80s. Again, I'd be surprised if anyone ever had a problem with this.
Although FastFormat is a good library, there are a number of issues with it:
Limited formatting support; in particular, the following features are not supported:
- Leading zeros (or any other non-space padding)
- Octal/hexadecimal encoding
- Runtime width/alignment specification
The library is quite big for the relatively small task of formatting, and has an even bigger dependency (STLSoft).
It looks pretty interesting indeed! Good tip regardless, and +1 for that!
I've been playing with it for a bit. The main drawback I see is that FastFormat supports fewer formatting options for the output. This is, I think, a direct consequence of the way the higher type safety is achieved, and a good tradeoff depending on your circumstances.
If you look in detail at the project's performance benchmark page, you'll notice that the good old C printf-family functions still win on Linux. In fact, the only test case where they perform poorly is the one that is mostly static string concatenation, where I would expect printf to be wasteful. Moreover, GCC provides static type-checking on printf-style function calls, so the benefit of type safety is reduced. So: if you are running on Linux and you need the absolute best performance, FastFormat is probably not the optimal solution.
The library depends on a couple of environment variables, as mentioned in the docs.
That might be no biggie to some people, but I'd prefer my code to be as self-contained as possible. If I check it out from source control, it should work and compile. It won't if it requires you to set environment variables.