How exactly do the hashlib hashers treat input? - python-2.7

The Python 2.7 documentation has this to say about the hashlib hashers:
hash.update(arg)
Update the hash object with the string arg. [...]
But I have seen people feed it objects that are not strings, e.g. buffers, numpy ndarrays.
Given Python's duck typing, I'm not surprised that it is possible to specify non-string arguments.
The question is: how do I know the hasher is doing the right thing with the argument?
I can't imagine the hasher naïvely doing a shallow iteration over the argument, because that would probably fail miserably with ndarrays of more than one dimension - a shallow iteration over an n-dimensional ndarray yields (n-1)-dimensional ndarrays, not bytes.

update unpacks its argument using the s# format spec, which means the argument can be a string, a Unicode object, or any object exposing the buffer interface.
You can't define a buffer interface in pure Python, but C libraries like numpy can and do - which is what allows their objects to be passed to hash.update.
Things like multi-dimensional arrays work fine - at the C level they're stored as a contiguous series of bytes.
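Not hashlib itself, but a small C++ sketch of that last point: a "two-dimensional" array is one contiguous run of bytes, so a byte-oriented consumer (much like what update ultimately sees through the buffer interface) can walk it without knowing its shape. The names here are made up for the illustration, and the FNV-1a checksum is just a stand-in for a real hash.
#include <cstddef>
#include <cstdint>
#include <iostream>

// Toy byte-wise "hash": like update(), it only ever sees raw bytes.
std::uint32_t consume_bytes(const unsigned char* data, std::size_t len)
{
    std::uint32_t h = 2166136261u;       // FNV-1a offset basis
    for (std::size_t i = 0; i < len; ++i)
    {
        h ^= data[i];
        h *= 16777619u;                  // FNV-1a prime
    }
    return h;
}

int main()
{
    // The 2-D array is laid out as one contiguous block of 2*3 doubles.
    double grid[2][3] = { {1.0, 2.0, 3.0}, {4.0, 5.0, 6.0} };

    // Feed the whole block to the byte consumer; the shape is irrelevant.
    std::cout << std::hex
              << consume_bytes(reinterpret_cast<const unsigned char*>(grid),
                               sizeof grid)
              << '\n';
    return 0;
}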

Related

Generate all permutations of a given String in D

I'm trying to write a program in D that generates all permutations for a given string. I've been trying to use the function nextPermutation, but it's only compatible with ints. I can't get it to work with a char array. I was wondering if someone could help point me in the right direction? This is what I have so far:
import std.stdio;
import std.algorithm.sorting: nextPermutation;

void main()
{
    char array[] = {'a','b','c'};
    do
    {
        writeln(array);
    } while (nextPermutation(array));
}
So it isn't only compatible with ints - it works with anything Phobos considers "bidirectional" and "swappable": an array it can walk backwards through and whose individual elements it can swap. It considers a plain string to be non-swappable because of its UTF-8 encoding: with a variable-length element encoding, swapping two chars may require reshuffling the entire array, which would be far more expensive than the function allows.
Thus, the easiest way to make this work is to use a type which Phobos considers to be swappable: a UTF-32 string, aka dchar[].
If you just change your char to dchar, it will work.
You might also want to change your array syntax from C style to D style:
dchar[] array = ['a','b','c'];
There you go.
So, I said "it considers" because this is a somewhat controversial library decision. I'd argue UTF-32 isn't really swappable either, for much the same reason UTF-8 isn't - there can be paired elements, and changing their order can corrupt the data. But you don't need to worry about that for simple cases like yours.

What's the OCaml naming convention for "constructors"?

An OCaml module usually contains at least one abstract type whose idiomatic name is t. Also, there's usually a function that constructs a value of that type.
What is the usual / idiomatic name for this?
The StdLib is not consistent here. For example:
There's Array.make and a deprecated function Array.create. So that function should be named make?
On the other hand, there's Buffer.create but not Buffer.make. So that function should be named create?
Some people find that this style of module design makes OCaml programming easier, but it is not a mandatory OCaml programming style, and I do not think there is an official name for it. I personally call it the "1-data-type-per-1-module" style. (I wrote a blog post about this, but it is in Japanese. I hope an automatic translation gives you some useful information.)
Defining a module dedicated to one data type and fixing the name of that type to t has some benefits:
Nice namespacing
The module name explains what its type and values are, so you do not need to repeat the type name inside it: Buffer.add_string instead of add_string_to_buffer, and Buffer.create instead of create_buffer. You can also avoid typing the same module name repeatedly by opening the module locally:
let f () =
  let open Buffer in
  let b = create 10 in    (* instead of Buffer.create *)
  add_string b "hello";   (* instead of Buffer.add_string *)
  contents b              (* instead of Buffer.contents *)
Easy ML functor application
If an ML functor takes an argument module with a data type, the convention is that the type is called t. Modules whose data type is named t can then be applied to these functors without renaming the type.
For Array.create and Array.make, I think this follows the distinction between String.create and String.make.
String.create is to create a string with uninitialized contents. The created string contains random bytes.
String.make is to create a string filled with the given char.
We had Array.create for a long time, creating an array whose contents are filled with the given value. This behaviour corresponds to String.make rather than String.create, which is why it has been renamed to Array.make, and Array.create is now obsolete.
We cannot have an Array.create in OCaml with the same behaviour as String.create. Unlike strings, arrays cannot be created without initialization, since random bytes do not, in general, represent a valid OCaml value for the element type, which would lead to a program crash.
Following this, I personally use X.create for a function that creates an X.t without requiring an initial value to fill it with, and X.make when it does need one.
I had the same question when I picked up the language a long time ago. I never use make and I think few people do.
Nowadays I use create for heavy, often imperative or stateful values, e.g. a Unicode text segmenter. And I use v for lighter, functional values in DSL/combinator-based settings, e.g. the various constructors in Gg, for example for 2D vectors or colors.
As camlspotter mentions in his answer, the standard library distinguishes make from create for values that need an initial value to fill them with. I think it's better to be regular here and always use create regardless. If your values support an optional initial fill value, add an optional argument to create rather than multiplying the API entry points.

SML Basis Library: what's the rationale for `ArraySlice.copyVec`?

Prior understanding
According to The ArraySlice structure as reported by standardml.org (I don't have an SML 97 manual to check, only an SML 90 PDF manual), ArraySlice.copyVec takes an Array.array parameter as the destination, and not an ArraySlice.slice as one (or at least I) would intuitively expect. Of course, one can use ArraySlice.base to get an array and an index for the dst and di parameters of copyVec, respectively. Surprisingly, copyVec from ArraySlice does not even have a single parameter of type ArraySlice.slice. Fortunately, its src parameter is of type VectorSlice.slice, as intuitively expected.
The question
What's the rationale for ArraySlice.copyVec? Why doesn't it get an ArraySlice.slice as dst?
Presumably, because that would require the additional constraint that both slices have the same length (or at least dst is larger than src). The current API gets by without such an extra side condition. Also, it would probably be somewhat more cumbersome to use.

Python array management C++ equivalent

I know SO is not rent-a-coder, but I have a really simple python example that I need help translating to C++
grey_image_as_array = numpy.asarray( cv.GetMat( grey_image ) )
non_black_coords_array = numpy.where( grey_image_as_array > 3 )
# Convert from numpy.where()'s two separate lists to one list of (x, y) tuples:
non_black_coords_array = zip( non_black_coords_array[1], non_black_coords_array[0] )
The first one is rather simple, I guess - a linearly indexable array is created from the bytes returned by cv.GetMat, right?
What would be the equivalent of Python's where and, especially, the zip function?
I don't know about OpenCV, so I can't tell you what cv.GetMat() does. Apparently, it returns something that can be used as, or converted to, a two-dimensional array. The C or C++ interface to OpenCV that you are using will probably have a similarly named function.
The following lines create an array of index pairs for the entries in grey_image_as_array that are greater than 3. Each entry in non_black_coords_array is a zero-based x-y coordinate pair into grey_image_as_array. Given such a coordinate pair x, y, you can access the corresponding entry in a two-dimensional C++ array grey_image_as_array with grey_image_as_array[y][x].
The Python code has to avoid explicit loops over the image to achieve good performance, so it has to make do with the vectorised functions NumPy offers. The expression grey_image_as_array > 3 is a vectorised comparison and results in a Boolean array of the same shape as grey_image_as_array. Next, numpy.where() extracts the indices of the True entries in this Boolean array, but the result is not in the format described above, so we need zip() to restructure it.
In C++, there's no need to avoid explicit loops, and an equivalent of numpy.where() would be rather pointless -- you just write the loops and store the result in the format of your choice.
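To make that concrete, here is a rough sketch of what those three Python lines could look like as a plain C++ loop. The container choice and the names (non_black_coords, image, threshold) are placeholders for whatever your OpenCV code actually gives you.
#include <cstddef>
#include <utility>
#include <vector>

// Collect the (x, y) coordinates of all pixels brighter than the threshold -
// the same result numpy.where(...) followed by zip(...) produced in Python.
std::vector<std::pair<std::size_t, std::size_t>>
non_black_coords(const std::vector<std::vector<unsigned char>>& image,
                 unsigned char threshold = 3)
{
    std::vector<std::pair<std::size_t, std::size_t>> coords;
    for (std::size_t y = 0; y < image.size(); ++y)          // rows
    {
        for (std::size_t x = 0; x < image[y].size(); ++x)   // columns
        {
            if (image[y][x] > threshold)
                coords.emplace_back(x, y);                   // (x, y), matching the zip() order
        }
    }
    return coords;
}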

Named parameter string formatting in C++

I'm wondering if there is a library like Boost Format, but which supports named parameters rather than positional ones. This is a common idiom in e.g. Python, where you format strings against a context that may or may not use all of the available arguments, e.g.
mouse_state = {}
mouse_state['button'] = 0
mouse_state['x'] = 50
mouse_state['y'] = 30
#...
"You clicked %(button)s at %(x)d,%(y)d." % mouse_state
"Targeting %(x)d, %(y)d." % mouse_state
Are there any libraries that offer the functionality of those last two lines? I would expect it to offer an API something like:
PrintFMap(string format, map<string, string> args);
In Googling I have found many libraries offering variations of positional parameters, but none that support named ones. Ideally the library has few dependencies so I can drop it easily into my code. C++ won't be quite as idiomatic for collecting named arguments, but probably someone out there has thought more about it than me.
Performance is important, in particular I'd like to keep memory allocations down (always tricky in C++), since this may be run on devices without virtual memory. But having even a slow one to start from will probably be faster than writing it from scratch myself.
The fmt library supports named arguments:
print("You clicked {button} at {x},{y}.",
arg("button", "b1"), arg("x", 50), arg("y", 30));
And as syntactic sugar you can even (ab)use user-defined literals to pass arguments:
print("You clicked {button} at {x},{y}.",
"button"_a="b1", "x"_a=50, "y"_a=30);
For brevity the namespace fmt is omitted in the above examples.
Disclaimer: I'm the author of this library.
I've always been critical of C++ I/O (especially formatting) because in my opinion it is a step backward with respect to C. Formats need to be dynamic, and it makes perfect sense, for example, to load them from an external resource such as a file or a parameter.
However, I had never actually tried to implement an alternative before, and your question made me attempt it, investing some weekend hours in the idea.
Sure, the problem was more complex than I thought (for example, just the integer formatting routine is 200+ lines), but I think that this approach (dynamic format strings) is more usable.
You can download my experiment from this link (it's just a .h file) and a test program from this link (test is probably not the correct term, I used it just to see if I was able to compile).
The following is an example
#include "format.h"
#include <iostream>
using format::FormatString;
using format::FormatDict;
int main()
{
    std::cout << FormatString("The answer is %{x}") % FormatDict()("x", 42);
    return 0;
}
It is different from the boost::format approach because it uses named parameters and because the format string and the format dictionary are meant to be built separately (and, for example, passed around). Also, I think that formatting options should be part of the string (as with printf) and not in the code.
FormatDict uses a trick for keeping the syntax reasonable:
FormatDict fd;
fd("x", 12)
  ("y", 3.141592654)
  ("z", "A string");
FormatString is instead just parsed from a const std::string& (I decided to pre-parse the format strings; a slower but probably acceptable approach would be to just pass the string and re-parse it each time).
The formatting can be extended for user-defined types by specializing a conversion function template; for example:
struct P2d
{
    int x, y;

    P2d(int x, int y)
        : x(x), y(y)
    {
    }
};

namespace format {
    template<>
    std::string toString<P2d>(const P2d& p, const std::string& parms)
    {
        return FormatString("P2d(%{x}; %{y})") % FormatDict()
            ("x", p.x)
            ("y", p.y);
    }
}
After that, a P2d instance can simply be placed in a formatting dictionary.
It's also possible to pass parameters to a formatting function by placing them between % and {.
For now I only implemented an integer formatting specialization that supports
Fixed size with left/right/center alignment
Custom filling char
Generic base (2-36), lower or uppercase
Digit separator (with both custom char and count)
Overflow char
Sign display
I've also added some shortcuts for common cases, for example
"%08x{hexdata}"
is a hex number with 8 digits, padded with '0's.
"%026/2,8:{bindata}"
is a 24-bit binary number (as required by "/2") with digit separator ":" every 8 bits (as required by ",8:").
Note that the code is just an idea; for example, for now I simply prevented copies, when it would probably be reasonable to allow storing both format strings and dictionaries (for dictionaries, however, it's important to provide a way to avoid copying an object just because it needs to be added to a FormatDict; while IMO this is possible, it also raises non-trivial problems about lifetimes).
UPDATE
I've made a few changes to the initial approach:
Format strings can now be copied
Formatting for custom types is done using template classes instead of functions (this allows partial specialization)
I've added a formatter for sequences (two iterators). Syntax is still crude.
I've created a GitHub project for it, with Boost licensing.
The answer appears to be, no, there is not a C++ library that does this, and C++ programmers apparently do not even see the need for one, based on the comments I have received. I will have to write my own yet again.
Well, I'll add my own answer as well - not that I know of (or have coded) such a library, but to address the "keep the memory allocations down" bit.
As always I can envision some kind of speed / memory trade-off.
On the one hand, you can parse "Just In Time":
class Formatter:
    def __init__(self, format):
        self._string = format

    # __contains / __extract / __replace stand in for the actual scanning,
    # splitting and substitution logic.
    def compute(self, context):
        for k, v in context.items():
            while self.__contains(k):
                left, variable, right = self.__extract(k)
                self._string = left + self.__replace(variable, v) + right
This way you don't keep a "parsed" structure at hand, and hopefully most of the time you'll just insert the new data in place (unlike Python, C++ strings are not immutable).
However it's far from being efficient...
On the other hand, you can build a fully constructed tree representing the parsed format. You will have several classes like: Constant, String, Integer, Real, etc... and probably some subclasses / decorators as well for the formatting itself.
I think, however, that the most efficient approach would be some kind of mix of the two:
explode the format string into a list of Constant, Variable
index the variables in another structure (a hash table with open-addressing would do nicely, or something akin to Loki::AssocVector).
There you are: you're done with only 2 dynamically allocated arrays (basically). If you want to allow the same key to be repeated multiple times, simply use a std::vector<size_t> as the value type of the index: good implementations should not allocate any memory dynamically for small vectors (VC++ 2010 doesn't for less than 16 bytes' worth of data).
When evaluating the context itself, look up the instances. You then parse the formatter "just in time", check it against the current type of the value with which to replace it, and process the format.
Pros and cons:
- Just In Time: you scan the string again and again
- One Parse: requires a lot of dedicated classes, possibly many allocations, but the format is validated on input. Like Boost it may be reused.
- Mix: more efficient, especially if you don't replace some values (allow some kind of "null" value), but delaying the parsing of the format delays the reporting of errors.
Personally, I would go for the One Parse scheme, trying to keep the allocations down using boost::variant and the Strategy Pattern as much as I could.
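To pin the idea down, here is a minimal, allocation-naive sketch of the "explode into constants and variables, index the names" structure described above. The names (Segment, ParsedFormat, render) are made up, everything is std::string-based, and none of the per-type formatting, boost::variant, or small-vector tricks are shown.
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One parse: split "Hello %{name}!" into literal and variable segments once.
struct Segment
{
    bool is_variable;
    std::string text;    // literal text, or the variable name
};

struct ParsedFormat
{
    std::vector<Segment> segments;
    std::multimap<std::string, std::size_t> index;   // name -> segment positions

    explicit ParsedFormat(const std::string& fmt)
    {
        std::size_t pos = 0;
        while (pos < fmt.size())
        {
            std::size_t open = fmt.find("%{", pos);
            if (open == std::string::npos)
            {
                segments.push_back({false, fmt.substr(pos)});
                break;
            }
            if (open > pos)
                segments.push_back({false, fmt.substr(pos, open - pos)});
            std::size_t close = fmt.find('}', open);
            if (close == std::string::npos)
            {
                segments.push_back({false, fmt.substr(open)});   // unterminated: keep as literal
                break;
            }
            std::string name = fmt.substr(open + 2, close - open - 2);
            index.emplace(name, segments.size());
            segments.push_back({true, name});
            pos = close + 1;
        }
    }

    // Each evaluation only walks the pre-parsed segments; no re-scanning.
    std::string render(const std::map<std::string, std::string>& context) const
    {
        std::string out;
        for (const Segment& s : segments)
        {
            if (!s.is_variable)
            {
                out += s.text;
                continue;
            }
            auto it = context.find(s.text);
            out += (it != context.end()) ? it->second : "";   // missing key acts as a "null" value
        }
        return out;
    }
};
A ParsedFormat built once can be rendered repeatedly against different contexts, which is where the up-front parse (and parse-time error reporting) pays off.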
Given that Python itself is written in C, and that formatting is such a commonly used feature, you might be able (ignoring copyright issues) to rip the relevant code out of the Python interpreter and port it to use STL maps rather than Python's native dicts.
I've written a library for this purpose; check it out on GitHub.
Contributions are welcome.