Usage of BYTES in BigQuery? - google-cloud-platform

I was wondering what the use cases are for the BYTES type in BigQuery. Digging through the public datasets, the only place I could find the BYTES data type in use is the bitcoin_blockchain dataset, and in that case it looks like the data could just be stored as a base64-encoded string (briefly glancing at the preview, it seems it may already be displayed that way).
So basically my question is: what are some use cases for the BYTES type that couldn't just as easily be handled with the STRING type? (Does anyone store multimedia data in BQ or a data warehouse?) Could BQ do everything it currently does without the BYTES type, or is it an essential (and actually used) type?

A base64 string inflates the data by about 33%, because every 3 raw bytes become 4 encoded characters. Since BigQuery charges by the size of the data stored and the size of the data scanned, if you have enough binary data that cost is a concern, BYTES gives you a lower cost.
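To see where the 33% comes from, here is a minimal C++ sketch of the size arithmetic (the 300 MB figure is just an illustrative input):

#include <cstdint>
#include <iostream>

// Base64 encodes every 3 raw bytes as 4 characters (with padding),
// hence the ~33% inflation.
std::uint64_t base64_size(std::uint64_t raw_bytes) {
    return ((raw_bytes + 2) / 3) * 4;
}

int main() {
    std::uint64_t raw = 300000000ULL;  // e.g. ~300 MB of binary data
    std::cout << "raw bytes:    " << raw << "\n";
    std::cout << "base64 bytes: " << base64_size(raw) << "\n";  // 400000000
}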

Related

AWS Redshift Data type space allocation

I am planning to create
id CHARACTER VARYING(100) ENCODE ZSTD,
id2 CHARACTER VARYING(5000) ENCODE ZSTD.
Now my id and id2 values are only about 20 characters long.
In AWS Redshift, does space allocation happen based on the actual data size (20 characters), or does it allocate based on the declared sizes of 100 and 5000 respectively? If so, how is performance affected in these scenarios?
Thanks
TOM
Two things here.
Storage: With varchars, the amount of space consumed is based on the actual amount of space required, not the length declared.
Query performance: Redshift does not know in advance how many bytes a varchar will actually require, so it allocates memory based on the declared length. This causes queries to consume more memory, which can in some cases make them spill to disk. It can have a particularly negative impact on vacuum performance.
Summary: Declare varchars as short as possible. In your case, if the values are around 20 characters, a length of 25-30 would be a good choice.
Amazon Redshift stores data using a Compression Encoding, so it is not vital to allocate the minimum space.
It is often best to allow Redshift to choose the compression type when data is loaded via the COPY command, rather than specifying it yourself. This will result in the most efficient method being chosen, based on the first 100,000 rows loaded.

Reading data from a Binary/Random Access File

I have a file in binary format having a large amount of data.
If I have knowledge of the file structure, how do I read information from the binary file, and populate a record of these structures?
The data is complex.
I'd like to do it with Qt, but plain C++ would be fine as well, if required.
Thanks for your help.
If the binary file is really large, it's better to load it into a (char*) array, provided enough RAM is available, via a low-level read function (http://crasseux.com/books/ctutorial/Reading-files-at-a-low-level.html), and then parse it.
But that only helps you load large files, not parse complex structures.
Not sure, but you could also take a look at yacc.
It doesn't sound like yacc would be a solution; he isn't trying to parse text, he wants to read binary-formatted data into a data structure.
You can read the data in and then map it onto a struct that matches the format. If the data is complex, you may need to lay structs over it in several different ways, depending on how the layout works. Basically: read the file into a char* buffer, find the offset where your record starts, cast that position to a pointer to your struct, and access the members through it. Without more detail it's impossible to be more specific than this.
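A minimal sketch of that pattern, assuming a hypothetical fixed-size record layout (the Record fields and file name here are made up; real code must match the file's actual layout, packing, and byte order). Using memcpy instead of a raw pointer cast sidesteps alignment and strict-aliasing problems:

#include <cstdint>
#include <cstring>
#include <fstream>
#include <iterator>
#include <vector>

#pragma pack(push, 1)   // match the on-disk layout exactly (no compiler padding)
struct Record {         // hypothetical layout; replace with the real one
    std::uint32_t id;
    std::uint16_t flags;
    char          name[26];
};
#pragma pack(pop)

int main() {
    // Slurp the whole file into a byte buffer.
    std::ifstream in("data.bin", std::ios::binary);
    std::vector<char> buf((std::istreambuf_iterator<char>(in)),
                          std::istreambuf_iterator<char>());

    // Overlay the struct at the offset where the record starts (0 here).
    Record rec;
    if (buf.size() >= sizeof rec)
        std::memcpy(&rec, buf.data(), sizeof rec);
    // rec.id, rec.flags, rec.name are now usable,
    // assuming the file's endianness matches the host's.
}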
http://courses.cs.vt.edu/~cs2604/fall00/binio.html may be helpful; it's where I learned this. (Hint: always cast your data as (char*).)

Reading a 200 MB JSON file takes 1.5 GB of memory

I'm using the json_spirit library in C++ to parse a 200 MB JSON file. What surprises me is that when it is read into memory in my program, 1.5 GB of RAM gets used. Is that to be expected when deserializing JSON?
Here is how I'm loading in the json file:
#include <fstream>
#include "json_spirit.h"  // json_spirit's umbrella header

std::ifstream istream(path.c_str());
json_spirit::mValue val;
json_spirit::read(istream, val);
You may try rapidjson.
It is optimized for both memory usage and performance.
With the in situ parsing option (i.e. it modifies the source string while parsing), it incurs only 16 bytes per JSON value to store the DOM on a 32-bit architecture; string values become pointers into the modified source buffer.
I expect the memory usage will be much lower.
On the other hand, rapidjson also supports SAX-style parsing. If the application just needs to traverse the JSON file from beginning to end (e.g. to compute some statistics), the SAX-style API will be even faster and use very little memory (the program stack plus the maximum length of a string value).
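For illustration, a minimal in situ DOM parse with rapidjson might look like this (error handling kept to a minimum; the file name is made up):

#include "rapidjson/document.h"
#include <fstream>
#include <iterator>
#include <vector>

int main() {
    // Read the whole file into a mutable, null-terminated buffer.
    std::ifstream in("data.json", std::ios::binary);
    std::vector<char> buf((std::istreambuf_iterator<char>(in)),
                          std::istreambuf_iterator<char>());
    buf.push_back('\0');

    // In situ parsing mutates buf, and string values in the DOM
    // point into it, so buf must outlive the Document.
    rapidjson::Document doc;
    doc.ParseInsitu(buf.data());
    if (!doc.HasParseError()) {
        // ... traverse doc ...
    }
}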
I think this is not JSON-dependent; it's more a question of data-structure overhead. If you have many small objects, the administrative overhead becomes more and more significant.
That said, more than seven times overhead does seem excessive.

Big-endian bytes vs. strings as keys in string-to-string databases

Nowhere in the documentation of string-to-string databases have I seen the common-sense approach of converting an integer to network byte order and writing the resulting bytes as the indexable entity, rather than writing the string representation of the number.
Surely the size overhead of writing a 64-bit int as a string must outweigh the trivial complexity of a byte-order conversion (ntohl, or its 64-bit equivalent) when reading the bytes back into an integer type.
I am therefore missing something: what are the downsides of using big-endian bytes rather than strings as indexable entities in string-to-string databases?
(C++/C tags because I am talking about writing bytes into the memory of a programmatic type; BDB because that is the database I am using, though it could be kyotodb as well.)
The advantage of big-endian in this case is that the strings would sort correctly in ascending order.
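For example, a portable sketch of producing such a key (byte-shifting avoids depending on a 64-bit htonl variant being available on the platform):

#include <array>
#include <cstdint>

// Encode a 64-bit unsigned value as 8 big-endian bytes, so that
// memcmp/lexicographic order of the keys matches numeric order.
// (Signed values would need the sign bit flipped first.)
std::array<unsigned char, 8> to_big_endian_key(std::uint64_t v) {
    std::array<unsigned char, 8> key;
    for (int i = 7; i >= 0; --i) {
        key[i] = static_cast<unsigned char>(v & 0xFF);
        v >>= 8;
    }
    return key;
}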
If the database architecture cannot natively store 64-bit integers, but you need to store them anyway, stringifying them this way is a way to do it.
Of course if you later upgrade the database to one that can store 64-bit integers natively, you will either be "stuck" with the implementation or have to go through a migration process.
If the database validates that the string data you send is valid in the expected encoding, then you can't just give it any bytes you want; you'll only be able to send integers that happen to look like a valid encoding. I don't know whether BDB or kyotodb do such validation.
Also, it seems like a hack to me to trick one data type into holding something else, and then rely on every client knowing the trick. That applies whether you're using the string to hold an ASCII-decimal representation of the integer or using it as a raw memory buffer for the integer. It seems to me it would be better to use a database that actually supports the types you want to store, instead of just strings.

limit on string size in c++?

I have about a million records, each about 30 characters long, coming in over a socket. Can I read all of them into a single string? Is there a limit on the string size I can allocate?
If so, is there some way I can send the data over the socket record by record and receive it record by record? I don't know the size of each record until runtime.
To answer your first question: The maximum size of a C++ string is given by string::max_size
std::string::max_size() will tell you the theoretical limit imposed by the architecture your program is running under. Other than that, as long as you have sufficient RAM and/or disk swap space, you can have std::strings of huge size.
The answer to your second question is yes, you can send record by record; in fact, you may not be able to send big chunks of data over a socket at once, since there are limits on the size of a single send operation. That the size of a string is not known until runtime is not a problem: it doesn't need to be known at compile time to be sent over a socket. How to actually send the records one by one depends on which socket/networking library you are using; consult its documentation.
There is no official limit on the size of a string. The software will ask your system for memory and, as long as it gets it, it will be able to add characters to your string.
The rest of your question is not clear.
The only practical limit on string size in C++ is your available memory. That said, it will be expensive to keep reallocating the string as data arrives (assuming you don't know the total size in advance). Normally you would read chunks of the data into a fixed-size buffer and decode it into its natural shape (your records) as you receive it, as in the sketch below.
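A sketch of that pattern, assuming POSIX sockets and newline-delimited records (adapt the framing to whatever your protocol actually uses):

#include <string>
#include <vector>
#include <sys/types.h>
#include <sys/socket.h>

// Read '\n'-delimited records from a connected socket through a
// fixed-size buffer, instead of accumulating one giant string.
std::vector<std::string> read_records(int fd) {
    std::vector<std::string> records;
    std::string partial;  // holds an incomplete record between reads
    char buf[4096];
    for (ssize_t n; (n = recv(fd, buf, sizeof buf, 0)) > 0; ) {
        partial.append(buf, static_cast<std::size_t>(n));
        for (std::size_t pos; (pos = partial.find('\n')) != std::string::npos; ) {
            records.push_back(partial.substr(0, pos));
            partial.erase(0, pos + 1);
        }
    }
    return records;
}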
The size of a string is limited only by the amount of memory available to the program; it is more of an operating-system limitation than a C++ limitation. C-style strings are null-terminated, so the string routines will happily process extremely long strings until they find a null.
On Win32 the maximum amount of memory available for data is normally around 2 Gigs.
You can read arbitrarily large amounts of data from a socket, but you must have some way of delimiting the records you are reading: either an end-of-record marker or a length field associated with each record, which you then use to parse them. Do you really want to read the data into a single string? What happens if you don't have enough free memory to hold it all in RAM? I suspect there is a more efficient way to handle this data, but I don't know enough about the problem.
In theory, no. But don't go allocating 100 GB of memory, because the user probably won't have that much RAM. If you are using std::string, the upper bound is given by std::string::max_size().
If we are talking about char*, you are limited to roughly 2^32 bytes on 32-bit systems and 2^64 on (surprise) 64-bit ones.
Update: This is wrong. See comments.
How about sending them with a length-prefixed format?
In your server:

uint32_t len = htonl(strlen(szRecordServer));  // length prefix, network byte order
send(sock, &len, sizeof len, 0);
send(sock, szRecordServer, strlen(szRecordServer), 0);

In your client:

uint32_t len;
recv(sock, &len, sizeof len, 0);     // a real client should loop until all bytes arrive
len = ntohl(len);
char* szRecordClient = (char*)malloc(len + 1);
recv(sock, szRecordClient, len, 0);
szRecordClient[len] = '\0';

and repeat this a million times.