BigQuery Storage Read API with C++ Deserializing Data

I'm trying to implement the "Download table data in the Avro data format" method from this example,
but I don't know how to implement the deserialization step:
namespace {
void ProcessRowsInAvroFormat(
::google::cloud::bigquery::storage::v1::AvroSchema const&,
::google::cloud::bigquery::storage::v1::AvroRows const&) {
// Code to deserialize avro rows should be added here.
}
} // namespace
I installed the Apache Avro C++ library and wrote the following code:
bool bq::ReadSessionFromSchema(std::string project_id, std::string dataset_id, std::string table_id)
try
{
auto table_name = "projects/" + project_id + "/datasets/" + dataset_id + "/tables/" + table_id;
int max_stream_count = 1;
google::cloud::bigquery::storage::v1::ReadSession read_session;
read_session.set_table(table_name);
read_session.set_data_format(google::cloud::bigquery::storage::v1::DataFormat::AVRO);
read_session.mutable_read_options()->set_row_restriction(R"(state_name = "Kentucky")");
auto session = read_client->CreateReadSession("projects/" + project_id, read_session, max_stream_count);
if (!session)
{
std::cerr << session.status() << "\n";
return false;
}
std::cout << "ReadSession successfully created: " << session->name()
<< ".\n";
constexpr int kRowOffset = 0;
auto read_rows = read_client->ReadRows(session->streams(0).name(), kRowOffset);
std::int64_t num_rows = 0;
for (auto const &row : read_rows)
{
if (row.ok())
{
num_rows += row->row_count();
std::cout << row->row_count() << std::endl;
[](::google::cloud::bigquery::storage::v1::AvroSchema const &schema,
::google::cloud::bigquery::storage::v1::AvroRows const &rows)
{
auto vs = avro::compileJsonSchemaFromString(schema.schema());
std::unique_ptr<avro::InputStream> in = avro::memoryInputStream((uint8_t *)(rows.serialized_binary_rows().data()), rows.serialized_binary_rows().size());
avro::DecoderPtr d = avro::validatingDecoder(vs, avro::binaryDecoder());
avro::GenericDatum datum(vs);
d->init(*in);
avro::decode(*d, datum);
if (datum.type() == avro::AVRO_RECORD)
{
const avro::GenericRecord &r = datum.value<avro::GenericRecord>();
std::cout << "Field-count: " << r.fieldCount() << std::endl;
for (auto i = 0; i < r.fieldCount(); i++)
{
const avro::GenericDatum &f0 = r.fieldAt(i);
if (f0.type() == avro::AVRO_STRING)
{
std::cout << "string: " << f0.value<std::string>() << std::endl;
}
else if (f0.type() == avro::AVRO_INT)
{
std::cout << "int: " << f0.value<int>() << std::endl;
}
else if (f0.type() == avro::AVRO_LONG)
{
std::cout << "long: " << f0.value<long>() << std::endl;
}
else
{
std::cout << f0.type() << std::endl;
}
}
}
}(session->avro_schema(), row->avro_rows());
}
}
std::cout << num_rows << " rows read from table: " << table_name << "\n";
return true;
}
catch (google::cloud::Status const &status)
{
std::cerr << "google::cloud::Status thrown: " << status << "\n";
return false;
}
The BigQuery session returns 3 chunks containing 41, 27, and 10 rows respectively.
With this code, however, I can only print the first row of each chunk.
I want to print every row that I receive from the session.
The original data is here; I copied this table into my own project.
Expected result (78 rows):
project id: MY_PROJECT_ID
ReadSession successfully created: projects/MY_PROJECT_ID/locations/asia-northeast3/sessions/<MY_SESSION_ID>.
row count: 41
Field-count: 25
string: 21083
string: Graves County
string: Kentucky
long: 32460
long: 227
null
null
null
null
null
null
null
null
null
null
null
null
null
null
null
null
null
null
null
null
string: 21221
string: Trigg County
string: Kentucky
long: 17300
long: 19
null
null
null
null
null
null
null
null
null
null
null
null
null
null
null
null
null
null
null
null
........<More rows>
........<More rows>
row count: 27
Field-count: 25
string: 21013
string: Bell County
string: Kentucky
long: 33180
long: 223
long: 0
long: 0
long: 0
long: 3
long: 30
long: 26
long: 0
long: 4
long: 0
long: 0
long: 10
long: 0
long: 0
long: 0
long: 0
long: 0
long: 0
long: 0
long: 0
long: 1
........<More rows>
........<More rows>
row count: 10
Field-count: 25
string: 21015
string: Boone County
string: Kentucky
long: 17140
long: 187
long: 0
long: 0
long: 0
long: 0
long: 51
long: 0
long: 0
long: 18
long: 0
long: 0
long: 0
long: 0
long: 0
long: 0
long: 62
long: 0
long: 0
long: 0
long: 16
long: 12
........<More rows>
........<More rows>
78 rows read from table: projects/MY_PROJECT_ID/datasets/MY_DATASET_ID/tables/covid_19
Actual result (only 3 rows):
project id: MY_PROJECT_ID
ReadSession successfully created: projects/MY_PROJECT_ID/locations/asia-northeast3/sessions/<MY_SESSION_ID>.
row count: 41
Field-count: 25
string: 21083
string: Graves County
string: Kentucky
long: 32460
long: 227
null
null
null
null
null
null
null
null
null
null
null
null
null
null
null
null
null
null
null
null
row count: 27
Field-count: 25
string: 21013
string: Bell County
string: Kentucky
long: 33180
long: 223
long: 0
long: 0
long: 0
long: 3
long: 30
long: 26
long: 0
long: 4
long: 0
long: 0
long: 10
long: 0
long: 0
long: 0
long: 0
long: 0
long: 0
long: 0
long: 0
long: 1
row count: 10
Field-count: 25
string: 21015
string: Boone County
string: Kentucky
long: 17140
long: 187
long: 0
long: 0
long: 0
long: 0
long: 51
long: 0
long: 0
long: 18
long: 0
long: 0
long: 0
long: 0
long: 0
long: 0
long: 62
long: 0
long: 0
long: 0
long: 16
long: 12
78 rows read from table: projects/MY_PROJECT_ID/datasets/MY_DATASET_ID/tables/covid_19

After spending a lot of time experimenting, I found the working code below.
Use avro::GenericReader::read() to load data from the avro::InputStream sequentially.
After parsing a row, call avro::GenericReader::drain() to discard the current row and read the next one from the avro::InputStreamPtr.
bool bq::ReadSessionFromSchema(std::string project_id, std::string dataset_id, std::string table_id)
try
{
auto table_name = "projects/" + project_id + "/datasets/" + dataset_id + "/tables/" + table_id;
int max_stream_count = 1;
google::cloud::bigquery::storage::v1::ReadSession read_session;
read_session.set_table(table_name);
read_session.set_data_format(google::cloud::bigquery::storage::v1::DataFormat::AVRO);
read_session.mutable_read_options()->set_row_restriction(R"(state_name = "Kentucky")");
auto session = read_client->CreateReadSession("projects/" + project_id, read_session, max_stream_count);
if (!session)
{
std::cerr << session.status() << "\n";
return false;
}
std::cout << "ReadSession successfully created: " << session->name()
<< ".\n";
constexpr int kRowOffset = 0;
auto read_rows = read_client->ReadRows(session->streams(0).name(), kRowOffset);
std::int64_t num_rows = 0;
for (auto const &row : read_rows)
{
if (row.ok())
{
num_rows += row->row_count();
std::cout << "row count: " << row->row_count() << std::endl;
[](::google::cloud::bigquery::storage::v1::AvroSchema const &schema,
::google::cloud::bigquery::storage::v1::AvroRows const &rows,
int64_t count)
{
const avro::ValidSchema vs = avro::compileJsonSchemaFromString(schema.schema());
std::istringstream iss(rows.serialized_binary_rows(), std::ios::binary);
std::unique_ptr<avro::InputStream> in = avro::istreamInputStream(iss);
avro::DecoderPtr d = avro::validatingDecoder(vs, avro::binaryDecoder());
avro::GenericReader gr(vs, d);
d->init(*in);
avro::GenericDatum datum(vs);
for (auto i = 0; i < count; i++)
{
gr.read(*d, datum, vs);
if (datum.type() == avro::AVRO_RECORD)
{
const avro::GenericRecord &r = datum.value<avro::GenericRecord>();
std::cout << "Field-count: " << r.fieldCount() << std::endl;
for (auto i = 0; i < r.fieldCount(); i++)
{
const avro::GenericDatum &f0 = r.fieldAt(i);
if (f0.type() == avro::AVRO_STRING)
{
std::cout << "string: " << f0.value<std::string>() << std::endl;
}
else if (f0.type() == avro::AVRO_INT)
{
std::cout << "int: " << f0.value<int>() << std::endl;
}
else if (f0.type() == avro::AVRO_LONG)
{
std::cout << "long: " << f0.value<int64_t>() << std::endl; // Avro long is 64-bit; plain long is not on every platform
}
else
{
std::cout << f0.type() << std::endl;
}
}
}
gr.drain();
}
}(session->avro_schema(), row->avro_rows(), row->row_count());
}
}
std::cout << num_rows << " rows read from table: " << table_name << "\n";
return true;
}
catch (google::cloud::Status const &status)
{
std::cerr << "google::cloud::Status thrown: " << status << "\n";
return false;
}
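To see why one decode call per chunk yields only one row: as far as I understand the Avro binary encoding, records are written back to back with no delimiter, so the decoder has to be invoked row_count() times on the same stream. The following is a minimal hand-rolled sketch of that idea, not the Avro library and not the code above. It assumes a hypothetical two-field schema (a string and a long), and Row, ReadLong, ReadString and DecodeRows are names invented for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical two-field row used only for this sketch: {string name; long count}.
struct Row {
    std::string name;
    std::int64_t count;
};

// Reads one Avro "long": a zig-zag value stored as a base-128 varint,
// least significant group first.
std::int64_t ReadLong(const std::string& buf, std::size_t& pos) {
    std::uint64_t raw = 0;
    int shift = 0;
    while (true) {
        assert(pos < buf.size() && shift < 64);
        const std::uint8_t byte = static_cast<std::uint8_t>(buf[pos++]);
        raw |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) break;  // high bit clear: last group
        shift += 7;
    }
    // Undo zig-zag: 0, 1, 2, 3, ... map back to 0, -1, 1, -2, ...
    return static_cast<std::int64_t>(raw >> 1) ^ -static_cast<std::int64_t>(raw & 1);
}

// Reads one Avro "string": a long byte length followed by that many bytes.
std::string ReadString(const std::string& buf, std::size_t& pos) {
    const std::int64_t len = ReadLong(buf, pos);
    assert(len >= 0 && pos + static_cast<std::size_t>(len) <= buf.size());
    std::string out = buf.substr(pos, static_cast<std::size_t>(len));
    pos += static_cast<std::size_t>(len);
    return out;
}

// Decodes row_count records stored back to back in one buffer.
// The position carries over between rows; nothing marks a row boundary.
std::vector<Row> DecodeRows(const std::string& buf, std::int64_t row_count) {
    std::vector<Row> rows;
    std::size_t pos = 0;
    for (std::int64_t i = 0; i < row_count; ++i) {
        Row r;
        r.name = ReadString(buf, pos);
        r.count = ReadLong(buf, pos);
        rows.push_back(r);
    }
    return rows;
}
```

Calling DecodeRows(buffer, 1) on a buffer that holds several rows returns only the first one, which matches the symptom in the question.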

Related

Why is my vector of type <bool> only storing 1s?

I'm trying to translate a string into a Huffman encoding using a Huffman map that is stored in an unordered_map<char, string>. As you can see, the encoding prints correctly while I am pushing it in, but after I push_back() the encoding into result and then print it out, it shows up as all 1s instead of 1s and 0s. Any ideas?
I have tried printing it out and decoding.
vector<bool> result{};
for (auto i : huffmanMap)
{
cout << i.first << ": " << i.second << "*" << endl;
}
for (auto word: text)
{
cout << word << endl;
for (auto ch : word)
{
cout << "char: " << ch << " encoding: "<< huffmanMap[ch] <<
endl;
for (auto binarych: huffmanMap[ch])
{
cout << "binarych " << binarych << endl;
result.push_back(binarych);
}
}
}
for (auto next: result)
{
cout << next;
}
cout << endl;
return result;
a: 0*
c: 10*
b: 11*
aaabbc
char: a encoding: 0
binarych 0
char: a encoding: 0
binarych 0
char: a encoding: 0
binarych 0
char: b encoding: 11
binarych 1
binarych 1
char: b encoding: 11
binarych 1
binarych 1
binarych 1
binarych 1
char: c encoding: 10
binarych 1
binarych 0
111111111

You're confusing digits and numbers.
'0' and '1' are not 0 and 1, and both '0' and '1' convert to true.
You can convert a digit to the number it represents by subtracting '0' from it:
result.push_back(binarych - '0');
C++ requires the digits '0' through '9' to be encoded in increasing order without "gaps", so this subtraction works in any character set.
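As a small sketch of the fix in isolation (AppendBits is a hypothetical helper name, not part of the question's code), comparing against '1' makes the intent explicit and gives the same result as subtracting '0' for strings made only of '0' and '1':

```cpp
#include <cassert>
#include <string>
#include <vector>

// Appends the bits of a string of '0'/'1' characters to `out`.
// Pushing the char itself would store true for both, since '0' and '1'
// are both nonzero character codes.
void AppendBits(const std::string& code, std::vector<bool>& out) {
    for (char digit : code) {
        assert(digit == '0' || digit == '1');
        out.push_back(digit == '1');
    }
}
```

With the map from the question (a: 0, b: 11, c: 10), appending the codes for "abc" produces 0, 1, 1, 1, 0.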

copy member variable into byte vector

I want to copy a 64-bit member variable into a vector byte by byte.
Please don't tell me to use bit operations to extract each byte and then copy them into the vector.
I want to do this in one line.
I tried memcpy and std::copy, but both failed.
Here is the sample code:
#include <iostream>
#include <vector>
#include <cstdint>
#include <cstring>
using namespace std;
class A {
public:
A()
: eight_bytes_data(0x1234567812345678) {
}
void Test() {
vector<uint8_t> data_out;
data_out.reserve(8);
memcpy(data_out.data(),
&eight_bytes_data,
8);
cerr << "[Test]" << data_out.size() << endl;
}
void Test2() {
vector<uint8_t> data_out;
data_out.reserve(8);
copy(&eight_bytes_data,
(&eight_bytes_data) + 8,
back_inserter(data_out));
cerr << "[Test2]" << data_out.size() << endl;
for (auto value : data_out) {
cerr << hex << value << endl;
}
}
private:
uint64_t eight_bytes_data;
};
int main() {
A a;
a.Test();
a.Test2();
return 0;
}

Since others have already shown where you went wrong, here is a one-line solution, though it is dangerous.
First you need to make sure that your vector is large enough to receive 8 bytes. Something like this:
data_out.resize(8);
Then you can use reinterpret_cast to force the compiler to treat those 8 bytes of the vector as a single 8-byte value, and do the copy:
*(reinterpret_cast<uint64_t*>(data_out.data())) = eight_bytes_data;
I can't foresee everything that could go wrong, so use it at your own risk.
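For comparison, here is a sketch that avoids the pointer cast altogether. The original Test() fails because reserve() only changes the capacity, so size() stays 0. Giving the vector its size first and then calling memcpy is still a one-line copy. ToBytes and FromBytes are hypothetical helper names used only for this illustration, and the bytes come out in the host's byte order.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Returns the 8 bytes of `value` in the host's byte order.
std::vector<std::uint8_t> ToBytes(std::uint64_t value) {
    std::vector<std::uint8_t> bytes(sizeof value);    // size() is 8, not just capacity
    std::memcpy(bytes.data(), &value, sizeof value);  // the one-line copy
    return bytes;
}

// Rebuilds the value from bytes produced by ToBytes on the same machine.
std::uint64_t FromBytes(const std::vector<std::uint8_t>& bytes) {
    assert(bytes.size() == sizeof(std::uint64_t));
    std::uint64_t value = 0;
    std::memcpy(&value, bytes.data(), sizeof value);
    return value;
}
```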

If you want to work with the bytes of an object of another type, you can use a char* to access each byte:
void Test3()
{
vector<uint8_t> data_out;
char* pbyte = (char*)&eight_bytes_data;
for(int i = 0; i < sizeof(eight_bytes_data); ++i)
{
data_out.push_back(pbyte[i]);
}
cerr << "[Test3]" << data_out.size() << endl;
}
Unfortunately, you requested a one-line-solution, which I don't think is viable.

If you are interested in a more generic version:
namespace detail
{
template<typename Byte, typename T>
struct Impl
{
static std::vector<Byte> impl(const T& data)
{
std::vector<Byte> bytes;
bytes.resize(sizeof(T)/sizeof(Byte));
*(T*)bytes.data() = data;
return bytes;
}
};
template<typename T>
struct Impl<bool, T>
{
static std::vector<bool> impl(const T& data)
{
std::bitset<sizeof(T)*8> bits(data);
std::string string = bits.to_string();
std::vector<bool> vector;
for(const auto& x : string)
vector.push_back(x - '0');
return vector;
}
};
}
template<typename Byte = uint8_t,
typename T>
std::vector<Byte> data_to_bytes(const T& data)
{
return detail::Impl<Byte,T>::impl(data);
}
int main()
{
uint64_t test = 0x1111222233334444ull;
for(auto x : data_to_bytes<bool>(test))
std::cout << std::hex << uintmax_t(x) << " ";
std::cout << std::endl << std::endl;
for(auto x : data_to_bytes(test))
std::cout << std::hex << uintmax_t(x) << " ";
std::cout << std::endl << std::endl;
for(auto x : data_to_bytes<uint16_t>(test))
std::cout << std::hex << uintmax_t(x) << " ";
std::cout << std::endl << std::endl;
for(auto x : data_to_bytes<uint32_t>(test))
std::cout << std::hex << uintmax_t(x) << " ";
std::cout << std::endl << std::endl;
for(auto x : data_to_bytes<uint64_t>(test))
std::cout << std::hex << uintmax_t(x) << " ";
std::cout << std::endl << std::endl;
}
Output:
0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0
44 44 33 33 22 22 11 11
4444 3333 2222 1111
33334444 11112222
1111222233334444

WideCharToMultiByte doesn't work in Wine

I'm trying to use WideCharToMultiByte in order to convert std::wstring to utf8 std::string. Here is my code:
const std::wstring & utf16("lorem ipsum"); // input
if (utf16.empty()) {
return "";
}
cout << "wstring -> string, input: , size: " << utf16.size() << endl;
for (size_t i = 0; i < utf16.size(); ++i) {
cout << i << ": " << static_cast<int>(utf16[i]) << endl;
}
for (size_t i = 0; i < utf16.size(); ++i) {
wcout << static_cast<wchar_t>(utf16[i]);
}
cout << endl;
std::string res;
int required_size = 0;
if ((required_size = WideCharToMultiByte(
CP_UTF8,
0,
utf16.c_str(),
utf16.size(),
nullptr,
0,
nullptr,
nullptr
)) == 0) {
throw std::invalid_argument("Cannot convert.");
}
cout << "required size: " << required_size << endl;
res.resize(required_size);
if (WideCharToMultiByte(
CP_UTF8,
0,
utf16.c_str(),
utf16.size(),
&res[0],
res.size(),
nullptr,
nullptr
) == 0) {
throw std::invalid_argument("Cannot convert.");
}
cout << "Result: " << res << ", size: " << res.size() << endl;
for (size_t i = 0; i < res.size(); ++i) {
cout << i << ": " << (int)static_cast<uint8_t>(res[i]) << endl;
}
exit(1);
return res;
It runs OK, no exceptions, no error. Only the result is wrong. Here is output from running the code:
wstring -> string, input: , size: 11
0: 108
1: 111
2: 114
3: 101
4: 109
5: 32
6: 105
7: 112
8: 115
9: 117
10: 109
lorem ipsum
required size: 11
Result: lorem , size: 11
0: 108
1: 0
2: 111
3: 0
4: 114
5: 0
6: 101
7: 0
8: 109
9: 0
10: 32
I don't understand why there are null bytes in the result. What am I doing wrong?

Summarizing from comments:
Your code is correct as far as the WideCharToMultiByte logic and arguments go; the only actual problem is the initialization of utf16, which needs to be initialized with a wide literal. The code gives the expected results with both VC++ 2015 RTM and Update 1, so this is a bug in the WideCharToMultiByte emulation layer you're using.
That said, from C++11 onwards there is a portable solution you should prefer when possible: std::wstring_convert in conjunction with std::codecvt_utf8_utf16 (note that both were deprecated in C++17):
#include <cstddef>
#include <string>
#include <locale>
#include <codecvt>
#include <iostream>
std::string test(std::wstring const& utf16)
{
std::wcout << L"wstring -> string, input: " << utf16 << L", size: " << utf16.size() << L'\n';
for (std::size_t i{}; i != utf16.size(); ++i)
std::wcout << i << L": " << static_cast<int>(utf16[i]) << L'\n';
for (std::size_t i{}; i != utf16.size(); ++i)
std::wcout << utf16[i];
std::wcout << L'\n';
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> cvt;
std::string res = cvt.to_bytes(utf16);
std::wcout << L"Result: " << res.c_str() << L", size: " << res.size() << L'\n';
for (std::size_t i{}; i != res.size(); ++i)
std::wcout << i << L": " << static_cast<int>(res[i]) << L'\n';
return res;
}
int main()
{
test(L"lorem ipsum");
}

Unexpected big endian conversion output

I am using libflac and I need to convert my data from little endian to big endian. However, in one of my test programs I am not getting what I expect. I am using g++.
#include <iostream>
#include <cstring> // memcpy
#include <stdlib.h>
#include <stdio.h>
int main() {
unsigned char transform[4];
unsigned char output[4];
unsigned char temp;
int normal = 24000;
memcpy(output, &normal, 4);
std::cout << (int)output[0] << " " << (int)output[1] << " " << (int)output[2] << " " << (int)output[3] << "\n";
//FLAC__int32 big_endian;
int big_endian;
short allo = 24000;
memcpy(transform, &allo, 2); // transform[0], transform[1]
std::cout << (int)transform[0] << " " << (int)transform[1] << "\n";
//big_endian = (FLAC__int32)(((FLAC__int16)(FLAC__int8)transform[1] << 8) | (FLAC__int16)transform[0]); // whaaat, doesn't work...
big_endian = transform[1] << 8 | transform[0]; // this also give 192 93 0 0 uh?
memcpy(output, &big_endian, 4);
std::cout << (int)output[0] << " " << (int)output[1] << " " << (int)output[2] << " " << (int)output[3] << "\n";
// 192 93 0 0 uh?
// this one works
transform[3] = transform[0];
transform[2] = transform[1];
transform[0] = 0;
transform[1] = 0;
memcpy(&big_endian, transform, 4);
memcpy(output, &big_endian, 4);
std::cout << (int)output[0] << " " << (int)output[1] << " " << (int)output[2] << " " << (int)output[3] << "\n";
// 0 0 93 192 (binary)93 << 8 | (binary)192 = 24000
return 0;
}
output:
192 93 0 0
192 93
192 93 0 0
0 0 93 192
When I do
big_endian = transform[1] << 8 | transform[0];
I'd expect to see 93 192 0 0 or 0 0 93 192, what's going on?

The problem is in this line
big_endian = transform[1] << 8 | transform[0];
transform[0] holds the least significant byte, because the value is stored in little-endian order. When you do transform[1] << 8 | transform[0], you put it back in the least significant position, so it doesn't move anywhere and is still the lowest byte. The same goes for transform[1]: it is the second byte, and it is still the second byte after shifting.
Use this instead:
big_endian = transform[0] << 8 | transform[1];
or
big_endian = transform[0] << 24 | transform[1] << 16 | transform[2] << 8 | transform[3];
But why not just write a function for endian conversion?
unsigned int convert_endian(unsigned int n)
{
return (n << 24) | ((n & 0xFF00) << 8) | ((n & 0xFF0000) >> 8) | (n >> 24);
}
or use the htonl/htons functions (host to network byte order, which is big endian; ntohl/ntohs convert back), which are already available on most operating systems.
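A sketch of both ideas with fixed-width types, checked against the value from the question (24000 is 0x5DC0, so its big-endian bytes are 0 0 93 192). SwapBytes32 and ToBigEndian are hypothetical names. ToBigEndian works from the numeric value with shifts, so it produces the same bytes whatever the host's endianness.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Reverses the byte order of a 32-bit value.
std::uint32_t SwapBytes32(std::uint32_t n) {
    return (n << 24) | ((n & 0x0000FF00u) << 8) |
           ((n & 0x00FF0000u) >> 8) | (n >> 24);
}

// Writes `value` most significant byte first, independent of host byte order.
std::array<std::uint8_t, 4> ToBigEndian(std::uint32_t value) {
    return {{static_cast<std::uint8_t>(value >> 24),
             static_cast<std::uint8_t>(value >> 16),
             static_cast<std::uint8_t>(value >> 8),
             static_cast<std::uint8_t>(value)}};
}
```

Using unsigned fixed-width types also avoids shifting a promoted int into its sign bit, which can happen with transform[0] << 24 when the byte is 128 or larger.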

Why is this double value printed as "-0"?

double a = 0;
double b = -42;
double result = a * b;
cout << result;
The result of a * b is -0, but I expected 0. Where did I go wrong?

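The bit representations of -0.0 and 0.0 are different, but they compare equal, so -0.0 == 0.0 evaluates to true. In your case, result is -0.0 because exactly one of the operands is negative: the sign of a floating-point product is determined by the signs of the operands, even when the magnitude is zero.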
See this demo:
#include <iostream>
#include <iomanip>
void print_bytes(char const *name, double d)
{
unsigned char *pd = reinterpret_cast<unsigned char*>(&d);
std::cout << name << " = " << std::setw(2) << d << " => ";
for(int i = 0 ; i < sizeof(d) ; ++i)
std::cout << std::setw(-3) << (unsigned)pd[i] << " ";
std::cout << std::endl;
}
#define print_bytes_of(a) print_bytes(#a, a)
int main()
{
double a = 0.0;
double b = -0.0;
std::cout << "Value comparison" << std::endl;
std::cout << "(a==b) => " << (a==b) <<std::endl;
std::cout << "(a!=b) => " << (a!=b) <<std::endl;
std::cout << "\nValue representation" << std::endl;
print_bytes_of(a);
print_bytes_of(b);
}
Output:
Value comparison
(a==b) => 1
(a!=b) => 0
Value representation
a = 0 => 0 0 0 0 0 0 0 0
b = -0 => 0 0 0 0 0 0 0 128
As you can see, the last byte of -0.0 differs from the last byte of 0.0: on a little-endian machine that byte holds the sign bit.
Hope that helps.
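If the goal is simply to avoid printing "-0", one possible sketch is to detect the sign with std::signbit and map negative zero to positive zero before output. IsNegativeZero and NormalizeZero are hypothetical helper names, not standard functions.

```cpp
#include <cassert>
#include <cmath>

// True only for negative zero; == cannot tell the two zeros apart.
bool IsNegativeZero(double x) {
    return x == 0.0 && std::signbit(x);
}

// Maps -0.0 to +0.0 and leaves every other value unchanged.
double NormalizeZero(double x) {
    return x == 0.0 ? 0.0 : x;
}
```

With the values from the question, NormalizeZero(a * b) would then print as 0.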
