Reading large schemas with GenericData from Avro in C++

This question is in reference to: How to read data from AVRO file using C++ interface?
int main(int argc, char**argv)
{
std::cout << "AVRO Test\n" << std::endl;
if (argc < 2)
{
std::cerr << BOLD << RED << "ERROR: " << ENDC << "please provide an "
<< "input file\n" << std::endl;
return -1;
}
avro::DataFileReader<avro::GenericDatum> reader(argv[1]);
auto dataSchema = reader.dataSchema();
// Write out data schema in JSON for grins
std::ofstream output("data_schema.json");
dataSchema.toJson(output);
output.close();
avro::GenericDatum datum(dataSchema);
while (reader.read(datum))
{
std::cout << "Type: " << datum.type() << std::endl;
if (datum.type() == avro::AVRO_RECORD)
{
const avro::GenericRecord& r = datum.value<avro::GenericRecord>();
std::cout << "Field-count: " << r.fieldCount() << std::endl;
// TODO: pull out each field
}
}
return 0;
}
I used this code, but I keep getting a segfault at the while loop. I have a very large schema and a large amount of data. Decoding the data piece by piece, as Avro's "cpx" example does, is not practical; I need a generic way of reading. The segfault consistently occurs on the third pass through the loop, and read() returns no error code beforehand. I am open to any and all suggestions and ideas about reading large schemas in Avro.

As it turns out, there is an open ticket in the Avro issue tracker for this exact problem: https://issues.apache.org/jira/browse/AVRO-3194

Related

Having trouble opening a json file in C++

I am trying to open a json file that I will be working with in C++. Code that I have used successfully before fails to open the file. I am using Visual Studio 2017 on Windows 10 Pro with JSON for Modern C++ version 3.5.0.
I have a very simple function, which is supposed to open a file as input to a json object. It appears to open the file, but aborts when writing it to the json object. Originally the file to be opened was in another directory, but I moved it into the same directory as the executable while testing...but it didn't help.
Here is the very short function that fails:
json baselineOpenAndRead(string fileName) //passed string used for filename
{
json baseJObject;
cout << "we have a baseJObject" << endl;
//ifstream inFileJSON("test_file.json"); // Making this explicit made no difference
ifstream inFileJSON;
inFileJSON.open("test_file.json", ifstream::in);
cout << "we have opened json inFileJSON" << endl; // get here
inFileJSON >> baseJObject;
cout << " Can direct inFileJSON into baseJObject" << endl; //never get here; the app aborts.
inFileJSON.close();
return baseJObject;
}
This seems basically identical to the example on the nlohmann site:
// read a JSON file
std::ifstream i("file.json");
json j;
i >> j;
I just expected this to open the json file, load it into the object, and return the object. Instead, it just quits.
Thanks for any thoughts...i.e., what am I doing wrong? (I'm going to ignore that it worked before...maybe I missed something).
--Al
As requested, here is a minimal reproducible example, but it will require nlohmann's json.hpp in order to compile:
#include <iostream>
#include <fstream>
#include "json.hpp"
using json = nlohmann::json;
using namespace std;
string fileName;
json baselineOpenAndRead(string);
int main(int argC, char *argV[])
{
json baseJObject;
if (argC != 2) // check to make sure proper number of arguments are given.
{
cout << "\n\nFilename needed...";
exit(1); // number of arguments is wrong - exit program
}
else
{
fileName = argV[1];
baseJObject = baselineOpenAndRead(fileName); // opens and reads the Base Line JSON file
cout << "baseJObject returned" << endl;
}
return 0;
}
json baselineOpenAndRead(string fileName) //
{
cout << "File name: " << fileName << endl;
json baseJObject;
cout << "we have a baseJObject" << endl;
ifstream inFileJSON(fileName);
if (inFileJSON.is_open())
{
cout << "file open..." << endl;
if (nlohmann::json::accept(inFileJSON))
{
cout << "valid json" << endl;
try { inFileJSON >> baseJObject; }
catch (const std::exception &e) { std::cout << e.what() << '\n'; throw; }
}
else
{
cout << "not valid json" << endl;
}
}
else
{
cout << "file not really open" << endl;
}
inFileJSON >> baseJObject;
cout << " We can echo inFileJSON into baseJObject" << endl;
inFileJSON.close();
return baseJObject;
}
I tested it with this json file:
{
"people": [{
"name": "Scott",
"website": "stackabuse.com",
"from": "Nebraska"
},
{
"name": "Larry",
"website": "google.com",
"from": "Michigan"
},
{
"name": "Tim",
"website": "apple.com",
"from": "Alabama"
}
]
}
When I run this passing it the json above as data.json, I get the following output and then it quits:
./Test_json data.json
File name: data.json
we have a baseJObject
file open...
valid json
[json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{', or a literal
Without the try, it just quits. It never gets past inFileJSON >> baseJObject;
Another try that seems to work, but why?
OK. I tried this with the same main (the only changes are in the function):
json baselineOpenAndRead(string fileName) //
{
json baseJObject;
string filePath = "../baselines/" + fileName;
cout << "filePath: " << filePath << endl;
ifstream inFileJSON(fileName);
//baseJObject = json::parse(inFileJSON);
inFileJSON >> baseJObject;
cout << baseJObject << std::endl;
return baseJObject;
}
This looks basically the same to me. I tried making it ifstream inFileJSON(fileName.c_str()) on both the original and in this one. The original continued to fail, this one continued to work. Sorry this is getting so long, but I can't get decent formatting out of comments... Should I just try answering my own question instead?
I think I've got this. I believe my initial problem was caused by an errant ',' in one of my json test files. Subsequently, the if (inFileJSON.is_open()) check worked, but the if (nlohmann::json::accept(inFileJSON)) check was causing the same (or perhaps a similar) error: accept() reads the stream to the end while validating it, so the inFileJSON >> baseJObject that follows starts at end-of-file, which matches the "unexpected end of input" message above. I thought that I needed the c_str() for file paths outside of the executable's directory, but it doesn't seem to make a difference one way or the other. I took out the accept(), and this code seems to work consistently:
json baselineOpenAndRead(string fileName) //
{
json baseJObject;
cout << "we have a baseJObject" << endl;
string filePath = "../baselines/" + fileName;
cout << "filePath: " << filePath << endl;
//ifstream inFileJSON(filePath.c_str());
ifstream inFileJSON(filePath);
if (inFileJSON.is_open())
{
cout << "File is open." << endl;
inFileJSON >> baseJObject;
cout << baseJObject << std::endl;
inFileJSON.close();
return baseJObject;
}
else
{
cout << "File not open." << endl;
exit(1);
}
}
Thanks to everyone for your help. I appreciate it.
--Al
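For anyone who wants to keep a validation pass before parsing, the stream has to be rewound in between. Below is a minimal sketch using only the standard library. It does not use nlohmann's json at all: consume_all() is a hypothetical stand-in for anything that reads a stream to the end (as accept() appears to do here), and rewind_stream() and demo_rewind() are likewise made-up helper names for illustration.

```cpp
#include <cassert>
#include <ios>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical stand-in for a validator or parser that reads its input to the end.
std::string consume_all(std::istream& in) {
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Clear any eofbit/failbit left by an earlier read and move the read
// position back to the beginning. Returns false if the seek failed
// (for example, on a stream that is not seekable).
bool rewind_stream(std::istream& in) {
    in.clear();
    in.seekg(0, std::ios::beg);
    return static_cast<bool>(in);
}

// Small self-check: a second read sees nothing until the stream is rewound.
void demo_rewind() {
    std::istringstream in("{\"people\": []}");
    assert(consume_all(in) == "{\"people\": []}");  // first pass sees everything
    assert(consume_all(in).empty());                // second pass starts at the end
    assert(rewind_stream(in));
    assert(consume_all(in) == "{\"people\": []}");  // readable again after rewinding
}
```

The same clear() and seekg(0) pair should work on an ifstream between a validation pass and inFileJSON >> baseJObject, assuming the file is an ordinary seekable file. Dropping the separate validation pass and catching the parse exception, as in the final version above, avoids reading the file twice.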

How to read data from AVRO file using C++ interface?

I'm attempting to write a simple program to extract some data from a bunch of AVRO files. The schema for each file may be different so I would like to read the files generically (i.e. without having to pregenerate and then compile in the schema for each) using the C++ interface.
I have been attempting to follow the generic.cc example but it assumes a separate schema where I would like to read the schema from each AVRO file.
Here is my code:
#include <fstream>
#include <iostream>
#include "Compiler.hh"
#include "DataFile.hh"
#include "Decoder.hh"
#include "Generic.hh"
#include "Stream.hh"
const std::string BOLD("\033[1m");
const std::string ENDC("\033[0m");
const std::string RED("\033[31m");
const std::string YELLOW("\033[33m");
int main(int argc, char**argv)
{
std::cout << "AVRO Test\n" << std::endl;
if (argc < 2)
{
std::cerr << BOLD << RED << "ERROR: " << ENDC << "please provide an "
<< "input file\n" << std::endl;
return -1;
}
avro::DataFileReaderBase dataFile(argv[1]);
auto dataSchema = dataFile.dataSchema();
// Write out data schema in JSON for grins
std::ofstream output("data_schema.json");
dataSchema.toJson(output);
output.close();
avro::DecoderPtr decoder = avro::binaryDecoder();
auto inStream = avro::fileInputStream(argv[1]);
decoder->init(*inStream);
avro::GenericDatum datum(dataSchema);
avro::decode(*decoder, datum);
std::cout << "Type: " << datum.type() << std::endl;
return 0;
}
Every time I run the code, no matter what file I use, I get this:
$ ./avrotest twitter.avro
AVRO Test
terminate called after throwing an instance of 'avro::Exception'
what(): Cannot have negative length: -40
Aborted
In addition to my own data files, I have tried using the data files located here: https://github.com/miguno/avro-cli-examples, with the same result.
I tried using the avrocat utility on all of the same files and it works fine. What am I doing wrong?
(NOTE: outputting the data schema for each file in JSON works correctly as expected)
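After a bunch more fooling around, I figured it out. An Avro data file is a container with a header, embedded schema, and sync markers, so pointing a raw binaryDecoder at the start of the file decodes header bytes as if they were data. You're supposed to use DataFileReader templated with GenericDatum instead, with the end result being something like this: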
#include <fstream>
#include <iostream>
#include "Compiler.hh"
#include "DataFile.hh"
#include "Decoder.hh"
#include "Generic.hh"
#include "Stream.hh"
const std::string BOLD("\033[1m");
const std::string ENDC("\033[0m");
const std::string RED("\033[31m");
const std::string YELLOW("\033[33m");
int main(int argc, char**argv)
{
std::cout << "AVRO Test\n" << std::endl;
if (argc < 2)
{
std::cerr << BOLD << RED << "ERROR: " << ENDC << "please provide an "
<< "input file\n" << std::endl;
return -1;
}
avro::DataFileReader<avro::GenericDatum> reader(argv[1]);
auto dataSchema = reader.dataSchema();
// Write out data schema in JSON for grins
std::ofstream output("data_schema.json");
dataSchema.toJson(output);
output.close();
avro::GenericDatum datum(dataSchema);
while (reader.read(datum))
{
std::cout << "Type: " << datum.type() << std::endl;
if (datum.type() == avro::AVRO_RECORD)
{
const avro::GenericRecord& r = datum.value<avro::GenericRecord>();
std::cout << "Field-count: " << r.fieldCount() << std::endl;
// TODO: pull out each field
}
}
return 0;
}
Perhaps an example like this should be included with libavro...

libarchive returns error on some entries while 7z can extract normally

I'm having trouble with libarchive version 3.3.2. I wrote a program to read selected entries in 7z archives, that look like:
file.7z
|__ file.xml
|__ file.fog
|__ file_1.fog
However, the program failed to read file_1.fog for most of my archives, and failed to read file.fog for some. I used archive_error_string() to see what was happening, and the errors were one of "corrupted archive", "truncated RAR archive", or "Decompressing internal error".
Here's the trouble code:
void list_archive(string name) {
struct archive *a;
struct archive_entry *entry;
// create new archive struct for the file
a = archive_read_new();
archive_read_support_filter_all(a);
archive_read_support_format_all(a);
// open 7z file
int r = archive_read_open_filename(a, name.c_str(), 1024);
if (r != ARCHIVE_OK) {
cout << "cannot read file: " << name << endl;
cout << "read error: " << archive_error_string(a) << endl;
// nothing can be read from an archive that failed to open
archive_read_free(a);
return;
}
// looping through entries
for (;;) {
int status = archive_read_next_header(a, &entry);
// if there's no more header
if (status != ARCHIVE_OK) break;
// print some status messages to stdout
string pathname(archive_entry_pathname(entry));
cout << "working on: " << pathname << endl;
size_t entry_size = archive_entry_size(entry);
// load the entry's content
char * content;
content = (char*)malloc(entry_size);
r = archive_read_data(a, content, entry_size);
// check if archive_read_data was successful
if (r > 0) {
cout << "read " << r << " of " << entry_size << " bytes successfully\n";
// we are interested in .fog file only
if (pathname.back() == 'g') {
// do something with the .fog file
}
}
else // usually the error happens here
if (archive_errno(a) != ARCHIVE_OK) cout << "read error: " << archive_error_string(a) << endl;
// free the content and clear the entry
archive_read_data_skip(a);
free(content);
archive_entry_clear(entry);
cout << "-----" << endl;
}
// we are done with the current archive, free it
r = archive_read_free(a);
if (r != ARCHIVE_OK) {
// a has been released at this point, so it must not be passed to archive_error_string()
cout << "Failed to free archive object." << endl;
exit(1);
}
}
I found the culprit, and I am posting the answer here in case future users have the same problem.
int r = archive_read_open_filename(a, name.c_str(), 1024);
Apparently 1024 is too small for the buffer (block) size. I increased it to 102400 and was able to read/extract all my archives.
Be aware that, technically, the buffer size should not break functionality: a small buffer may reduce speed, but it should not make the operation fail. I therefore think the way libarchive processes these archives is not that reliable.

How to parse json data from websocket_client using cpprestsdk

I'm connecting to a WebSocket that always replies in JSON. I see there is an extract_string method for websocket_incoming_message; however, after trying numerous things with json::value, it seems as though you can only construct JSON arrays on the fly by inserting key-value pairs one by one. Am I missing something here, or is there a way to take the output from websocket_incoming_message and directly convert it into a json::value array?
websocket_client client;
//start socket connection to server
try {
std::cout << "s
----------
client.connect(U("wss://XZXXXZZy.com/ws?account_id=4de3f308f2f8d3247As70228f94e0d2aAea&ws_key=reception")).wait();
}
catch (const std::exception&e)
{
std::cout << e.what() << std::endl;
}
//send messages to the server
//websocket_outgoing_message msg;
//msg.set_pong_message();
//std::cout << "\n...........2nd.........;";
//std::string data = "hii";
//client.send(msg).then([]() {
//
//
//
//
// /* Successfully sent the message. */ });
//std::cout << " Successfully sent the message.";
//std::cout << "\n...........3rd.........;";
//receive messages from the server
client.receive().then([](websocket_incoming_message msg) {
std::cout << "receiving data from socket";
return msg.extract_string();
}).then([](std::string body) {
//FETCHING THE DATA FROM BODY. "TEXT/JSON"
std::cout << "displaying the data";
std::cout << body << std::endl;
const json::value& v1 = body.substr;
utility::string_t jsonval = v1.serialize();
auto array = v1.at(U("rows")).as_array();
for (int i = 0; i<array.size(); ++i)
{
auto id = array[i].at(U("id")).as_string();
std::wcout << "\n" << id;
auto key = array[i].at(U("key")).as_string();
std::wcout << "\n" << key;
auto array2 = array[i].at(U("value")).as_array();
std::wcout << array2[0];
std::wcout << array2[1];
}
}
);
//close the connection
client.close().then([]() {
std::cout << "successfully close socket connction";
/* Successfully closed the connection. */
});
I have the JSON response in my string body, but I don't know how to parse JSON data from the WebSocket response event. I want to display the contacts from the API response. Please help.
My JSON response:
.{"action":"refresh_dashboard","data":{"users_list":[{"user_id":"901e6076ff351cfc2195fb86f8438a26","extensions":["1002"],"name":"Karthik M"},{"user_id":"cc3f94ecc14ee9c55670dcde9adc1887","extensions":["1006"],"name":"Rounak S Kiran"},{"user_id":"6c29ebdb34e1761fdf9423c573087979","extensions":["1003"],"name":"Amar Nath"},{"user_id":"74d5b5a9aca1faa4c2f217ce87b621d8","extensions":["1008"],"name":"Robin Raju"},{"user_id":"a7ad7e73bf93ea83c8efdc1723cba198","extensions":["1007"],"name":"Arshad Arif"},{"user_id":"b55146df593ec8d09e5fe12a8a4c1108","extensions":["1001"],"name":"Rahib Rasheed"},{"user_id":"3258f7ae4ae1db60435cbcf583f64a89","extensions":["1009"],"name":"Test User"},{"user_id":"90bc84e5e8a3427fe35e99bd4386de95","extensions":["1010"],"name":"Prince T"},{"user_id":"b501ef5b270a196afc0eed557ca74237","extensions":["1005","+17325951060"],"name":"Jineed AJ"},{"user_id":"1422af351e06adeab2de92f5a633a444","extensions":["1004"],"name":"Ashok PA"}],"busy_users":[],"reg_users":[{"user_id":"cc3f94ecc14ee9c55670dcde9adc1887","status":"registered"},{"user_id":"901e6076ff351cfc2195fb86f8438a26","status":"registered"},{"user_id":"1422af351e06adeab2de92f5a633a444","status":"registered"},{"user_id":"3258f7ae4ae1db60435cbcf583f64a89","status":"registered"},{"user_id":"b55146df593ec8d09e5fe12a8a4c1108","status":"registered"},{"user_id":"6c29ebdb34e1761fdf9423c573087979","status":"registered"}],"contacts":[{"owner_id":"cc3f94ecc14ee9c55670dcde9adc1887","status":"ready"},{"owner_id":"901e6076ff351cfc2195fb86f8438a26","status":"ready"},{"owner_id":"1422af351e06adeab2de92f5a633a444","status":"ready"},{"owner_id":"3258f7ae4ae1db60435cbcf583f64a89","status":"ready"},{"owner_id":"b55146df593ec8d09e5fe12a8a4c1108","status":"ready"},{"owner_id":"6c29ebdb34e1761fdf9423c573087979","status":"ready"}]}}
I got the complete solution. Try using the Boost packages from NuGet; the Boost.PropertyTree documentation explains how to parse JSON data from a string. The jsoncpp package available on NuGet does not seem to be up to date, so I suggest the Boost packages instead.
My JSON string:
{"action":"refresh_dashboard","data":{"users_list":[{"user_id":"901e6076ff351cfc2195fb86f8438a26","extensions":["1002"],"name":"Karthik M"},{"user_id":"7d617ef5b2390d081d901b0d5cd108eb","extensions":["1015"],"name":"Synway User2"},{"user_id":"c8f667f7d663e81f6e7fa34b9296f067","extensions":["1012"],"name":"Rahib Video"},{"user_id":"cc3f94ecc14ee9c55670dcde9adc1887","extensions":["1006"],"name":"Rounak S Kiran"},{"user_id":"6c29ebdb34e1761fdf9423c573087979","extensions":["1003"],"name":"Amar Nath"},{"user_id":"8e15c2d95d4325cb07f0750846966be8","extensions":["1011"],"name":"TLS User"},{"user_id":"2fc4142bdacf83c1957bda0ad9d50e3d","extensions":["1014"],"name":"Synway User1"},{"user_id":"74d5b5a9aca1faa4c2f217ce87b621d8","extensions":["1008"],"name":"Robin Raju"},{"user_id":"a7ad7e73bf93ea83c8efdc1723cba198","extensions":["1007"],"name":"Arshad Arif"},{"user_id":"b55146df593ec8d09e5fe12a8a4c1108","extensions":["1001"],"name":"Rahib Rasheed"},{"user_id":"391391de005a8f5403c7b5591f462ea1","extensions":["1013"],"name":"Sangeeth J"},{"user_id":"3258f7ae4ae1db60435cbcf583f64a89","extensions":["1009"],"name":"Aby TL"},{"user_id":"90bc84e5e8a3427fe35e99bd4386de95","extensions":["1010"],"name":"Prince T"},{"user_id":"b501ef5b270a196afc0eed557ca74237","extensions":["1005"],"name":"Jineed AJ"},{"user_id":"1422af351e06adeab2de92f5a633a444","extensions":["1004"],"name":"Ashok PA"}],"busy_users":[],"reg_users":[{"user_id":"901e6076ff351cfc2195fb86f8438a26","status":"registered"},{"user_id":"6c29ebdb34e1761fdf9423c573087979","status":"registered"}],"contacts":[{"owner_id":"901e6076ff351cfc2195fb86f8438a26","status":"ready"},{"owner_id":"6c29ebdb34e1761fdf9423c573087979","status":"ready"}]}}
Code:
client.receive().then([](websocket_incoming_message msg) {
std::cout << "receiving data from socket";
// msg.message_type();
return msg.extract_string();
//1..i have one string
//cout<<"\n///////////test"<< msg.extract_string().get().c_str();
// // 2.convert to json array
//json::value::parse( ::to_string_t(msg.extract_string().get()))
//
}).then([](std::string body) {
//std::cout << "displaying the data";
std::cout << body << std::endl;
std::string ss = body;
ptree pt;
std::istringstream is(ss);
read_json(is, pt);
std::cout <<"\n 1st"<< "action: " << pt.get<std::string>("action") << "\n";
std::cout <<"\n 2nd"<< "data: " << pt.get<std::string>("data") << "\n";
std::cout << "--------------------------------------------------------------";
for (auto& e : pt.get_child("data.users_list")) {
std::cout << "\n" << "users id " << e.second.get<std::string>("user_id") << "\n";
}
});
Useful resources:
Parse JSON array as std::string with Boost ptree
C++ boost parse dynamically generated json string (not a file)

Writing to and reading from a file at the same time

I have two processes. One writes to a file, and the other has to read from it at the same time. So there are two fstreams open on the file at any given time (although they may be in different processes).
I wrote a simple test function to crudely implement the sort of functionality I need:
void test_file_access()
{
try {
std::string file_name = "/Users/xxxx/temp_test_folder/test_file.dat";
std::ofstream out(file_name,
std::ios_base::out | std::ios_base::app | std::ios_base::binary);
out.write("Hello\n", 7);
std::this_thread::sleep_for(std::chrono::seconds(1));
std::array<char, 4096> read_buf;
std::ifstream in(file_name,
std::ios_base::in | std::ios_base::binary);
if (in.fail()) {
std::cout << "Error reading file" << std::endl;
return;
}
in.exceptions(std::ifstream::failbit | std::ifstream::badbit);
//Exception at the below line.
in.read(read_buf.data(), read_buf.size());
auto last_read_size = in.gcount();
auto offset = in.tellg();
std::cout << "Read [" << read_buf.data() << "] from file. read_size = " << last_read_size
<< ", offset = " << offset << std::endl;
out.write("World\n", 7);
std::this_thread::sleep_for(std::chrono::seconds(1));
//Do this so I can continue from the position I was before?
//in.clear();
in.read(read_buf.data(), read_buf.size());
last_read_size = in.gcount();
offset = in.tellg();
std::cout << "Read [" << read_buf.data() << "] from file. read_size = " << last_read_size
<< ", offset = " << offset << std::endl;
//Remove if you don't have boost.
boost::filesystem::remove(file_name);
}
catch(std::ios_base::failure const & ex)
{
std::cout << "Error : " << ex.what() << std::endl;
std::cout << "System error : " << strerror(errno) << std::endl;
}
}
int main()
{
test_file_access();
}
Run, and the output is like this:
Error : ios_base::clear: unspecified iostream_category error
System error : Operation timed out
So two questions,
What is going wrong here? Why do I get an Operation timed out error?
Is this an incorrect attempt to do what I need to get done? If so, what are the problems here?
You write 7 bytes into this file, but then try to read 4096 bytes. The in stream can therefore read only 7 bytes before it reaches end-of-file, which sets eofbit and failbit; because you enabled exceptions for failbit, it throws, as requested. Note that if you catch this exception, the rest of the code executes correctly: last_read_size will be 7, and you can access those 7 bytes in the buffer.
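As an alternative to catching the exception, the short read can be treated as the normal case. The following is only a sketch of that idea, assuming exceptions are not enabled on the stream; read_available() is a hypothetical helper name, and it is demonstrated on a std::stringstream rather than on two processes sharing a file.

```cpp
#include <cassert>
#include <cstddef>
#include <ios>
#include <istream>
#include <sstream>
#include <string>

// Read up to `capacity` bytes and return how many were actually read.
// Asking for more than is available sets eofbit and failbit, so they are
// cleared here to let a later call pick up data that was appended afterwards.
// Assumes in.exceptions() does not include failbit; otherwise read() throws.
std::size_t read_available(std::istream& in, char* buffer, std::size_t capacity) {
    assert(buffer != nullptr);
    in.read(buffer, static_cast<std::streamsize>(capacity));
    const std::streamsize got = in.gcount();
    if (in.eof() && !in.bad()) {
        in.clear();  // reset only after a short read at end-of-file; badbit stays set
    }
    return static_cast<std::size_t>(got);
}
```

A call such as read_available(in, read_buf.data(), read_buf.size()) would take the place of the in.read() plus in.gcount() pair in the question. For the two-process case the writer also needs to flush (for example with out.flush()) before the reader can see the bytes, and only the first gcount() bytes of the buffer are meaningful, so printing read_buf.data() as a C string relies on the buffer happening to contain a terminator.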