I'm attempting to write a simple program to extract some data from a bunch of AVRO files. The schema for each file may be different so I would like to read the files generically (i.e. without having to pregenerate and then compile in the schema for each) using the C++ interface.
I have been trying to follow the generic.cc example, but it assumes a separate schema, whereas I would like to read the schema from each AVRO file.
Here is my code:
#include <fstream>
#include <iostream>
#include "Compiler.hh"
#include "DataFile.hh"
#include "Decoder.hh"
#include "Generic.hh"
#include "Stream.hh"
const std::string BOLD("\033[1m");
const std::string ENDC("\033[0m");
const std::string RED("\033[31m");
const std::string YELLOW("\033[33m");
int main(int argc, char**argv)
{
std::cout << "AVRO Test\n" << std::endl;
if (argc < 2)
{
std::cerr << BOLD << RED << "ERROR: " << ENDC << "please provide an "
<< "input file\n" << std::endl;
return -1;
}
avro::DataFileReaderBase dataFile(argv[1]);
auto dataSchema = dataFile.dataSchema();
// Write out data schema in JSON for grins
std::ofstream output("data_schema.json");
dataSchema.toJson(output);
output.close();
avro::DecoderPtr decoder = avro::binaryDecoder();
auto inStream = avro::fileInputStream(argv[1]);
decoder->init(*inStream);
avro::GenericDatum datum(dataSchema);
avro::decode(*decoder, datum);
std::cout << "Type: " << datum.type() << std::endl;
return 0;
}
Every time I run the code, no matter which file I use, I get this:
$ ./avrotest twitter.avro
AVRO Test
terminate called after throwing an instance of 'avro::Exception'
what(): Cannot have negative length: -40
Aborted
In addition to my own data files, I have tried using the data files located here: https://github.com/miguno/avro-cli-examples, with the same result.
I tried using the avrocat utility on all of the same files and it works fine. What am I doing wrong?
(NOTE: outputting the data schema for each file in JSON works correctly as expected)
After a bit more experimenting, I figured it out. You're supposed to use DataFileReader templated on GenericDatum. The end result is something like this:
#include <fstream>
#include <iostream>
#include "Compiler.hh"
#include "DataFile.hh"
#include "Decoder.hh"
#include "Generic.hh"
#include "Stream.hh"
const std::string BOLD("\033[1m");
const std::string ENDC("\033[0m");
const std::string RED("\033[31m");
const std::string YELLOW("\033[33m");
int main(int argc, char**argv)
{
std::cout << "AVRO Test\n" << std::endl;
if (argc < 2)
{
std::cerr << BOLD << RED << "ERROR: " << ENDC << "please provide an "
<< "input file\n" << std::endl;
return -1;
}
avro::DataFileReader<avro::GenericDatum> reader(argv[1]);
auto dataSchema = reader.dataSchema();
// Write out data schema in JSON for grins
std::ofstream output("data_schema.json");
dataSchema.toJson(output);
output.close();
avro::GenericDatum datum(dataSchema);
while (reader.read(datum))
{
std::cout << "Type: " << datum.type() << std::endl;
if (datum.type() == avro::AVRO_RECORD)
{
const avro::GenericRecord& r = datum.value<avro::GenericRecord>();
std::cout << "Field-count: " << r.fieldCount() << std::endl;
// TODO: pull out each field
}
}
return 0;
}
Perhaps an example like this should be included with libavro...
Related
This question is in reference to: How to read data from AVRO file using C++ interface?
int main(int argc, char**argv)
{
std::cout << "AVRO Test\n" << std::endl;
if (argc < 2)
{
std::cerr << BOLD << RED << "ERROR: " << ENDC << "please provide an "
<< "input file\n" << std::endl;
return -1;
}
avro::DataFileReader<avro::GenericDatum> reader(argv[1]);
auto dataSchema = reader.dataSchema();
// Write out data schema in JSON for grins
std::ofstream output("data_schema.json");
dataSchema.toJson(output);
output.close();
avro::GenericDatum datum(dataSchema);
while (reader.read(datum))
{
std::cout << "Type: " << datum.type() << std::endl;
if (datum.type() == avro::AVRO_RECORD)
{
const avro::GenericRecord& r = datum.value<avro::GenericRecord>();
std::cout << "Field-count: " << r.fieldCount() << std::endl;
// TODO: pull out each field
}
}
return 0;
}
I used this code, but I keep getting a seg fault at the while loop. I have a very large schema and a large amount of data. Decoding the data piece by piece, as the Avro "cpx" example does, is not practical; I need a generic way of reading. I get the seg fault on the third pass through the loop (consistently), with no error code returned from read(). I am open to any and all suggestions and ideas about reading large schemas in Avro.
As it turns out, there is an open ticket on the Avro issue tracker for this exact problem: https://issues.apache.org/jira/browse/AVRO-3194
I am using the boost gzip_decompressor() from the following link:
How can I read line-by-line using Boost IOStreams' interface for Gzip files?
Reading the gzip file works fine, but how do I read the gzip_params? I want to know the original file name that's stored in the gzip_params.file_name.
Excellent question.
The solution is to use component<N, T> to get a pointer to the actual decompressor instance:
#include <iostream>
#include <fstream>
#include <string>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>
int main()
{
std::ifstream file("file.gz", std::ios_base::in | std::ios_base::binary);
try {
boost::iostreams::filtering_istream in;
using gz_t = boost::iostreams::gzip_decompressor;
in.push(gz_t());
in.push(file);
for(std::string str; std::getline(in, str); )
{
std::cout << "Processed line " << str << '\n';
}
if (gz_t* gz = in.component<0, gz_t>()) {
std::cout << "Original filename: " << gz->file_name() << "\n";
std::cout << "Original mtime: " << gz->mtime() << "\n";
std::cout << "Zip comment: " << gz->comment() << "\n";
}
}
catch(const boost::iostreams::gzip_error& e) {
std::cout << e.what() << '\n';
}
}
Preparing a sample file using
gzip testj.txt
mv testj.txt.gz file.gz
Prints
Processed line Hello world
Original filename: testj.txt
Original mtime: 1518987084
Zip comment:
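As a cross-check that does not involve Boost at all, the original name can also be read straight from the gzip header. Per RFC 1952, the header starts with the bytes 0x1f 0x8b, the fourth byte holds flags, an optional FEXTRA field (flag 0x04) precedes the name, and the name itself (flag 0x08, FNAME) is a zero-terminated string. The following is a minimal sketch under those assumptions; gzip_original_name is a hypothetical helper, not a Boost function, and it expects the first bytes of the file in a std::string.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: extract the FNAME field from the start of a gzip
// member (RFC 1952) without decompressing. Returns false if the buffer is
// not a gzip header, is truncated, or carries no file name.
bool gzip_original_name(const std::string& bytes, std::string& name) {
    const std::size_t fixed = 10;  // ID1 ID2 CM FLG MTIME(4) XFL OS
    if (bytes.size() < fixed) return false;
    const auto u8 = [&bytes](std::size_t i) {
        return static_cast<unsigned>(static_cast<unsigned char>(bytes[i]));
    };
    if (u8(0) != 0x1f || u8(1) != 0x8b) return false;
    const unsigned flags = u8(3);
    std::size_t pos = fixed;
    if (flags & 0x04) {  // FEXTRA: 2-byte little-endian length, then payload
        if (pos + 2 > bytes.size()) return false;
        pos += 2 + (u8(pos) | (u8(pos + 1) << 8));
    }
    if (!(flags & 0x08)) return false;  // FNAME flag not set
    if (pos >= bytes.size()) return false;
    const std::size_t end = bytes.find('\0', pos);
    if (end == std::string::npos) return false;
    name = bytes.substr(pos, end - pos);
    return true;
}
```

Reading the first few hundred bytes of file.gz into a string and passing them to this function should report the same name that gz->file_name() prints above, provided the file was compressed with the name stored in the header.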
In my C++ code I need to write a lot of data to a file, and I would like to use a Boost memory-mapped file instead of a normal file. Only when I have finished writing all the data in memory would I like to dump the mapped file to disk in one shot.
I use Visual Studio 2010 on Windows Server 2008 R2 and Boost 1.58.
I've never used a mapped file before, so I tried to compile the example from the Boost documentation:
#include <iostream>
#include <fstream>
#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>
int main(int argc, char** argv)
{
using namespace boost::interprocess;
const char* fileName = "C:\\logAcq\\test.bin";
const std::size_t fileSize = 10000;
std::cout << "create file" << std::endl;
try
{
file_mapping::remove(fileName);
std::filebuf fbuf;
fbuf.open(fileName, std::ios_base::in | std::ios_base::out | std::ios_base::trunc | std::ios_base::binary);
std::cout << "set size" << std::endl;
fbuf.pubseekoff(fileSize-1, std::ios_base::beg);
fbuf.sputc(0);
std::cout << "remove on exit" << std::endl;
struct file_remove
{
file_remove(const char* fileName)
:fileName_(fileName) {}
~file_remove(){ file_mapping::remove(fileName_); }
const char *fileName_;
}remover(fileName);
std::cout << "create file mapping" << std::endl;
file_mapping m_file(fileName, read_write);
std::cout << "map the whole file" << std::endl;
mapped_region region(m_file, read_write);
std::cout << "get the address" << std::endl;
void* addr = region.get_address();
std::size_t size = region.get_size();
std::cout << "write all memory to 1" << std::endl;
memset(addr, 1, size);
}
catch (interprocess_exception &ex)
{
fprintf(stderr, "Exception %s\n", ex.what());
fflush(stderr);
system("PAUSE");
return 0;
}
system("PAUSE");
return 0;
}
but I get the exception
Exception The volume for a file has been externally altered so that the opened file is no longer valid.
when I create the region
"mapped_region region(m_file, read_write)"
Any help is appreciated.
Thanks
Exception The volume for a file has been externally altered so that the opened file is no longer valid.
This strongly suggests that the file was changed by another program while it was mapped, and the error message indicates that the change affected the size in a way that is not allowed.
Avoid having other programs write to the file, or put proper synchronization and sharing precautions in place (for example, don't change the size, or only grow it).
UPDATE
Your added SSCCE confirms that you held the file open while mapping.
You need to close the fbuf before mapping the file. Also, the mapping and the mapped region need to be destroyed before the file itself is removed.
Working sample:
#include <cstdio>   // fprintf, fflush
#include <cstring>  // memset
#include <iostream>
#include <fstream>
#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>
int main() {
using namespace boost::interprocess;
const char *fileName = "test.bin";
const std::size_t fileSize = 10000;
std::cout << "create file " << fileName << std::endl;
try {
file_mapping::remove(fileName);
{
std::filebuf fbuf;
fbuf.open(fileName, std::ios_base::in | std::ios_base::out | std::ios_base::trunc | std::ios_base::binary);
std::cout << "set size" << std::endl;
fbuf.pubseekoff(fileSize - 1, std::ios_base::beg);
fbuf.sputc(0);
}
std::cout << "remove on exit" << std::endl;
struct file_remove {
file_remove(const char *fileName) : fileName_(fileName) {}
~file_remove() { file_mapping::remove(fileName_); }
const char *fileName_;
} remover(fileName);
{
std::cout << "create file mapping" << std::endl;
file_mapping m_file(fileName, read_write);
std::cout << "map the whole file" << std::endl;
mapped_region region(m_file, read_write);
std::cout << "get the address" << std::endl;
void *addr = region.get_address();
std::size_t size = region.get_size();
std::cout << "write all memory to 1" << std::endl;
memset(addr, 1, size);
}
} catch (interprocess_exception &ex) {
fprintf(stderr, "Exception %s\n", ex.what());
fflush(stderr);
}
}
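The "close the fbuf before mapping" step can also be made hard to get wrong by confining the std::filebuf to a small function, so that it is flushed and closed by the time the caller creates the file_mapping. The sketch below uses only the standard library; create_sized_file and file_size are hypothetical helper names, and the mapping itself would still be done with Boost as in the sample above.

```cpp
#include <cstddef>
#include <fstream>
#include <ios>
#include <string>

// Hypothetical helper: create (or truncate) a file of exactly `size` bytes.
// The filebuf lives only inside this function, so the file is closed again
// before the caller goes on to map it.
bool create_sized_file(const std::string& path, std::size_t size) {
    if (size == 0) return false;
    std::filebuf fbuf;
    if (!fbuf.open(path, std::ios_base::out | std::ios_base::trunc | std::ios_base::binary))
        return false;
    // Seek to the last byte and write it; the file then spans `size` bytes.
    const auto failed = std::filebuf::pos_type(std::filebuf::off_type(-1));
    if (fbuf.pubseekoff(static_cast<std::filebuf::off_type>(size - 1), std::ios_base::beg) == failed)
        return false;
    if (fbuf.sputc('\0') == std::filebuf::traits_type::eof())
        return false;
    return fbuf.close() != nullptr;  // explicit close so failures are reported
}

// Hypothetical helper: size of a file in bytes, or -1 if it cannot be opened.
long long file_size(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return -1;
    return static_cast<long long>(std::streamoff(in.tellg()));
}
```

With this in place, the calling code would run create_sized_file(fileName, fileSize) first and only construct the file_mapping once it has returned true.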
Description:
I am trying to move all the files in a directory to a user-chosen directory, based on their extension, using Boost.Filesystem.
Problem:
When the rename/copy_file call of Boost.Filesystem is reached, I receive the "R6010 - abort() has been called" error.
Example:
SourceDirectory:C:\Source\a.txt
DestinationDirectory:C:\Destination
After execution:
DestinationDirectory:C:\Destination\a.txt
Code:
#include <iostream>
#include <string>
#include <sstream>
#include <vector>
#include "boost/filesystem/operations.hpp"
#include "boost/filesystem/path.hpp"
#include "boost/progress.hpp"
#include "boost/algorithm/string/regex.hpp"
#include "boost/regex.hpp"
namespace fs = boost::filesystem;
using namespace std;
void categorizeFolder()
{
//Source Folder
std::string folderToCategorize;
cout<<"Choose the folder you want to categorize:";
cin>>folderToCategorize;
cout << "The directory you have choosen is: " << folderToCategorize << endl;
//Destination folder
std::string newfolder;
cout<<"Choose the folder you want to store your files:";
cin>>newfolder;
cout << "The directory you have choosen is: " << newfolder << endl;
std::vector< std::string > all_matching_files;
boost::filesystem::directory_iterator end_itr;
for( boost::filesystem::directory_iterator i( folderToCategorize ); i != end_itr; ++i )
{
if( !boost::filesystem::is_regular_file( i->status() ) ) continue;
if( i->path().extension() == ".txt" )
{
cout<<i->path().extension();//Printing File extension
cout<<i->path();//Printing file path
cout<<i->path().filename()<<endl; //Printing filename
fs::rename(i->path(), newfolder);//This would move the file//Even tried fs::copy_file(i->path(), newfolder)
}
}
}
Kindly let me know if I am missing something in the above code. Thanks in advance.
Regards,
Ravi
On Linux, the error looks like this:
terminate called after throwing an instance of 'boost::filesystem::filesystem_error'
what(): boost::filesystem::rename: Is a directory: "/tmp/first/test.txt", "/tmp/second"
The fact that the API call is named rename and not, e.g., moveToFolder should have given you an idea that you need to supply a full pathname as the target. Use
fs::rename(
it->path(),
fs::path(newfolder) / it->path().filename());
to fix it.
Here's a version with some better organization and error handling. It will even create the target directory if it doesn't already exist!
#include <boost/filesystem.hpp>
#include <iostream>
#include <string>
namespace fs = boost::filesystem;
using namespace std;
void categorizeFolder(fs::path folderToCategorize, fs::path newfolder)
{
if (!fs::exists(newfolder))
fs::create_directories(newfolder);
if (!fs::is_directory(newfolder))
{
std::cerr << "Destination folder does not exist and could not be created: " << fs::absolute(newfolder) << "\n";
return;
}
for(fs::directory_iterator it(folderToCategorize), end_itr; it != end_itr; ++it)
{
if(!fs::is_regular_file(it->status()))
continue;
if(it->path().extension() == ".txt")
{
// std::cout << it->path().extension() << "\n";
// std::cout << it->path() << "\n";
// std::cout << it->path().filename() << "\n";
fs::rename(it->path(), fs::path(newfolder) / it->path().filename()); // move the file
}
}
}
int main(int argc, const char *argv[])
{
if (argc<3)
{
std::cout << "Usage: " << argv[0] << " folderToCategorize newfolder\n";
return 255;
}
std::string const folderToCategorize = argv[1];
std::string const newfolder = argv[2];
std::cout << "The source directory you have chosen is: " << folderToCategorize << endl;
std::cout << "The destination directory you have chosen is: " << newfolder << endl;
categorizeFolder(folderToCategorize, newfolder);
}
I am trying a reasonably simple program to test binary input/output. I am basically writing a file with a header (string) and some data (doubles). The code is as follows:
#include <iostream>
#include <iomanip>
#include <fstream>
#include <string>
#include <vector>
#include <iterator>
#include <algorithm>
int main() {
typedef std::ostream_iterator<double> oi_t;
typedef std::istream_iterator<double> ii_t;
std::ofstream ofs("data.bin", std::ios::in);
//-If file doesn't exist, create a new one now
if(!ofs) {
ofs.open("data.bin", std::ios::out|std::ios::binary|std::ios::app);
}
else {
ofs.close();
ofs.open("data.bin", std::ios::out|std::ios::binary|std::ios::app);
}
//-Write a header consisting of length of grid subdomain and its name
///*
const std::string grid = "Header";
unsigned int olen = grid.size();
ofs.write(reinterpret_cast<const char*>(&olen), sizeof(olen));
ofs.write(grid.c_str(), olen);
//*/
//-Now write the data
///*
std::vector<double> data_out;
//std::vector<std::pair<int, int> > cell_ids;
for(int i=0; i<100; ++i) {
data_out.push_back(5.0*double(i) + 100.0);
}
ofs << std::setprecision(4);
std::copy(data_out.begin(), data_out.end(), oi_t(ofs, " "));
//*/
ofs.close();
//-Now read the binary file; first header then data
std::ifstream ifs("data.bin", std::ios::binary);
///*
unsigned int ilen;
ifs.read(reinterpret_cast<char*>(&ilen), sizeof(ilen));
std::string header;
if(ilen > 0) {
char* buf = new char[ilen];
ifs.read(buf,ilen);
header.append(buf,ilen);
delete[] buf;
}
std::cout << "Read header: " << header << "\n";
//*/
///*
std::vector<double> data_in;
ii_t ii(ifs);
std::copy(ii, ii_t(), std::back_inserter(data_in));
std::cout << "Read data size: " << data_in.size() << "\n";
//*/
ifs.close();
//-Check the result
///*
for(int i=0; i < data_out.size(); ++i) {
std::cout << "Testing input/output element #" << i << " : "
<< data_out[i] << " " << data_in[i] << "\n";
}
std::cout << "Element sizes: " << data_out.size() << " " << data_in.size() <<
"\n";
//*/
return 0;
}
The problem is that when I write and then read back (and print) both the header and the data, it fails: the header is displayed correctly, but the data is not read. When I comment out one of the write sections (header or data), the remaining part is displayed correctly, indicating that the read worked. I am sure I am not doing the read properly; perhaps I am missing a call to seekg somewhere.
The code runs fine for me. However you never check if the file is successfully opened for writing, so it could be silently failing on your system. After you open ofs you should add
if (!ofs) {
std::cout << "Could not open file for writing" << std::endl;
return 1;
}
And do the same thing after you open ifs:
if (!ifs) {
std::cout << "Could not open file for reading" << std::endl;
return 1;
}
Or something along those lines. Also I do not understand why you check if the file exists first since you do the same whether it exists or not.
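To make the stream checks concrete, here is one possible round trip with a check after every open and read. It deliberately differs from the code in the question in two ways, both of which are assumptions of this sketch rather than fixes confirmed above: the doubles are written as raw bytes instead of formatted text, and the file is opened with trunc rather than app so that repeated runs do not keep appending to data.bin. The names write_block and read_block are hypothetical, and the format is only portable between machines with the same endianness and double representation.

```cpp
#include <cstdint>
#include <fstream>
#include <ios>
#include <string>
#include <vector>

// Hypothetical helper: write a length-prefixed header, then a count, then
// the doubles as raw bytes. Returns false if the file cannot be written.
bool write_block(const std::string& path, const std::string& header,
                 const std::vector<double>& data) {
    std::ofstream ofs(path, std::ios::binary | std::ios::trunc);
    if (!ofs) return false;
    const std::uint32_t hlen = static_cast<std::uint32_t>(header.size());
    const std::uint64_t count = data.size();
    ofs.write(reinterpret_cast<const char*>(&hlen), sizeof hlen);
    ofs.write(header.data(), hlen);
    ofs.write(reinterpret_cast<const char*>(&count), sizeof count);
    ofs.write(reinterpret_cast<const char*>(data.data()),
              static_cast<std::streamsize>(count * sizeof(double)));
    ofs.flush();
    return static_cast<bool>(ofs);
}

// Hypothetical helper: read back what write_block produced, checking the
// stream state after each read.
bool read_block(const std::string& path, std::string& header,
                std::vector<double>& data) {
    std::ifstream ifs(path, std::ios::binary);
    if (!ifs) return false;
    std::uint32_t hlen = 0;
    if (!ifs.read(reinterpret_cast<char*>(&hlen), sizeof hlen)) return false;
    header.assign(hlen, '\0');
    if (hlen > 0 && !ifs.read(&header[0], hlen)) return false;
    std::uint64_t count = 0;
    if (!ifs.read(reinterpret_cast<char*>(&count), sizeof count)) return false;
    data.assign(static_cast<std::size_t>(count), 0.0);
    if (count > 0 &&
        !ifs.read(reinterpret_cast<char*>(data.data()),
                  static_cast<std::streamsize>(count * sizeof(double))))
        return false;
    return true;
}
```

Because every step reports failure through the return value, a silent open failure of the kind described above shows up immediately instead of as an empty data vector later on.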
This should work:
#include <iostream>
using std::cout;
using std::cerr;
using std::cin;
using std::endl;
#include <fstream>
using std::ifstream;
#include <cstdint>
#include <cstdlib>  // exit
int main() {
ifstream fin;
fin.open("input.dat", std::ios::binary | std::ios::in);
if (!fin) {
cerr << "Cannot open file " << "input.dat" << endl;
exit(1);
}
uint8_t input_byte;
while (fin >> input_byte) {
cout << "got byte " << input_byte << endl;
}
return 0;
}