Open CSV file with Apache Arrow in C++ - c++

I'm trying to read a csv file with Apache Arrow but I can't get my head around the InputStream...
It seems the example in their documentation is out of date.
I've tweaked the example a bit, but I get an "Access violation reading location" exception. Any idea what I'm doing wrong?
Thanks
My code:
arrow::MemoryPool* pool = arrow::default_memory_pool();
std::shared_ptr<arrow::io::ReadableFile> infile;
infile->Open("test.csv", pool);
auto read_options = arrow::csv::ReadOptions::Defaults();
auto parse_options = arrow::csv::ParseOptions::Defaults();
auto convert_options = arrow::csv::ConvertOptions::Defaults();
// Instantiate TableReader from input stream and options
std::shared_ptr<arrow::csv::StreamingReader> reader;
auto res1 = reader->Make(pool, infile, read_options, parse_options, convert_options);
if (!res1.ok()) {
// Handle TableReader instantiation error...
}
std::shared_ptr<arrow::Table> table;
// Read table from CSV file
auto res2 = reader->ReadAll(&table);
if (!res2.ok()) {
// Handle CSV read error
// (for example a CSV syntax error or failed type conversion)
}
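One thing that stands out in the code above: `infile` and `reader` are default-constructed `std::shared_ptr`s, so both hold null, and the results of `Open` and `Make` are never assigned back to them. Calling `reader->ReadAll(...)` then goes through a null pointer, which would explain an access violation. The sketch below shows the pattern with the standard library only. `FakeFile` and the helper functions are hypothetical stand-ins, not Arrow classes.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a library type that is created through a
// static factory function (this is NOT an Arrow class).
class FakeFile {
 public:
  // The factory returns the new object; the caller must keep the result.
  static std::shared_ptr<FakeFile> Open(const std::string& path) {
    return std::shared_ptr<FakeFile>(new FakeFile(path));
  }
  const std::string& path() const { return path_; }

 private:
  explicit FakeFile(std::string path) : path_(std::move(path)) {}
  std::string path_;
};

// A default-constructed shared_ptr owns nothing and compares equal to nullptr.
inline bool DefaultConstructedIsNull() {
  std::shared_ptr<FakeFile> file;
  return file == nullptr;
}

// Correct usage: call the factory on the class and assign what it returns.
inline std::shared_ptr<FakeFile> OpenCorrectly(const std::string& path) {
  std::shared_ptr<FakeFile> file = FakeFile::Open(path);
  return file;
}
```

The Arrow example further down this page has the same shape: it calls `arrow::csv::TableReader::Make(...)` on the class, checks `ok()`, and only then unwraps the result into the `shared_ptr` it uses.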

Related

How to get a file by path in C++/WinRT + how to properly handle async calls without crashing?

I'm new to C++ and Windows and trying to write a WinRT/C++ app that can grab a file via its path and run it through the built-in Media::Ocr model, but I can't even seem to get the file into my application as a StorageFile. Here's how I've been trying to do it (to no avail):
runModel()
---------
IAsyncAction runModel() {
string _pth = "pic.png";
hstring pth = to_hstring(_pth);
StorageFile file = nullptr;
co_await LoadFileAsync(pth);
}
IAsyncAction LoadFileAsync (hstring pth) {
StorageFile file = co_await Windows::Storage::StorageFile::GetFileFromPathAsync(pth);
co_await LoadImageAsync(file);
}
IAsyncAction LoadImageAsync(StorageFile const& file) {
auto stream = co_await file.OpenAsync(Windows::Storage::FileAccessMode::Read);
auto decoder = co_await BitmapDecoder::CreateAsync(stream);
auto bitmap = co_await decoder.GetSoftwareBitmapAsync();
auto imgSource = WriteableBitmap(bitmap.PixelWidth(), bitmap.PixelHeight());
bitmap.CopyToBuffer(imgSource.PixelBuffer());
OcrEngine ocrEngine = nullptr;
ocrEngine = OcrEngine::TryCreateFromUserProfileLanguages();
auto ocrResult = co_await ocrEngine.RecognizeAsync(bitmap);
hstring hresult = ocrResult.Text();
std::cout << hresult.begin();
}
I'd appreciate any help with the following:
Loading the file in as a StorageFile
The right way to handle these async calls. I tried to follow https://learn.microsoft.com/en-us/windows/uwp/cpp-and-winrt-apis/concurrency but I couldn't seem to get it to work without this error: !isStaThread()
How to properly do the OCR
Thanks!

Read CSV from std::vector<unsigned char> using Apache Arrow

I am trying to read CSV input using Apache Arrow. The example here mentions that the input should be an InputStream, but in my case I just have a std::vector of unsigned chars. Is it possible to parse this using Apache Arrow? I have checked the I/O interface for an "in-memory" data structure, with no luck.
I copy-paste the example code for convenience here as well as my input data:
#include "arrow/csv/api.h"
{
// ...
std::vector<unsigned char> data;
arrow::io::IOContext io_context = arrow::io::default_io_context();
// how can I fit the std::vector to the input stream?
std::shared_ptr<arrow::io::InputStream> input = ...;
auto read_options = arrow::csv::ReadOptions::Defaults();
auto parse_options = arrow::csv::ParseOptions::Defaults();
auto convert_options = arrow::csv::ConvertOptions::Defaults();
// Instantiate TableReader from input stream and options
auto maybe_reader =
arrow::csv::TableReader::Make(io_context,
input,
read_options,
parse_options,
convert_options);
if (!maybe_reader.ok()) {
// Handle TableReader instantiation error...
}
std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
// Read table from CSV file
auto maybe_table = reader->Read();
if (!maybe_table.ok()) {
// Handle CSV read error
// (for example a CSV syntax error or failed type conversion)
}
std::shared_ptr<arrow::Table> table = *maybe_table;
}
Any help would be appreciated!
The I/O interface docs list BufferReader, which works as an in-memory input stream. Although the docs don't mention it, it can be constructed from a pointer and a size, which should let you use your std::vector<unsigned char> directly.
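To make the pointer-and-size idea concrete, here is a minimal sketch using only the standard library. `SpanReader` and `MakeReader` are hypothetical names that only illustrate the concept; they are not Arrow's BufferReader. The sketch assumes `std::uint8_t` is an alias for `unsigned char`, as it is on mainstream platforms, so `data()` converts without a cast. With Arrow, the same two values (`data()` and `size()`) would be what you hand to the BufferReader constructor described above.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, non-owning in-memory reader over a pointer and a size.
// It only illustrates the idea; it is not Arrow's BufferReader.
class SpanReader {
 public:
  SpanReader(const std::uint8_t* data, std::int64_t size)
      : data_(data), size_(size), pos_(0) {}

  // Copies up to `nbytes` bytes from the current position into a string.
  std::string Read(std::int64_t nbytes) {
    const std::int64_t remaining = size_ - pos_;
    std::int64_t n = nbytes < remaining ? nbytes : remaining;
    if (n < 0) n = 0;  // Treat negative requests as "read nothing".
    std::string out(reinterpret_cast<const char*>(data_ + pos_),
                    static_cast<std::size_t>(n));
    pos_ += n;
    return out;
  }

 private:
  const std::uint8_t* data_;
  std::int64_t size_;
  std::int64_t pos_;
};

// Builds a reader over the vector's storage without copying it.
// The vector must outlive the reader and must not be resized meanwhile.
inline SpanReader MakeReader(const std::vector<unsigned char>& bytes) {
  return SpanReader(bytes.data(), static_cast<std::int64_t>(bytes.size()));
}
```

Because a reader built this way does not copy the bytes, keep the vector alive and unmodified for as long as the reader (and anything reading from it) is in use.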

Write to existing json file

I am using this code to add to my existing JSON file. However, it completely overwrites my JSON file and puts just one JSON object in it, when I would like to add another item to the list of items already in the file. How would I fix this?
Json::Value root;
root[h]["userM"] = m;
root[h]["userT"] = t;
root[h]["userF"] = f;
root[h]["userH"] = h;
root[h]["userD"] = d;
Json::StreamWriterBuilder builder;
std::unique_ptr<Json::StreamWriter> writer(builder.newStreamWriter());
std::ofstream outputFileStream("messages.json");
writer->write(root, &outputFileStream);
My recommendation is
Load the file into a Json::Value
Add or change whatever fields you want
Overwrite the original file with the updated Json::Value
Doing this is going to be the least error-prone method, and it'll work quickly unless you have a very large Json file.
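The three steps above can be sketched with the standard library alone. Plain text stands in for the JSON document here, and the helper names (`LoadFile`, `OverwriteFile`, `UpdateFile`) are made up for illustration. The point it shows is that `std::ofstream` truncates an existing file when it opens it, so the old contents have to be loaded before the file is reopened for writing.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Step 1: read the whole existing file into memory.
// Returns an empty string if the file cannot be opened.
inline std::string LoadFile(const std::string& path) {
  std::ifstream in(path, std::ios::binary);
  std::ostringstream buffer;
  buffer << in.rdbuf();
  return buffer.str();
}

// Step 3: replace the file's contents. std::ofstream truncates an existing
// file by default, which is why the old data must be loaded first.
inline bool OverwriteFile(const std::string& path, const std::string& content) {
  std::ofstream out(path, std::ios::binary | std::ios::trunc);
  out << content;
  return static_cast<bool>(out);
}

// Steps 1-3 together. `edit` stands in for "add or change fields"; with
// JsonCpp it would work on a parsed Json::Value instead of raw text.
template <typename Edit>
bool UpdateFile(const std::string& path, Edit edit) {
  std::string content = LoadFile(path);
  edit(content);  // Step 2: modify the in-memory copy.
  return OverwriteFile(path, content);
}
```

In the question's code, the equivalent change would be to fill `root` from messages.json before assigning `root[h][...]`, and only then open the `std::ofstream`.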
How to read in the entire file
This is pretty simple! We create the root value, then let Json::Reader parse the file stream into it.
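#include <istream>
#include <json/json.h>  // JsonCpp; the include path may differ per install

Json::Value readFile(std::istream& file) {
    Json::Value root;
    Json::Reader reader;
    // parse() fills root and returns false if the input is not valid JSON
    bool parsingSuccessful = reader.parse(file, root);
    if (!parsingSuccessful) {
        // Handle error case
    }
    return root;
}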
See this documentation here for more information

Read n number of lines from s3 object using AWS lambda

In my Lambda I am trying to parse the content of a document from an S3 bucket. The document I am processing is a txt file larger than 100 MB. I need to parse only the first line of the file.
What is the best cost-effective way to read the file?
Currently, I am taking the content using getObjectContent() method and taking the 1st line from it like this.
private AmazonS3 s3 = AmazonS3ClientBuilder.standard().build();
GetObjectRequest getObjectRequest = new GetObjectRequest(bucket, key);
S3Object s3Object = s3.getObject(getObjectRequest);
BufferedReader reader = new BufferedReader(new InputStreamReader(s3Object.getObjectContent()));
String firstLine;
try {
while ((firstLine = reader.readLine()) != null) {
logger.log("META PROCESSOR | FIRST LINE OF FILE : " + firstLine);
break;
}
} catch (IOException e) {
logger.log("META PROCESSOR | FAILED TO LOAD FIRST LINE ");
return null;
}
Is fetching the entire content just to read the first line a good approach? Is there a method for reading only the first n lines or the first n bytes of a file?

C++/CX - DataReader out of bounds exception

I have the following code, which opens a file. Most of the time it works once; after that, exceptions are thrown and I can't tell where the problem is hiding. I have been looking into this for a couple of days with no luck.
String^ xmlFile = "Assets\\TheXmlFile.xml";
xml = ref new XmlDocument();
StorageFolder^ InstallationFolder = Windows::ApplicationModel::Package::Current->InstalledLocation;
task<StorageFile^>(
InstallationFolder->GetFileAsync(xmlFile)).then([this](StorageFile^ file) {
if (nullptr != file) {
task<Streams::IRandomAccessStream^>(file->OpenAsync(FileAccessMode::Read)).then([this](Streams::IRandomAccessStream^ stream)
{
IInputStream^ deInputStream = stream->GetInputStreamAt(0);
DataReader^ reader = ref new DataReader(deInputStream);
reader->InputStreamOptions = InputStreamOptions::Partial;
reader->LoadAsync(stream->Size);
strXml = reader->ReadString(stream->Size);
MessageDialog^ dlg = ref new MessageDialog(strXml);
dlg->ShowAsync();
});
}
});
The error is triggered at this part of the code:
strXml = reader->ReadString(stream->Size);
I get the following error:
First-chance exception at 0x751F5B68 in XmlProject.exe: Microsoft C++ exception: Platform::OutOfBoundsException ^ at memory location 0x02FCD634. HRESULT:0x8000000B The operation attempted to access data outside the valid range
WinRT information: The operation attempted to access data outside the valid range
Just like I said, the first time it just works but after that I get the error. I tried detaching the stream and buffer of the datareader and tried to flush the stream but no results.
I also asked this question on the Microsoft C++ forums and, thanks to the user "Viorel_", I managed to get it working. Viorel said the following:
Since LoadAsync does not perform the operation immediately, you should probably add a corresponding “.then”. See some code: https://social.msdn.microsoft.com/Forums/windowsapps/en-US/94fa9636-5cc7-4089-8dcf-7aa8465b8047. This sample uses “create_task” and “then”: https://code.msdn.microsoft.com/vstudio/StreamSocket-Sample-8c573931/sourcecode (file Scenario1.xaml.cpp, for example).
I had to take the contents of the task<Streams::IRandomAccessStream^> continuation and split them into separate tasks, so that the read only runs after LoadAsync has completed.
I restructured my code and now have the following:
String^ xmlFile = "Assets\\TheXmlFile.xml";
xml = ref new XmlDocument();
StorageFolder^ InstallationFolder = Windows::ApplicationModel::Package::Current->InstalledLocation;
task<StorageFile^>(
    InstallationFolder->GetFileAsync(xmlFile)).then([this](StorageFile^ file) {
    if (nullptr != file) {
        task<Streams::IRandomAccessStream^>(file->OpenAsync(FileAccessMode::Read)).then([this](Streams::IRandomAccessStream^ stream)
        {
            IInputStream^ deInputStream = stream->GetInputStreamAt(0);
            DataReader^ reader = ref new DataReader(deInputStream);
            reader->InputStreamOptions = InputStreamOptions::Partial;
            create_task(reader->LoadAsync(stream->Size)).then([reader, stream](unsigned int size){
                strXml = reader->ReadString(stream->Size);
                MessageDialog^ dlg = ref new MessageDialog(strXml);
                dlg->ShowAsync();
            });
        });
    }
});
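The rule behind the fix is that the read must not run until the load has completed. That can be modelled without WinRT. In the sketch below, `FakeReader` is a hypothetical stand-in (not the real DataReader) whose `LoadAsync` only queues the work, with `RunPending` playing the role of the operation completing later. Reading right after `LoadAsync` fails, while reading inside the continuation succeeds and can use the size that the load reports.

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for a reader with an asynchronous load step.
class FakeReader {
 public:
  explicit FakeReader(std::string source) : source_(std::move(source)) {}

  // Queues the load and a continuation; nothing is loaded until
  // RunPending() is called, which models the operation completing later.
  void LoadAsync(std::size_t count, std::function<void(std::size_t)> then) {
    pending_ = [this, count, then]() {
      loaded_ = source_.substr(0, count);
      then(loaded_.size());
    };
  }

  // Completes the queued load, then invokes its continuation.
  void RunPending() {
    if (pending_) {
      std::function<void()> work = std::move(pending_);
      pending_ = nullptr;
      work();
    }
  }

  // Reads from what has been loaded so far; throws if asked for more.
  std::string ReadString(std::size_t count) const {
    if (count > loaded_.size()) {
      throw std::out_of_range("read past loaded data");
    }
    return loaded_.substr(0, count);
  }

 private:
  std::string source_;
  std::string loaded_;
  std::function<void()> pending_;
};

// Mirrors the original bug: the read runs right after LoadAsync is called,
// before the load has completed.
inline bool ReadRightAfterLoadAsyncThrows(const std::string& text) {
  FakeReader reader(text);
  reader.LoadAsync(text.size(), [](std::size_t) {});
  try {
    reader.ReadString(text.size());
  } catch (const std::out_of_range&) {
    return true;
  }
  return false;
}

// Mirrors the fix: the read lives in the continuation and uses the size
// that the load reports.
inline std::string ReadInContinuation(const std::string& text) {
  FakeReader reader(text);
  std::string result;
  reader.LoadAsync(text.size(), [&reader, &result](std::size_t size) {
    result = reader.ReadString(size);
  });
  reader.RunPending();
  return result;
}
```

In the C++/CX code above, the continuation already receives the loaded byte count as `size`. Passing that to `ReadString` instead of `stream->Size` may be safer when `InputStreamOptions::Partial` lets the load return fewer bytes than requested.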