How to read json lines into arrow::Table in c++? - apache-arrow

Judging from the git log, support for line-separated JSON in C++ seems to have never been completed.
example JSON lines:
https://jsonlines.org/examples/
relevant docs in arrow:
https://arrow.apache.org/docs/cpp/json.html#basic-usage
I am trying to compare an Arrow table with a line-delimited JSON file [1].
Several approaches I have tried:
Read the line-delimited JSON into an arrow::Table. Because we do not know the schema of the JSON up front (e.g. which keys it contains and what the types of the values are), there does not seem to be a way to construct the arrow::Schema ourselves.
Convert the arrow::Table to line-delimited JSON by turning each row of the arrow::Table into a JSON line. arrow::Table is in columnar format, so it seems we have to traverse the columns to get elements by index (see the sketches after the footnote below).
[1] https://arrow.apache.org/docs/python/json.html
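For what it's worth, the C++ docs linked above describe an arrow::json::TableReader that reads newline-delimited JSON and infers the schema from the data when no explicit schema is given. Below is a minimal sketch of both directions, assuming a recent Arrow C++ build with JSON support enabled; the file path, function names, and the use of Scalar::ToString() are illustrative (ToString() gives a debug representation, not JSON-escaped output).

#include <memory>
#include <ostream>
#include <string>

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <arrow/json/api.h>

// Reads a newline-delimited JSON file into an arrow::Table, letting Arrow
// infer the schema (ParseOptions::explicit_schema is left unset).
arrow::Result<std::shared_ptr<arrow::Table>> ReadJsonLines(const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto input, arrow::io::ReadableFile::Open(path));
  auto read_options = arrow::json::ReadOptions::Defaults();
  auto parse_options = arrow::json::ParseOptions::Defaults();
  ARROW_ASSIGN_OR_RAISE(auto reader,
                        arrow::json::TableReader::Make(arrow::default_memory_pool(), input,
                                                       read_options, parse_options));
  return reader->Read();
}

// Rough sketch of the opposite direction: emit one JSON-ish line per row by
// traversing the columns at each row index. CombineChunks() flattens every
// column to a single chunk, so chunk(0) covers all rows.
arrow::Status TableToJsonLines(const std::shared_ptr<arrow::Table>& table, std::ostream& out) {
  ARROW_ASSIGN_OR_RAISE(auto combined, table->CombineChunks());
  auto schema = combined->schema();
  for (int64_t row = 0; row < combined->num_rows(); ++row) {
    out << "{";
    for (int col = 0; col < combined->num_columns(); ++col) {
      std::shared_ptr<arrow::Array> array = combined->column(col)->chunk(0);
      ARROW_ASSIGN_OR_RAISE(auto scalar, array->GetScalar(row));
      if (col > 0) out << ", ";
      out << "\"" << schema->field(col)->name() << "\": " << scalar->ToString();
    }
    out << "}\n";
  }
  return arrow::Status::OK();
}

With both pieces, the comparison could be done either on the Arrow side (read the JSON file into a second table and compare tables) or on the text side (dump the table back to JSON lines), whichever representation is easier to diff.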

Related

AWS DynamoDB fails with "The provided key element does not match the schema" because of CSV format

I am working on importing data from an S3 bucket into DynamoDB using Data Pipeline. The data is in CSV format. I have been struggling with this for a week now, and finally came to know the real problem.
I have some fields; the important ones are id (partition key) and username (sort key).
Now, one of the entries in the data has a username containing a comma, for example: {"username": "someuser,name"}. The problem with the CSV file is that when mapping to DynamoDB through a CSV (comma-separated) file, the comma is treated as the start of a new column. And so it fails with the error "The provided key element does not match the schema", which is of course correct.
Is there any way I can overcome this issue? Thanks in advance for your suggestions.
EDIT:
The CSV entry looks like this, as an example:
1234567,"user,name",$123$,some#email.de,2002-05-28 14:07:04.0,2013-07-19 14:17:05.0,2020-02-19 15:32:18.611,2014-02-27 14:49:19.0,,,,
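Not a Data Pipeline fix as such, but the sketch below (plain C++, purely illustrative) shows what goes wrong: splitting on every comma breaks the quoted "user,name" value into two columns, while a splitter that tracks quote state keeps it as one field. If you can preprocess the file before the import, something along these lines avoids the column mis-alignment.

#include <string>
#include <vector>

// Minimal RFC 4180-style splitter: commas inside double-quoted fields are
// kept as part of the field instead of starting a new column.
// Illustration only; it does not handle embedded newlines or escaped quotes ("").
std::vector<std::string> SplitCsvLine(const std::string& line) {
  std::vector<std::string> fields;
  std::string current;
  bool in_quotes = false;
  for (char c : line) {
    if (c == '"') {
      in_quotes = !in_quotes;      // toggle quoted state, drop the quote itself
    } else if (c == ',' && !in_quotes) {
      fields.push_back(current);   // field boundary only outside quotes
      current.clear();
    } else {
      current.push_back(c);
    }
  }
  fields.push_back(current);
  return fields;
}

Run over the example line above, this yields user,name as a single second field instead of splitting it in two.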

Athena Query Results: Are they always strings?

I'm in the process of building new "ETL" pipelines with CTAS. Unfortunately, quite often the CTAS query is too intensive, which causes Athena to time out. As such, I use CTAS to create the initial table and populate it with a small sample. I then write a script that queries the same table the CTAS was generated from (which is in Parquet format) for the remaining days that the CTAS couldn't handle upfront. I write the output of these query results to the same directory that holds the results of the CTAS query before repairing the table (to pick up the new data). However, it seems to be a pretty clunky process for a number of reasons:
1) Query results written out with standard SQL statements all end up being strings. For example, when I write out the number of DAUs (which is a count cast to an int), the CSV output is a string, i.e. wrapped in quotes.
Is it possible to write out Athena query results (not the CTAS) as anything other than strings when in CSV format? The main problem with this is that the results can't be read back into the table produced by the CTAS, since those columns expect a bigint. This can, of course, be resolved with a Lambda function, but that seems like a big overhead for something that should be trivial.
2) Can you put query results (not from CTAS) directly into Parquet instead of CSV?
3) Is there any way to prevent metadata from being generated with the query results (not from CTAS)? Again, it can be cleaned up with a Lambda function, but it's just additional nonsense I need to handle.
Thanks in advance!
The data type of the result depends on the SQL used to create it and also on how you consume it. Based on your question I'm going to assume that you're creating a table using CTAS and that the output is CSV, and that you're then looking at the CSV data directly.
That CSV is going to have quotes in it, but that doesn't mean that it's not possible to read integer values as integers, and so on. Athena uses a schema-on-read approach, and as long as the serde can interpret a value as a particular type, that type will work as the type of the column.
If you query the table created by your CTAS operation you should get back integers for the integer columns.
Using CTAS you can also create output of different types, like JSON, Avro, Parquet, and ORC, that keep the type information. Just use the format property to select the output type.
I am a bit confused about what you mean in your third question. With a normal query you get two files on S3, the data file and the metadata file, and they will be written to the output location given in the StartQueryExecution API call, but with a CTAS query the output data goes to a different location (given in the SQL) than the metadata file.
Are you actually using CTAS, or are you talking about the regular query result files?
Update after the question got clarified:
1) Athena is unfortunately unable to properly read its own output in many situations. It really surprises me that this was never considered before launch. You might be able to set up a table that uses the regex serde.
2) No, unfortunately the only output of a regular query is CSV at this time.
3) No, the metadata is always written to the same prefix as the output.
I think your best bet is to run multiple CTAS queries that select subsets of your source data; if there is a date column, for example, you could make one CTAS per month or some other time range that works. After the CTAS queries have completed, you can move the result files into the same directory on S3 and create a final table that has that directory as its location.

AWS glue: ignoring spaces in JSON properties

I have a dataset with JSON files in it. Some of the property names in these JSONs contain spaces, like
{
'propertyOne': 'something',
'property Two': 'something'
}
I've had this data set crawled by several different crawlers to try and get the schema I want. For some reason, on one of my crawls the spaces were removed, but when trying to replicate the process I cannot get the spaces to be removed, and when querying in Athena I get this error:
HIVE_METASTORE_ERROR: : expected at position x in 'some string' but ' ' found instead.
Position x is the position of the space between 'property' and 'Two' in the JSON entry.
I would like to just be able to exclude this field or have the space removed when crawled, but I'm not sure how. I can't change the JSON format. Any help is appreciated.
This is actually a bug in the AWS Glue JSON classifier: it doesn't play nicely with nested properties that have spaces in them. The syntax error is in the schema generated by the crawler, not in the JSON. It generates something like this:
struct<propertyOne:string, property Two:string>
The space in "property Two" should have been escaped by the crawler. At this point, generating the DDL for the table is also not working. We are also facing this issue and are looking for workarounds.
I believe your only option, in this case, would be to create your own custom JSON classifier to select only those attributes you want the Crawler to add to the Data Catalog.
I.e. if you only want to retrieve propertyOne, you can specify the JSONPath expression $.propertyOne.
Note also that your JSON should use double quotes; the single quotes could also be causing issues when parsing the data.

Parse JSON as key value in Dataflow job

How do I parse JSON data in Apache Beam and store it in a BigQuery table?
For example: JSON data
[{ "name":"stack"},{"id":"100"}].
How do I parse the JSON data and convert it to a PCollection of key/value pairs that can be stored in a BQ table?
Appreciate your help!!
Typically you would use a built-in JSON parser in the programming language (are you using Beam's Java or Python SDK?). Then create a TableRow object and use that for the PCollection which you are writing to the BQ table.
Note: some JSON parsers disallow JSON which starts with a root list, as you have shown in your example. They tend to prefer something like this, with a root map. I believe this is the case with Python's json library.
{"name":"stack", "id":"100"}
Please see this example pipeline for how to create the PCollection and use BigqueryIO.
You may also want to consider using one of the X to BigQuery template pipelines.

How to insert multiple values (e.g. JSON-encoded) in the value parameter of the Put() method in LevelDB in C++

I have been trying to insert key-value pairs into a database using LevelDB, and it works fine with simple strings. However, if I want to store multiple attributes for a key, or for example use JSON encoding, how can it be done in C++? In the Node.js leveldb package it can be done by specifying the encoding. I really can't figure this out.
JSON is just a string, so I'm not completely sure where you're coming from here…
If you have some sort of in-memory representation of JSON you'll need to serialize it before writing it to the database and parse it when reading.
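A minimal sketch of that, assuming LevelDB plus some JSON library for the serialization step (nlohmann/json is used here purely as an example, and the key name and database path are made up):

#include <cassert>
#include <iostream>
#include <string>

#include <leveldb/db.h>
#include <nlohmann/json.hpp>  // any JSON library works; nlohmann/json is just an example

int main() {
  leveldb::DB* db = nullptr;
  leveldb::Options options;
  options.create_if_missing = true;
  leveldb::Status status = leveldb::DB::Open(options, "/tmp/testdb", &db);
  assert(status.ok());

  // Serialize the attributes to a JSON string and store it as the value.
  nlohmann::json value = {{"name", "stack"}, {"id", 100}};
  status = db->Put(leveldb::WriteOptions(), "user:1", value.dump());
  assert(status.ok());

  // Read the value back and parse it into an in-memory JSON object again.
  std::string raw;
  status = db->Get(leveldb::ReadOptions(), "user:1", &raw);
  assert(status.ok());
  nlohmann::json parsed = nlohmann::json::parse(raw);
  std::cout << parsed["name"] << " " << parsed["id"] << std::endl;

  delete db;
  return 0;
}

The value stored in LevelDB is just the serialized string; the "multiple attributes" only exist once you parse it back on the application side.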