What format is GCP's "LongRunningRecognize" output data in? - google-cloud-platform

I'm using Google Cloud's sample_long_running_recognize() to get audio transcripts as well as speaker diarization, but the output that it gives me is in LongRunningRecognizeResponse format which looks very similar to JSON, but not quite. How can I export the output of the LongRunningRecognizeResponse so I can put it in a pandas df?
I've tried to export it with
out = open(audio_in_file_path + "outputs/" + audio_in_file_name + "_out.json" , "w+")
out.write(response) # response is the output fyi
out.close()
but the format that the data is in is not actually JSON, so it messes up everything. I'm able to inspect the data in the console by accessing the objects inside response with something like response.results[1].alternatives[0], but I would much rather have it in a df.
Thanks in advance!

Indeed the data is similar in structure to JSON, since the Cloud Speech-to-Text API response is in JSON format.
However, the Python client library creates a Python object [1] from the JSON response, which is harder to serialize back to JSON automatically. You can build a JSON-serializable object that is easy to save by iterating through the structure of the LongRunningRecognizeResponse. For example, like this:
import json

rows = []
for res in response.results:
    for alt in res.alternatives:
        rows.append({"transcript": alt.transcript, "confidence": alt.confidence})

with open("results.json", "w+") as file:
    json.dump(rows, file)
[1] https://googleapis.dev/python/speech/latest/gapic/v1/types.html#google.cloud.speech_v1.types.LongRunningRecognizeResponse
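If the end goal is a pandas DataFrame rather than a JSON file, the same rows list can be loaded directly; a minimal sketch, assuming pandas is installed:

import pandas as pd

# rows is the list of dicts built above from the LongRunningRecognizeResponse
df = pd.DataFrame(rows)
print(df.head())

Each dictionary key (transcript, confidence) becomes a DataFrame column.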

Related

How to structure input/formats for batch inference in SageMaker?

The example provided in the AWS documentation, https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html, states that the input CSV can be structured like the sample below. I noticed that batch jobs in SageMaker can accept JSON as well. How should the JSON be structured? Does each record need to be on a single line, as shown in the CSV example, or can it be multiline?
Record1-Attribute1, Record1-Attribute2, Record1-Attribute3, ..., Record1-AttributeM
...
It is recommended to use JSON Lines (i.e. each JSON object on its own line). You can then set BatchStrategy to MultiRecord and SplitType to Line.
Batch Transform can then fit as many records as possible into each mini-batch within the MaxPayloadInMB limit.
Kindly see the CreateTransformJob API for more information.
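For illustration, here is a minimal boto3 sketch of such a transform job; the job name, model name, S3 URIs, instance type and content type below are placeholders, not values from the question:

import boto3

sm = boto3.client("sagemaker")

# Each record sits on its own line (JSON Lines), so SplitType="Line" lets
# Batch Transform pack many records into a single mini-batch.
sm.create_transform_job(
    TransformJobName="my-batch-job",               # placeholder
    ModelName="my-model",                          # placeholder
    BatchStrategy="MultiRecord",
    MaxPayloadInMB=6,
    TransformInput={
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://my-bucket/input/records.jsonl",  # placeholder
            }
        },
        "ContentType": "application/jsonlines",
        "SplitType": "Line",
    },
    TransformOutput={"S3OutputPath": "s3://my-bucket/output/"},  # placeholder
    TransformResources={"InstanceType": "ml.m5.large", "InstanceCount": 1},
)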

How to retrieve multiple NDJSON objects from the same file using ArduinoJson?

I am trying to use ArduinoJson to parse Google's quickdraw dataset, which contains .ndjson files with multiple objects inside. I figured out how to retrieve the first of the objects in the file using the following simple code:
DeserializationError deserialization_error = ArduinoJson::deserializeJson(doc, as_cstr);
if (deserialization_error) {
    printf("deserializeJson() failed: %s\n", deserialization_error.c_str());
}
However, this only parses the first object in the ndjson file.
According to the website, I get the sense that something else should happen automatically:
NDJSON, JSON Lines
When parsing a JSON document from an input stream, ArduinoJson stops reading as soon as the document ends (e.g., at the closing brace).
This feature allows to read JSON documents one after the other; for example, it allows to read line-delimited formats like NDJSON or JSON Lines.
{"event":"add_to_cart"}
{"event":"purchase"}
Is there some way to get the byte length of the parsed object so I can continue using the cstring to parse consecutive objects? I did print out the cstring and it does contain the entirety of the ndjson file.
I found it. Just call deserializeJson() multiple times:
DeserializationError error = deserializeJson(doc, sceneFile);
or:
deserializeJson(docline1, sceneFile);
deserializeJson(docline2, sceneFile);
deserializeJson(docline3, sceneFile);

PySpark Write Parquet Binary Column with Stats (signed-min-max.enabled)

I found this apache-parquet ticket https://issues.apache.org/jira/browse/PARQUET-686 which is marked as resolved for parquet-mr 1.8.2. The feature I want is the calculated min/max in the parquet metadata for a (string or BINARY) column.
And referencing this is an email https://lists.apache.org/thread.html/%3CCANPCBc2UPm+oZFfP9oT8gPKh_v0_BF0jVEuf=Q3d-5=ugxSFbQ#mail.gmail.com%3E
which uses Java rather than PySpark in its example:
Configuration conf = new Configuration();
conf.set("parquet.strings.signed-min-max.enabled", "true");
Path inputPath = new Path(input);
FileStatus inputFileStatus =
    inputPath.getFileSystem(conf).getFileStatus(inputPath);
List<Footer> footers = ParquetFileReader.readFooters(conf, inputFileStatus, false);
I've been unable to set this value in PySpark (perhaps I'm setting it in the wrong place?).
Example dataframe:
import random
import string
from pyspark.sql.types import StringType
r = []
for x in range(2000):
    r.append(u''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(10)))
df = spark.createDataFrame(r, StringType())
I've tried a few different ways of setting this option:
df.write.format("parquet").option("parquet.strings.signed-min-max.enabled", "true").save("s3a://test.bucket/option")
df.write.option("parquet.strings.signed-min-max.enabled", "true").parquet("s3a://test.bucket/option")
df.write.option("parquet.strings.signed-min-max.enabled", True).parquet("s3a://test.bucket/option")
But all of the saved parquet files are missing the ST/STATS for the BINARY column. Here is an example output of the metadata from one of the parquet files:
creator: parquet-mr version 1.8.3 (build aef7230e114214b7cc962a8f3fc5aeed6ce80828)
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"value","type":"string","nullable":true,"metadata":{}}]}
file schema: spark_schema
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
value: OPTIONAL BINARY O:UTF8 R:0 D:1
row group 1: RC:33 TS:515
---------------------------------------------------------------------------------------------------
Also, based on this email chain https://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3C9DEF4C39-DFC2-411B-8987-5B9C33842974#videoamp.com%3E and the question Specify Parquet properties pyspark,
I tried sneaking the config in through the PySpark private API:
spark.sparkContext._jsc.hadoopConfiguration().setBoolean("parquet.strings.signed-min-max.enabled", True)
So I am still unable to set this conf parquet.strings.signed-min-max.enabled in parquet-mr (or it is set, but something else has gone wrong).
Is it possible to configure parquet-mr from PySpark?
Does PySpark 2.3.x support BINARY column stats?
How do I take advantage of the PARQUET-686 feature to add min/max metadata for string columns in a Parquet file?
Since historically Parquet writers wrote wrong min/max values for UTF-8 strings, new Parquet implementations skip those stats during reading, unless parquet.strings.signed-min-max.enabled is set. So this setting is a read option that tells the Parquet library to trust the min/max values in spite of their known deficiency. The only case when this setting can be safely enabled is if the strings only contain ASCII characters, because the corresponding bytes for those will never be negative.
Since you use parquet-tools for dumping the statistics and parquet-tools itself uses the Parquet library, it will ignore string min/max statistics by default. Although it seems that there are no min/max values in the file, in reality they are there, but get ignored.
The proper solution for this problem is PARQUET-1025, which introduces new statistics fields min-value and max-value. These handle UTF-8 strings correctly.
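As a side note on checking whether statistics actually made it into a file, pyarrow (not used in the question, mentioned here only as an alternative to parquet-tools) exposes the footer statistics directly; a minimal sketch with a placeholder file name:

import pyarrow.parquet as pq

# Read the footer of a single Parquet file and print the statistics of the
# first column in the first row group.
md = pq.ParquetFile("part-00000.parquet").metadata   # placeholder file name
stats = md.row_group(0).column(0).statistics
print(stats.has_min_max, stats.min, stats.max)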

What's caffe's input format?

I'm trying to use Caffe for audio recognition, but can't find any documentation for its input format.
I want to use leveldb, so I must create a key and a value for each record, i.e. a pair of a label string and a data byte array.
No document seems to describe this; I found that the value is written by Datum.SerializeToString(), but I couldn't find where Datum is defined and got lost.
Does anyone know how to convert non-image records into leveldb records for Caffe? Thanks!
leveldb, lmdb and HDF5 are currently the main formats for feeding data into Caffe. The MemoryData layer enables in-memory input as well, so it's possible to use whatever input format you like and populate the data blobs through Caffe's Python or C++ interfaces.
If you're already set on leveldb, this discussion on caffe issues could be useful.
Below is an example of populating a leveldb with Python. It requires pycaffe and plyvel. It's adapted from caffe's github issues posted by Zackory. It's not specific to images, as long as you represent each example as a CxHxW array where any or all dimensions can be equal to 1:
import caffe
import plyvel

db = plyvel.DB('train_leveldb/', create_if_missing=True, error_if_exists=True, write_buffer_size=268435456)
wb = db.write_batch()
count = 0
for file in dataset:
    mat = ...  # load numpy array from file (CxHxW)
    # Load matrix into datum object
    datum = caffe.io.array_to_datum(mat)
    wb.put('%08d_%s' % (count, file), datum.SerializeToString())
    count += 1
    # Write to db in regular intervals
    if count % 1000 == 0:
        # Write batch of images to database
        wb.write()
        del wb
        wb = db.write_batch()
# Write last batch of images
if count % 1000 != 0:
    wb.write()
I find constructing lmdb a lot simpler. lmdb example here.
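As a rough illustration of that point, here is a minimal lmdb sketch along the same lines (this is not the linked example; it assumes the lmdb package is installed and that dataset is a list of .npy file paths holding CxHxW arrays):

import caffe
import lmdb
import numpy as np

# map_size is the maximum size of the database in bytes (placeholder: ~1 GB).
env = lmdb.open('train_lmdb', map_size=2 ** 30)
with env.begin(write=True) as txn:
    for count, file in enumerate(dataset):
        mat = np.load(file)                     # assumed: each file is a CxHxW .npy array
        datum = caffe.io.array_to_datum(mat)    # same Datum conversion as the leveldb example
        txn.put(('%08d_%s' % (count, file)).encode(), datum.SerializeToString())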
The Datum object is defined with protobuf. See here:
https://github.com/BVLC/caffe/blob/master/src/caffe/proto/caffe.proto#L30-L41
The protobuf compiler generates a file caffe.pb.h in .build_release/src/caffe/proto containing the class Datum. You can have a look there to understand how this object works.
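To make that structure concrete, here is a small sketch of filling a Datum by hand from Python for non-image data; it assumes pycaffe is installed and uses the field names from the caffe.proto excerpt linked above (the array shape and label are made up for illustration):

import numpy as np
from caffe.proto import caffe_pb2

arr = np.random.rand(1, 1, 64).astype(np.float32)    # e.g. a 1x1x64 audio feature vector

datum = caffe_pb2.Datum()
datum.channels, datum.height, datum.width = (int(d) for d in arr.shape)
datum.float_data.extend(arr.flatten().tolist())       # float_data for floating-point inputs
datum.label = 0                                        # integer class label

value = datum.SerializeToString()                      # the byte string stored as the leveldb/lmdb value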

Export stata graph (data) to Excel?

Is there a simple way to export the "underlying" data of a Stata graph in order to reproduce that graph in MS Excel? Imagine you create a ROC curve using roctab y yhat, graph and you want to reproduce that graph in Excel.
I assume that you do not have access to the actual raw data that was used to compile the .gph in the first place, and somehow want to reverse engineer the .gph file... then, eek, good luck!
If you do, however, have access to the data originally used, then you can use the putexcel command, new in Stata 13.
A more detailed description of the putexcel command can be found here: Stata press release on exporting tables to Excel.
The data in the .gph file are stored in the serset format between the serset begin and end tags. There's no utility I know of that will parse the serset information, but it is very similar to Stata's dta file format (v115 and below). I wrote up the basic file format information here. The Python library pandas has code for reading/writing dta files, so with that you could probably create your own serset reader/writer.
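If you do end up with the underlying data in an ordinary .dta file (rather than inside the .gph), here is a tiny pandas sketch for getting it into Excel, assuming pandas and openpyxl are installed and using a made-up file name:

import pandas as pd

# pandas reads the classic Stata .dta layout, which the serset format resembles.
df = pd.read_stata("roc_data.dta")        # placeholder file name
df.to_excel("roc_data.xlsx", index=False)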