I am capturing Event Hub logs into a storage account, but they show up in Avro format. I am expecting JSON - azure-eventhub

I am capturing Event Hub logs into a storage account for cold storage, but they are written in Avro format. I want them to be logged in JSON format. Please suggest how this can be done, without an extra step to convert Avro to JSON.

No, you cannot. Event Hubs Capture writes Avro by design and the format cannot be changed. The only option is to convert the Avro files to JSON afterwards.
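If a post-processing step turns out to be acceptable, the conversion itself is small. Here is a minimal sketch using the fastavro package; the file names are hypothetical and it assumes each captured event body is itself UTF-8 encoded JSON:

    # Minimal sketch: convert an Event Hubs Capture Avro file to newline-delimited JSON.
    # Assumes the "Body" field of each capture record contains UTF-8 JSON (hypothetical file names).
    import json
    import fastavro

    with open("capture.avro", "rb") as src, open("capture.json", "w") as dst:
        for record in fastavro.reader(src):
            body = json.loads(record["Body"].decode("utf-8"))  # the original event payload
            dst.write(json.dumps(body) + "\n")                 # one JSON object per line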

Best way to get Pub/Sub JSON data into BigQuery

I am currently trying to ingest numerous types of Pub/Sub data (JSON format) into GCS and BigQuery using a Cloud Function. I am just wondering what is the best way to approach this?
At the moment I am just dumping the events to GCS (each event type is on its own directory path) and was trying to create an external table, but there are issues since the JSON isn't newline delimited.
Would it be better just to write the data as JSON strings in BQ and do the parsing in BQ?
BigQuery now has a native JSON data type, which makes it much easier to query JSON values. It could be the solution if you store your events directly in BigQuery.
As for your question about Cloud Functions, it depends. If you have few events, Cloud Functions are great and not very expensive.
If you have a higher event rate, Cloud Run can be a good alternative to leverage concurrency and keep costs low.
If you have millions of events per hour or per minute, consider Dataflow with the Pub/Sub to BigQuery template.
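For the low-volume Cloud Function path, a minimal sketch could look like the following; the dataset, table, and column names are assumptions, and it stores each message in a BigQuery JSON column via PARSE_JSON:

    # Minimal sketch of a background Cloud Function triggered by Pub/Sub.
    # Assumes a table my_dataset.events with a single JSON column named payload (hypothetical names).
    import base64
    from google.cloud import bigquery

    client = bigquery.Client()

    def pubsub_to_bq(event, context):
        payload = base64.b64decode(event["data"]).decode("utf-8")  # raw JSON string from Pub/Sub
        job = client.query(
            "INSERT INTO `my_dataset.events` (payload) VALUES (PARSE_JSON(@payload))",
            job_config=bigquery.QueryJobConfig(
                query_parameters=[bigquery.ScalarQueryParameter("payload", "STRING", payload)]
            ),
        )
        job.result()  # wait for the insert so failures surface as function errors

Per-message INSERT statements are fine at a low event rate; at higher volumes, streaming inserts, the Storage Write API, or the Dataflow template mentioned above are the better fit.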

Glue Crawler does not recognize Timestamps

I have JSON files in an S3 bucket that may change their schema from time to time. To be able to analyze the data I want to run a Glue crawler on them periodically; the analysis in Athena works in general.
Problem: my timestamp string is not recognized as a timestamp.
The timestamps currently have the following format: 2020-04-06T10:37:38+00:00, but I have also tried others, e.g. 2020-04-06 10:37:38 - I have control over this and can adjust the format.
The suggestion to set the SerDe parameters might not work for my application; I want the schema to be recognized completely and not have to define each field individually. (AWS Glue: Crawler does not recognize Timestamp columns in CSV format)
Manual adjustments to the table are generally not wanted; I would like to deploy Glue automatically within a CloudFormation stack.
Do you have an idea what else I can try?
This is a very common problem. The way we got around it when reading text/JSON files was to add an extra step in between that casts the columns to the proper data types. The crawler's type inference is a bit iffy sometimes, since it is based on the data sample available at that point in time.
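That extra step could be a small Glue/PySpark job; the paths and column name below are hypothetical, and it assumes the timestamp strings are in an ISO-8601 format Spark can cast directly:

    # Minimal sketch of the extra casting step: read the raw JSON, cast the string
    # column to a real timestamp, and write a typed copy for Athena (hypothetical names).
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import to_timestamp

    spark = SparkSession.builder.appName("cast-types").getOrCreate()

    df = spark.read.json("s3://my-bucket/raw/")                    # the data the crawler sees
    df = df.withColumn("created_at", to_timestamp("created_at"))   # ISO-8601 strings cast directly
    df.write.mode("overwrite").parquet("s3://my-bucket/curated/")  # typed, columnar output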

BigQuery can't query some CSVs in a Cloud Storage bucket

I created a permanent BigQuery table that reads some CSV files from a Cloud Storage bucket sharing the same prefix name (filename*.csv) and the same schema.
There are some CSVs, however, that make BigQuery queries fail with a message like the following: "Error while reading table: xxxx.xxxx.xxx, error message: CSV table references column position 5, but line starting at position:10 contains only 2 columns."
By moving the CSVs out of the bucket one by one, I identified the one responsible.
This CSV file doesn't have 10 lines...
I found this ticket, BigQuery error when loading csv file from Google Cloud Storage, so I thought the issue was having an empty line at the end. But other CSVs in my bucket have one too, so this can't be the reason.
On the other hand, this CSV is the only one with content type text/csv; charset=utf-8; all the others are text/csv, application/vnd.ms-excel, or application/octet-stream.
Furthermore, after downloading this CSV to my local Windows machine and uploading it again to Cloud Storage, the content type is automatically converted to application/vnd.ms-excel.
Then, even with the missing line, BigQuery can query the permanent table based on filename*.csv.
Is it possible that BigQuery has issues querying CSVs with UTF-8 encoding, or is it just coincidence?
Use Google Cloud Dataprep to load your CSV file. Once the file is loaded, analyze the data and clean it if required.
Once all the rows are cleaned, you can then sink that data into BigQuery.
Dataprep is a GUI-based ETL tool and it runs a Dataflow job internally.
Do let me know if any more clarification is required.
Just to remark on the issue: the CSV file had gzip set as its content encoding, which is why BigQuery didn't interpret it as a plain CSV file.
According to the documentation, BigQuery expects CSV data to be UTF-8 encoded:
"encoding": "UTF-8"
In addition, since this issue is related to the metadata of the files in GCS, you can edit the metadata directly from the Cloud Console.
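It can also be done programmatically; here is a minimal sketch with the google-cloud-storage client (bucket and object names are hypothetical). Note that if the object's bytes are actually gzip-compressed, clearing the metadata alone is not enough and the file would need to be decompressed and re-uploaded:

    # Minimal sketch: inspect and fix the metadata of the problematic object in GCS
    # (hypothetical bucket/object names).
    from google.cloud import storage

    client = storage.Client()
    blob = client.bucket("my-bucket").get_blob("path/to/problem.csv")

    print(blob.content_type, blob.content_encoding)  # e.g. "text/csv; charset=utf-8", "gzip"

    blob.content_type = "text/csv"
    blob.content_encoding = None  # drop the gzip content encoding
    blob.patch()                  # push the metadata change to Cloud Storage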

AWS Glue Crawler - Not picking up Timestamp column correctly (always defined as string)

I've set up an AWS Glue crawler to index a set of bucketed CSV files in S3 (which then creates an Athena DB).
My timestamp is in "Java" format - as defined in the documentation, for example:
2019-03-07 14:07:17.651795
I've tried creating a custom classifier (and a new crawler) yet this column keeps being detected as a "string" and not a "timestamp".
I'm at a loss as to why Athena/Glue won't detect this as a timestamp.
I think the problem may be due to the fractional seconds in the timestamp. I found this StackOverflow answer that contains the patterns recognized as timestamps by Glue (but I haven't found where the patterns come from; I can't find them in the Glue docs).
You might have better luck using a custom classifier to make it understand your timestamp format.
I don't know how much that will help you, since you also have to convince Athena to parse your timestamps. You might be better off letting Glue classify them as strings and creating a view where you use DATE_PARSE to convert the strings to timestamps.
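Such a view could look like the following; the database, table, column, and output-location names are assumptions, and the view is created through boto3:

    # Minimal sketch: create an Athena view that parses the string column into a timestamp
    # with date_parse (hypothetical database/table/column names and output location).
    import boto3

    create_view = """
    CREATE OR REPLACE VIEW my_db.events_typed AS
    SELECT
        date_parse(event_time, '%Y-%m-%d %H:%i:%s.%f') AS event_ts,  -- string -> timestamp
        *
    FROM my_db.events_raw
    """

    boto3.client("athena").start_query_execution(
        QueryString=create_view,
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )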

Querying timestamp data in Athena when the timestamp format in the underlying JSON files has changed

I'm querying data in AWS Athena from JSON files stored in S3. I've loaded all the JSON files into Athena using AWS Glue, and it's been working perfectly so far. However, the timestamp formatting has changed in the JSON files from
2018-03-23 15:00:30.998
to
2018-08-29T07:59:50.568Z
So the table ends up having entries like this:
2018-08-29T07:59:42.803Z
2018-08-29T07:59:42.802Z
2018-08-29T07:59:32.500Z
2018-03-23 15:03:43.232
2018-03-23 15:03:44.697
2018-03-23 15:04:11.951
This results in parsing errors when I try to run queries against the full DB.
How do I accommodate this in AWS Glue (or Athena), so I don't have to split up the data when querying? I've tried looking into custom classifiers, but I'm unsure of how to use them in this particular case.
Thanks in advance.
Unfortunately you have to unify the data. If you decide to use the "2018-08-29T07:59:50.568Z" format, you can read such data by using the org.apache.hive.hcatalog.data.JsonSerDe library with the following SerDe property: 'timestamp.formats'='yyyy-MM-dd\'T\'HH:mm:ss.SSSZ'
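For reference, a table definition with that property could look roughly like this (the table, columns, and S3 locations are hypothetical), submitted here through boto3:

    # Minimal sketch: CREATE EXTERNAL TABLE using the Hive JsonSerDe with timestamp.formats
    # (hypothetical table/column names, bucket paths, and output location).
    import boto3

    ddl = r"""
    CREATE EXTERNAL TABLE my_db.events (
        id        string,
        event_ts  timestamp
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    WITH SERDEPROPERTIES (
        'timestamp.formats' = 'yyyy-MM-dd\'T\'HH:mm:ss.SSSZ'
    )
    LOCATION 's3://my-bucket/events/'
    """

    boto3.client("athena").start_query_execution(
        QueryString=ddl,
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )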