I have a system exporting data as XML 2003 Worksheet. I need to load it to Bigquery through datafusion or any other process using GCP resources.
So
Is it possible to complete this with DataFusion
I have followed the process for XML transformation in https://www.youtube.com/watch?v=e-5K4cxwGrc&feature=youtu.be. So far I have reached a point where the header and data rows appear in different rows but same column. I am not able to parse it any further(using Wrangler) to individual columns as it just keeps isolating the json key:value pairs in different rows but same column
As I am new to datafusion, appreciate some detailed guidance.
This can be implemented using Data Fusion.
Basically, once you have the file (either uploaded directly or connecting using a source) and use the transformation XML to JSON, you can add a parsing operation for the JSON so it will be parsed into columns [1]. This will add another Transformation in the wrangler.
Additionally, I would suggest that you take a look at the documentation for Data Fusion in GCP which is very self-explanatory [2].
[1]- Column transformations -> Parse -> JSON
[2]- https://cloud.google.com/data-fusion/docs
Related
I'm trying to establish if my planned way of working is correct.
I have two data sources; a MySql & MSSQL database. I need to combine these data sources and expose this data for Power BI to consume.
I've decided to use Azure Synapse Analytics for the ETL and would like to understand if there is anything in the process I can simplify or do better.
The process is as followed:
MySql & MSSQL delta loaded into ASA as parquet format, stored in Azure Gen 2 Storage.
Once copy pipeline is complete a subsiquent data flow unions the data from the two sources and inserts into MSSQL storage in ASA.
BI Consumes from this workspace / data soruce.
I'm not sure if I should be storing from the data sources to Azure Gene 2, or I should just perform the transform and insert from the source straight into the MSSQL storage. Any thoughts or suggestions would be greatly appreciated.
The pattern that you're following is the data lake pattern, where data is moved between 3 zones:
Raw
Enriched
Curated
The Raw zone keeps an original copy of the data before transformation. The benefit of storing the data this way (as parquet files, here) is so that you can troubleshoot a problem with the transformation or create a different transformation to address a new need.
The Enriched zone is where you have done some transformation, like UNIONing your data, or providing some other clean up steps, maybe removing unneeded columns, correcting addresses, etc. You have done this by inserting the data into a SQL database, but this might also be accomplished by using views in the serverless pool, if the transformations are simple enough: https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/create-use-views
The Curated zone is a place to transform your data into a form that BI applications will do well with, i.e. a star schema. Even if this is a very simple dataset, it will be well worth incorporating a date dimension, which will yield a lot of benefits in Power BI. The bottom line here is that Power BI is optimized to work with star schemas, so that's what you should give it.
You do not need to use data lake technologies to follow this pattern and still get the benefits. As far as whether what you are doing is good will be based on how everything performs versus how simple you can keep it.
Here's more on the topic: https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/scenarios/cloud-scale-analytics/best-practices/data-lake-overview
Once copy pipeline is complete a subsiquent data flow unions the data
from the two sources and inserts into MSSQL storage in ASA
What is the use MSSQL storage ? Is it only used by PowerBI to create reports , if yes then you can use ADLS gen2 , as it will be cheaper, ( basically very in line with Mark said above as "curated"
Just one more thing to consider , PowerBI can read data from both the sources and then do the transformation within itself.
I have unstructured data in the form of document images. We are converting these documents to JSON files. I now want to have technical metadata captured for this. Can someone please give me some tips/best practices for building a data catalog on unstructured data in Google Cloud Platform?
This answer comes with the assumption that you are not using any tool to create schemas around your unstructured data and query your data, like BigQuery, Hive, Presto. And you simply want to catalog your files.
I had a similar use case, Google Data Catalog has an option to create custom entries.
Some tips on building a Data Catalog on unstructured files data:
Use meaningful file names on your JSON files. That way searching for them will become easier.
Since you are already using GCP, use their managed Data Catalog, and leverage their custom entries API to ingest the files metadata into it.
In case you also want to look for sensitive data in your JSON files, you could run DLP on them.
Use Data Catalog Tags to enrich the files metadata. The tutorial on the link shows how to do it on Big Query tables, but you can do the same on custom entries.
I would add some information about your ETL jobs that convert these documents in JSON files as Tags. Like execution time, data quality score, user, business owner, etc.
In case you are wondering how to do the step 2, I put together one script that automatically does that:
link for the GitHub. Another option is to work with Data Catalog Filesets.
So between using custom entries or filesets, I'd ask you this, do you need information about your files name?
If not then filesets might easier, since at the time of this writing it does not show any info about your files name, but are good to manage file patterns in GCS buckets: It is defined by one or more file patterns that specify a set of one or more Cloud Storage files.
The datatalog-util also has an option to enrich your filesets, in case you just want to have statistics about them, like average file size, types, etc.
I have a huge amount of log data exported from StackDriver to Google Cloud Storage. I am trying to run queries using BigQuery.
However, while creating the table in BigQuery Dataset I am getting
Invalid field name "k8s-app".
Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.
Table: bq_table
A huge amount of log data is exported from StackDriver sinks which contains a large number of unique column names. Some of these names aren't valid as per BigQuery tables.
What is the solution for this? Is there a way to query the log data without cleaning it? Using temporary tables or something else?
Note: I do not want to load(put) my data into BigQuery Storage, just to query data which is present in Google Cloud Storage.
* EDIT *
Please refer to this documentation for clear understanding
I think you can go any of these two routes based on your application:
A. Ignore Header
If the problematic field is in the header row of your logs, you can choose to ignore the header row by adding the --skip_leading_rows=1 parameter in your import command. Something like:
bq location=US load --source_format=YOURFORMAT --skip_leading_rows=1 mydataset.rawlogstable gs://mybucket/path/* 'colA:STRING,colB:STRING,..'
B. Load Raw Data
If the above is not applicable, then just simply load the data in its un-structured raw format into BigQuery. Once your data is in there, you can go about doing all sorts of stuff.
So, first create a table with a single column:
bq mk --table mydataset.rawlogstable 'data:STRING'
Now load your dataset in the table providing appropriate location:
bq --location=US load --replace --source_format=YOURFORMAT mydataset.rawlogstable gs://mybucket/path/* 'data:STRING'
Once your data is loaded, now you can process it using SQL queries, and split it based on your delimiter and skip the stuff you don't like.
C. Create External Table
If you do not want to load data into BigQuery but still want to query it, you can choose to create an external table in BigQuery:
bq --location=US mk --external_table_definition=data:STRING#CSV=gs://mybucket/path/* mydataset.rawlogstable
Querying Data
If you pick option A and it works for you, you can simply choose to query your data the way you were already doing.
In the case you pick B or C, your table now has rows from your dataset as singular column rows. You can now choose to split these singular column rows into multiple column rows, based on your delimiter requirements.
Let's say your rows should have 3 columns named a,b and c:
a1,b1,c1
a2,b2,c2
Right now its all in the form of a singular column named data, which you can separate by the delimiter ,:
select
splitted[safe_offset(0)] as a,
splitted[safe_offset(1)] as b,
splitted[safe_offset(2)] as c
from (select split(data, ',') as splitted from `mydataset.rawlogstable`)
Hope it helps.
To expand on #khan's answer:
If the files are JSON, then you won't be able to use the first method (skip headers).
But you can load each JSON row raw to BigQuery - as if it was a CSV - and then parse in BigQuery
Find a full example for loading rows raw at:
https://medium.com/google-cloud/bigquery-lazy-data-loading-ddl-dml-partitions-and-half-a-trillion-wikipedia-pageviews-cd3eacd657b6
And then you can use JSON_EXTRACT_SCALAR to parse JSON in BigQuery - and transform the existing field names into BigQuery compatible ones.
Unfortunately no!
As part of log analytics, it is common to reshape the log data and run few ETL's before the files are committed to a persistent sink such as BigQuery.
If performance monitoring is all you need for log analytics, and there is no rationale to create additional code for ETL, all metrics can be derived from REST API endpoints of stackdriver monitoring.
If you do not need fields containing - you can set up to ignore ignore_unknown_values. You have to provide the schema you want and using ignore_unknown_values any field not matching the schema will be ignored.
I am facing some challenge to create multiple PDF files in informatica 10.2.0, Please find the details below:
Requirement : - We need to spilt single xml file into multiple PDF files based on condition.
Tools used : - Informatica Powercenter, Informatica Developer.
Challenge : - I have created data processor in informatica developer and used this as service in informatica powercenter and created single PDF file
But not able to create multiple PDF files with this service. I have used sorter to sort the file based on my condition to split the file and then used transaction control to commit records based on the condition used and passing these records to the UDT transformation (calling service in this transformation) and then passing the output Buffer port to target.
Here in UDT transformation I have given Input type and Output type as 'file' in UDT settings tab in mapping.
Could anyone please provide suggestion to achieve a solution for this technical challenge.
create a variable to identify when a change or new xml is processed and use this to issue commit/continue in the Transaction control.
I think the transaction control is probably the wrong route to use here - try this https://kb.informatica.com/howto/6/Pages/1/154788.aspx
I need to build a data pipeline which takes input from a CSV file (stored on S3) and "updates" records in the Aurora RDS table. I understand the standard format (out of the box template) for bulk record insertion, but for the records update or deletion, is there any standard way to have those statements in the SqlActivity?
I can write an update statement, but then the way CSV inputs are referenced, they are just question marks (?) without any liberty to index a column.
Let me know if data pipeline can be used in this way? If yes any specific way I can refer CSV columns? Thanks in advance!
You will need to do some preprocessing of your CSV to a SQL script containing your bulk updates and then invoke the SqlActivity with a reference to your script.
If you have inserts you might be able to perform this by using the following:
CopyActivity (http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-copyactivity.html) which takes:
S3DataNode as an input
SqlDataNode as the output.
If performance is not a concern then this is the closest you can get to an out of the box transport using AWS Data Pipeline.
You can refer to the AWS Data Pipeline docs (http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html) for more information.