I want to load a table from a file and a variable. The file schema is not the same as the table being loaded, so the extra columns need to be filled from a variable inside a stored procedure.
In the example below, pty is not part of the CSV file, while the other two columns, mt and de, are.
set pty = 'sss';
LOAD DATA INTO `###.Tablename`
(
pty STRING ,
mt INTEGER ,
de INTEGER
)
FROM FILES
(
format='CSV',
skip_leading_rows=1,
uris = ['gs://###.csv']
);
I think you can do that in 2 steps with 2 queries:
LOAD DATA INTO `###.Tablename`
FROM FILES
(
format='CSV',
skip_leading_rows=1,
uris = ['gs://###.csv']
);
update `###.Tablename`
set pty = "sss"
where pty is null;
If it's complicated for you to apply your logic with BigQuery and SQL, you can also create a Python script with the Google BigQuery client and Google Cloud Storage client (a rough sketch follows this list). Your script:
Loads the CSV file
Transforms the results into a list of dicts
Adds extra fields to each dict with your code logic
Loads the resulting dicts into BigQuery
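A minimal sketch of that flow, assuming the google-cloud-storage and google-cloud-bigquery client libraries; the bucket, blob and table names below are placeholders, and the pty value and column names come from your example:
import csv
import io
from google.cloud import bigquery, storage

# Placeholder names; replace with your project, dataset, table and bucket.
TABLE_ID = "project.dataset.Tablename"
BUCKET_NAME = "my-bucket"
BLOB_NAME = "data.csv"
PTY_VALUE = "sss"  # the value that is not present in the CSV file

storage_client = storage.Client()
bq_client = bigquery.Client()

# 1. Load the CSV file from Cloud Storage as text.
blob = storage_client.bucket(BUCKET_NAME).blob(BLOB_NAME)
csv_text = blob.download_as_text()

# 2. Transform the rows into a list of dicts and add the extra field.
rows = []
for record in csv.DictReader(io.StringIO(csv_text)):
    rows.append({
        "pty": PTY_VALUE,         # extra column filled from the variable
        "mt": int(record["mt"]),
        "de": int(record["de"]),
    })

# 3. Stream the dicts into the BigQuery table.
errors = bq_client.insert_rows_json(TABLE_ID, rows)
if errors:
    print(errors)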
I'm trying to create a table in AWS Athena query editor using this statement:
CREATE EXTERNAL TABLE IF NOT EXISTS somedb.sometable (
meta string,
content string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'input.regex' = "\"([^\\n]*)\\n([^>]*)\"g"
)
LOCATION 's3://some-location/';
The file I'm trying to process looks something like this:
>some metadata
content line
content line
content line
content line
>some more metadata
content line
content line
more content lines
The goal is to create a table with two columns, one being the metadata and the other being the multiline content described below the metadata. The regex was tested using regex101 and seems to work properly.
The problem is that querying the data returns empty rows. Executing SELECT count(*) FROM "somedb"."sometable" returns the same number as there are lines in the file being processed. In my case the file has 63000 lines and the count query returns 63000, but each row contains no data.
I also tried to create the table using the Athena table wizard, with the same result.
I am faced with the following problem, and I am a newbie to cloud computing and databases. I want to set up a simple dashboard for an application. Basically I want to replicate this site, which shows data about air pollution: https://airtube.info/
What I need to do, as I see it, is the following (a rough sketch of the first two steps follows this list):
Download data from the API: https://github.com/opendata-stuttgart/meta/wiki/EN-APIs, and I have this link in mind in particular: "https://data.sensor.community/static/v2/data.1h.json - average of all measurements per sensor of the last hour." (Technology: Python bot)
Set up a bot to transform the data a little to tailor it to our needs. (Technology: Python)
Upload the data to a database. (Technology: Google BigQuery or AWS)
Connect the database to a visualization tool so everyone can see it on our webpage. (Technology: probably Dash in Python)
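For illustration only, a rough sketch of what steps 1 and 2 could look like, using the requests library; the field selection is just an example:
import requests

URL = "https://data.sensor.community/static/v2/data.1h.json"

# Step 1: download the hourly averages per sensor.
measurements = requests.get(URL, timeout=60).json()

# Step 2: trim each record to the fields needed for the dashboard (example selection).
rows = []
for m in measurements:
    rows.append({
        "sensor_id": m["sensor"]["id"],
        "timestamp": m["timestamp"],
        "latitude": m["location"]["latitude"],
        "longitude": m["location"]["longitude"],
        "values": {v["value_type"]: v["value"] for v in m["sensordatavalues"]},
    })

print(len(rows), "sensor records prepared for upload")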
My questions are the following.
1. Do you agree with my thought process or you would change some element to make it more efficient?
2. What do you think about running a python script to transform the data? Is there any simpler idea?
3. Which technology would you suggest to set up the database?
Thank you for the comments!
Best regards,
Bartek
If you want to do some analysis on your data, I recommend uploading the data to BigQuery; once this is done, you can create new queries there and get the results you want to analyze. I was checking the dataset "data.1h.json" and I would create a table in BigQuery using a schema like this one:
CREATE TABLE dataset.pollution
(
id NUMERIC,
sampling_rate STRING,
timestamp TIMESTAMP,
location STRUCT<
id NUMERIC,
latitude FLOAT64,
longitude FLOAT64,
altitude FLOAT64,
country STRING,
exact_location INT64,
indoor INT64
>,
sensor STRUCT<
id NUMERIC,
pin STRING,
sensor_type STRUCT<
id INT64,
name STRING,
manufacturer STRING
>
>,
sensordatavalues ARRAY<STRUCT<
id NUMERIC,
value FLOAT64,
value_type STRING
>>
)
OK, we have already created our table, so now we need to insert all the data from the JSON file into it. To do that, and since you want to use Python, I would use the BigQuery Python client library [1] to read the data from a bucket in Google Cloud Storage [2], where the file has to be stored, and transform the data to upload it to the BigQuery table.
The code would be something like this:
from google.cloud import storage
import json
from google.cloud import bigquery
client = bigquery.Client()
table_id = "project.dataset.pollution"
# Instantiate a Google Cloud Storage client and specify the required bucket and file
storage_client = storage.Client()
bucket = storage_client.get_bucket('bucket')
blob = bucket.blob('folder/data.1h.json')
table = client.get_table(table_id)
# Download the contents of the blob as a string and parse it using the json.loads() method
data = json.loads(blob.download_as_string(client=None))
# Partition the request in order to avoid reach quotas
partition = len(data) // 4  # rows per batch (insert in chunks to stay under request-size quotas)
cont = 0
data_aux = []
for part in data:
    if cont >= partition:
        errors = client.insert_rows(table, data_aux)  # Make an API request.
        if errors == []:
            print("New rows have been added.")
        else:
            print(errors)
        cont = 0
        data_aux = []
    # Avoid empty values (clean data)
    if part['location']['altitude'] == "":
        part['location']['altitude'] = 0
    if part['location']['latitude'] == "":
        part['location']['latitude'] = 0
    if part['location']['longitude'] == "":
        part['location']['longitude'] = 0
    data_aux.append(part)
    cont += 1

# Insert any rows left over after the last full batch
if data_aux:
    errors = client.insert_rows(table, data_aux)
    if errors == []:
        print("New rows have been added.")
    else:
        print(errors)
As you can see above, I had to partition the inserts in order to avoid hitting a quota on the size of the request. Here you can see the quota limits to keep in mind [3].
Also, some data in the location field has empty values, so it is necessary to handle them to avoid errors.
And since you already have your data stored in BigQuery, to create a new dashboard I would use the Data Studio tool [4] to visualize your BigQuery data and build queries over the columns you want to display.
[1] https://cloud.google.com/bigquery/docs/reference/libraries#using_the_client_library
[2] https://cloud.google.com/storage
[3] https://cloud.google.com/bigquery/quotas
[4] https://cloud.google.com/bigquery/docs/visualize-data-studio
I have an Oracle table with columns like Document (type BLOB), Extension (VARCHAR2(10), with values like .pdf, .doc) and Document Description (VARCHAR2(100)). I want to export this data and provide it to my customer.
Can this be done in kettle ?
Thanks
I have an MSSQL database that stores images in a BLOB column, and I found a way to export these to disk using a dynamic SQL step.
First, select only the columns necessary to build a file name and SQL statement (id, username, record date, etc.). Then I use a Modified Javascript Value step to create both the output filename (minus the file extension):
var outputPath = '/var/output/';
var filename = outputPath + username + '_' + record_date;
// --> '/var/output/joe_20181121'
and the dynamic SQL statement:
var blob_query = "SELECT blob_column FROM dbo.table WHERE id = '" + id + "'";
Then, after using a select to reduce the field count to just the filename and blob_query, I use a Dynamic SQL row step (with "Outer Join" selected) to retrieve the blob from the database.
The last step is to output to a file using a Text file output step. It allows you to supply a file name from a field and give it a file extension to append. On the Content tab, all boxes are unchecked, the Format is "no new-line term" and the Compression is "None". The only field exported is the "blob_column" returned from the dynamic SQL step, and its type should be "binary".
Obviously, this is MUCH slower than other table/SQL operations due to the dynamic SQL step making individual database connections for each row... but it works.
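If scripting this outside Kettle is acceptable, the same per-row idea (build a file name, fetch the BLOB, write it out as binary) can be sketched in Python with pyodbc; the connection string, table and column names below are placeholders modeled on the example above:
import pyodbc

# Placeholder connection string; adjust driver, server and credentials.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=host;DATABASE=db;UID=user;PWD=pass"
)
cursor = conn.cursor()

output_path = '/var/output/'

# One query for the metadata needed to build the file names...
cursor.execute("SELECT id, username, record_date FROM dbo.your_table")
for row_id, username, record_date in cursor.fetchall():
    # Append the appropriate file extension here if it is known.
    filename = output_path + str(username) + '_' + str(record_date)

    # ...then one query per row to fetch the BLOB itself (slow, as noted above).
    blob_cursor = conn.cursor()
    blob_cursor.execute("SELECT blob_column FROM dbo.your_table WHERE id = ?", row_id)
    blob_bytes = blob_cursor.fetchone()[0]

    with open(filename, "wb") as f:
        f.write(blob_bytes)

conn.close()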
Good luck!
I am writing a unit test for appending data to a CSV file on a data lake. I want to test it by finding my test data appended to the same file, and once I find it I want to delete the row I inserted. Basically, once I find the test data my test will pass, but because the tests run in production I have to search for my test data, i.e. find the row I inserted in the file and delete it after the test has run.
I want to do this without using U-SQL in order to avoid the cost involved. What are the other possible ways to do it?
You cannot delete a row (or any part) from a file. Azure data lake store is an append-only file system. Data once committed cannot be erased or updated. If you're testing in production, your application needs to be aware of test rows and ignore them appropriately.
The other choice is to read all the rows in U-SQL and then write an output excluding the test rows.
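For illustration, the filter-and-rewrite idea in plain Python over a downloaded copy of the file, assuming a hypothetical is_test marker column:
import csv

# Hypothetical layout: the CSV has a header row and an "is_test" marker column.
with open("data.csv", newline="") as src, open("data_clean.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if row["is_test"] == "1":
            continue  # skip the rows written by the test run
        writer.writerow(row)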
Like other big data analytics platforms, ADLA / U-SQL does not support appending to files per se. What you can do is take an input file, append some content to it (e.g. via U-SQL) and write it out as another file. A simple example:
DECLARE @inputFilepath string = "input/input79.txt";
DECLARE @outputFilepath string = "output/output.txt";

@input =
    EXTRACT col1 int,
            col2 DateTime,
            col3 string
    FROM @inputFilepath
    USING Extractors.Csv(skipFirstNRows : 1);

@output =
    SELECT *
    FROM @input
    UNION ALL
    SELECT *
    FROM(
        VALUES
        (
            2,
            DateTime.Now,
            "some string"
        ) ) AS x (col1, col2, col3);

OUTPUT @output
TO @outputFilepath
USING Outputters.Csv(quoting : false, outputHeader : true);
If you want further control, you can do some things via the PowerShell SDK, e.g. test whether an item exists:
Test-AdlStoreItem -Account $adls -Path "/data.csv"
or move an item with Move-AzureRmDataLakeStoreItem. More details here:
Manage Azure Data Lake Analytics using Azure PowerShell
My use case: push data from a stream configured in the ESB to BAM and create a report using the "Gadget Generation Tool".
Publishing the stream from ESB to BAM worked fine after adding an agent to the proxy service.
From the stream I created a table using the Analytics->Add screen, and the table seems to persist, as I am able to do a select and see results from the same screen.
Now I am trying to generate a dashboard using the Gadget Generation Tool, but the table is not available; the JDBC connection works fine, yet the table is nowhere to be found:
Script for Analytic Table run from Analytics->Add screen
CREATE EXTERNAL TABLE IF NOT EXISTS CREDITTABLE(creditkey STRING, creditFlag STRING, version STRING)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES ( "cassandra.host" = "127.0.0.1" ,
"cassandra.port" = "9163" , "cassandra.ks.name" = "EVENT_KS" ,
"cassandra.ks.username" = "admin" ,
"cassandra.ks.password" = "admin" ,
"cassandra.cf.name" = "firstStream" ,
"cassandra.columns.mapping" = ":key,payload_k1-constant, Version" );
I tried looking for the table in the following databases:
jdbc:h2:repository/database/WSO2CARBON_DB;AUTO_SERVER=TRUE
jdbc:h2:repository/database/metastore_db;AUTO_SERVER=TRUE
jdbc:h2:repository/database/samples/BAM_STATS_DB;AUTO_SERVER=TRUE
Have not done any custom db configurations.
Did you try jdbc:h2:repository/database/samples/WSO2CARBON_DB;AUTO_SERVER=TRUE? Also, what you have pasted is the Cassandra storage definition, which is probably used for getting the input, not for persisting the output. If you give the full Hive query, that would help to figure out the problem.
Why did I not see the table in the Gadget Generation tool?
The table I created using the Hive script is a Cassandra distributed database table, while the references I gave in the Gadget Generation tool when looking up the table were to H2 RDBMS database tables.
Below are the references to the H2 RDBMS database that comes out of the box with WSO2:
jdbc:h2:repository/database/WSO2CARBON_DB;AUTO_SERVER=TRUE
jdbc:h2:repository/database/metastore_db;AUTO_SERVER=TRUE
jdbc:h2:repository/database/samples/BAM_STATS_DB;AUTO_SERVER=TRUE
Resolution ----- How do you get tables listed in the Gadget Generation tool?
To get the tables listed in the Gadget Generation tool you have to use Hive scripts to complete the following 3 steps:
Create a Hive table reference for the Cassandra data stream to which data is pushed (from the ESB in my case).
CREATE EXTERNAL TABLE IF NOT EXISTS CREDITTABLE(
payload_creditkey STRING, payload_creditFlag STRING, payload_version STRING) STORED BY
'org.apache.hadoop.hive.cassandra.CassandraStorageHandler' WITH SERDEPROPERTIES ( "cassandra.host" = "127.0.0.1" ,
"cassandra.port" = "9163" , "cassandra.ks.name" = "EVENT_KS" , "cassandra.ks.username" = "admin" , "cassandra.ks.password" = "admin" ,
"cassandra.cf.name" = "firstStream" , "cassandra.columns.mapping" = ":key,payload_k1-constant, Version" );
Using a Hive script, create an H2 RDBMS table reference to which the data from the Cassandra stream will be copied.
CREATE EXTERNAL TABLE IF NOT EXISTS CREDITTABLEh2summary(
creditFlg STRING,
verSion STRING
)
STORED BY
'org.wso2.carbon.hadoop.hive.jdbc.storage.JDBCStorageHandler'
TBLPROPERTIES (
'mapred.jdbc.driver.class' = 'org.h2.Driver' ,
'mapred.jdbc.url' = 'jdbc:h2:C:/wso2bam-2.2.0/repository/samples/database/BAM_STATS_DB' ,
'mapred.jdbc.username' = 'wso2carbon' ,
'mapred.jdbc.password' = 'wso2carbon' ,
'hive.jdbc.update.on.duplicate' = 'true' ,
'hive.jdbc.primary.key.fields' = 'creditFlg' ,
'hive.jdbc.table.create.query' = 'CREATE TABLE CREDITTABLE_newh2(creditFlg VARCHAR(100), version VARCHAR(100))' );
Write a Hive query that copies the data from Cassandra to H2 (RDBMS):
insert overwrite table CREDITTABLEh2summary select a.payload_creditFlag,a.payload_version from CREDITTABLE a;
After doing this I was able to see the table in the Gadget Generation tool; however, I also had to change the reference to the H2 database to an absolute path in the JDBC URL value that I passed.
Observation:
I was wondering whether the Gadget Generation tool can directly point to the Cassandra stream without having to copy the tables to an RDBMS database.