Using Dataprep to write to just a date partition in a date partitioned table - google-cloud-platform

I'm using a BigQuery view to fetch yesterday's data from a BigQuery table and then trying to write into a date partitioned table using Dataprep.
My first issue was that Dataprep would not correctly pick up DATE type columns, but converting them to TIMESTAMP works (thanks Elliot).
However, when using Dataprep and setting an output BigQuery table, you only have three options: Append, Truncate, or Drop existing table. If the table is date-partitioned and you use Truncate, it removes all existing data, not just the data in that partition.
Is there another way to do this that I should be using? My alternative is using Dataprep to overwrite a table and then using Cloud Composer to run some SQL pushing this data into a date partitioned table. Ideally, I'd want to do this just with Dataprep but that doesn't seem possible right now.
BigQuery table schema and partition details (screenshots omitted).
The data I'm ingesting is simple. In one flow:
+------------+--------+
| date       | name   |
+------------+--------+
| 2018-08-08 | Josh1  |
| 2018-08-08 | Josh2  |
+------------+--------+
In the other flow:
+------------+--------+
| date       | name   |
+------------+--------+
| 2018-08-09 | Josh1  |
| 2018-08-09 | Josh2  |
+------------+--------+
It overwrites the data in both cases.

You can create a partitioned table based on DATE. Data written to a partitioned table is automatically delivered to the appropriate partition based on the date value (expressed in UTC) in the partitioning column.
Use Append so that new data is added to the appropriate partitions.
You can create the table using the bq command:
bq mk --table --expiration [INTEGER] --schema [SCHEMA] --time_partitioning_field date [DATASET].[TABLE]
The --time_partitioning_field flag defines which column is used for the partitions.
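If you need to replace a single day's data rather than append, one option outside Dataprep (for example, from the Cloud Composer step mentioned in the question) is to write the query results to a partition decorator (table$YYYYMMDD) with a truncate disposition, which replaces only that partition. Below is a minimal sketch using the BigQuery Python client; the project, dataset, table and view names are placeholders, not values from the question.
from google.cloud import bigquery

client = bigquery.Client()

# Destination is one daily partition: "<table>$YYYYMMDD".
# WRITE_TRUNCATE against a partition decorator replaces only that
# partition, leaving the rest of the table untouched. The query must
# only produce rows belonging to that date.
dest = bigquery.TableReference(
    bigquery.DatasetReference("my-project", "my_dataset"), "my_table$20180808"
)

job_config = bigquery.QueryJobConfig(
    destination=dest,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

sql = """
SELECT date, name
FROM `my-project.my_dataset.yesterdays_data_view`
"""

client.query(sql, job_config=job_config).result()  # wait for the job to finish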

Related

AWS DynamoDB: High read latency using Query on GSI

I have a DynamoDB table which contains Date, City and other attributes as columns. I have configured a GSI with Date as the hash key. The table contains 27 attributes from 350 cities, recorded daily.
+------------+---------+------------+-------------+
| Date       | City    | Attribute1 | Attribute27 |
+------------+---------+------------+-------------+
| 25-06-2020 | Boston  | someValue  | someValue   |
| 25-06-2020 | NY      | someValue  | someValue   |
| 25-06-2020 | Chicago | someValue  | someValue   |
+------------+---------+------------+-------------+
I have a Lambda proxy integration set up in API Gateway. The Lambda function receives a 7-day date range in the request. Each date in this range is used to query DynamoDB (using a QueryInput) to get all the items for that day. The results for each day are consolidated for the week and sent back as a JSON response.
The latency seen in Postman is around 1.5 s, even after increasing the Lambda memory to 1024 MB (only 76 MB is actually consumed).
Is there any way to improve the performance? The DynamoDB table is already running in on-demand capacity mode.
You don't say whether you are running the per-day queries in parallel. If not, do so (see the sketch below).
You also don't say what CloudWatch shows for query latency; as Marcin mentioned, DAX can help reduce that.
You also don't mention what CloudWatch shows for the Lambda execution itself; there are various articles about optimizing Lambda.
Whatever's left is networking, and there's not much you can do about that. One piece to consider is reusing DB connections/clients across Lambda invocations.
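For illustration, here is a minimal sketch of both suggestions (parallel per-day queries and client reuse) as a Python Lambda handler. The table name CityMetrics, the GSI name Date-index, and the event shape are assumptions; pagination and error handling are omitted.
from concurrent.futures import ThreadPoolExecutor

import boto3
from boto3.dynamodb.conditions import Key

# Created once at module load, so warm Lambda invocations reuse the same
# client and connections instead of re-establishing them on every request.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("CityMetrics")          # hypothetical table name


def query_day(date_str):
    # One Query against the GSI returns all cities for a single day.
    resp = table.query(
        IndexName="Date-index",                # hypothetical GSI name
        KeyConditionExpression=Key("Date").eq(date_str),
    )
    return resp["Items"]                       # pagination omitted for brevity


def handler(event, context):
    dates = event["dates"]                     # e.g. 7 date strings for the week
    # Issue the per-day queries in parallel rather than one after another.
    with ThreadPoolExecutor(max_workers=len(dates)) as pool:
        per_day = list(pool.map(query_day, dates))
    # Consolidate the week's items into a single response body.
    return [item for day in per_day for item in day]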

How does S3 Select pricing work? What do data returned and data scanned mean in S3 Select?

I have 1M rows of CSV data. If I select 10 rows, will I be billed for only 10 rows? What do data returned and data scanned mean in S3 Select?
There is little documentation on these S3 Select terms.
To keep things simple, let's forget for a moment that S3 Select can read data in a columnar way. Suppose you have the following data:
| City      | Last Updated Date |
|-----------|-------------------|
| London    | 1st Jan           |
| London    | 2nd Jan           |
| New Delhi | 2nd Jan           |
A query that fetches the rows with the latest update date (2nd Jan) forces S3 Select to scan all 3 records but returns only 2 of them.
A query that selects the city where the last updated date is 1st Jan will also scan all 3 rows but return only 1 string: "London".
So, depending on your query, S3 Select may scan more data (3 rows) than it returns (2 rows). That is the difference between Data Scanned and Data Returned.
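These two quantities are what S3 Select billing is based on: you pay per GB of data scanned and per GB of data returned, not per row. Below is a minimal boto3 sketch that runs the second query above and reads both numbers from the Stats event; the bucket and key names are placeholders.
import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="my-bucket",                       # hypothetical bucket
    Key="cities.csv",                         # hypothetical object key
    ExpressionType="SQL",
    Expression="""SELECT s.City FROM s3object s WHERE s."Last Updated Date" = '1st Jan'""",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

for event in resp["Payload"]:
    if "Records" in event:
        # The matching rows -- this is the "data returned".
        print(event["Records"]["Payload"].decode())
    elif "Stats" in event:
        details = event["Stats"]["Details"]
        # BytesScanned is what S3 had to read; BytesReturned is what it sent back.
        print("scanned:", details["BytesScanned"], "returned:", details["BytesReturned"])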

Querying on BigQuery repeated fields

Below is the schema of my BigQuery table. I am selecting the sentence_id, store and BU_Model fields and inserting the data into another BigQuery table. The resulting columns in the new table are INTEGER, REPEATED and REPEATED respectively.
I want to flatten/unnest the repeated fields so that they are created as STRING fields in my second table. How can this be achieved using Standard SQL?
+- sentences: record (repeated)
| |- sentence_id: integer
| |- autodetected_language: string
| |- processed_language: string
| +- attributes: record
| | |- agent_rating: integer
| | |- store: string (repeated)
| +- classifications: record
| | |- BU_Model: string (repeated)
The query that I am using to create the second table is as below. I would want to query on the BU_Model as a STRING column.
SELECT sentence_id, a.attributes.store, a.classifications.BU_Model
FROM staging_table, UNNEST(sentences) a
Expected Output should look like:
Staging table:
41783851  regions   Apparel
          district  Footwear
12864656  regions
          district
Final Target Table:
41783851  regions   Apparel
41783851  regions   Footwear
41783851  district  Apparel
41783851  district  Footwear
12864656  regions
12864656  district
I tried the query below and it seems to work as expected, but this means I would have to unnest every repeated field. My table in BigQuery has 50+ repeated columns. Is there an easier way around this?
SELECT
  sentence_id,
  flattened_stores,
  flattened_Model
FROM `staging`
LEFT JOIN UNNEST(sentences) a
LEFT JOIN UNNEST(a.attributes.store) AS flattened_stores
LEFT JOIN UNNEST(a.classifications.BU_Model) AS flattened_Model
Assuming you still want three columns in your output, with the arrays flattened into strings:
SELECT sentence_id,
  ARRAY_TO_STRING(a.attributes.store, ',') store,
  ARRAY_TO_STRING(a.classifications.BU_Model, ',') BU_Model
FROM staging_table, UNNEST(sentences) a
UPDATE to address recent changes in the question
In BigQuery Standard SQL, using LEFT JOIN UNNEST() (as you did in your last query) is the most reasonable way to get the result you want.
In BigQuery Legacy SQL, you can use the FLATTEN syntax, but it has the same drawback: you need to repeat it for all 50+ columns.
Very simplified example:
#legacySQL
SELECT sentence_id, store, BU_Model
FROM (FLATTEN([project:dataset.stage], BU_Model))
Conclusion: I would go with the LEFT JOIN UNNEST() approach.

Power BI: incremental data load using an OData feed

Is there any way to save the previous data before it is overwritten by a refresh?
Steps I have done:
I created a table and appended it to table A.
Created a column called DateTime with the function
DateTime.LocalNow()
Now my problem is how to save the previous data before the refresh phase. I need to preserve both the previous data and its timestamps.
Example:
Before refreshing:
Table A:
| Columnname x | DateTime         | ...
| value        | 23.03.2016 23:00 |
New Table:
| Columnname x | DateTime         | ...
| value        | 23.03.2016 23:00 |
After refreshing:
Table A:
| Columnname x | DateTime         | ...
| value        | 23.03.2016 23:00 |
| value 2      | 23.03.2016 23:01 |
New Table:
| Columnname x | DateTime         | ...
| value        | 23.03.2016 23:00 |
| value 2      | 23.03.2016 23:01 |
kind regards
Incremental refresh in the Power BI Service or Power BI Desktop isn't currently supported, but please vote for this feature. (Update: see that link for info on a preview feature that does this.)
If you need this behavior, you need to load these rows into a database and then incrementally load the database. The load into Power BI will still be a full load of the table(s).
This is now available in Power BI Premium.
From the docs:
Incremental refresh enables very large datasets in the Power BI Premium service with the following benefits:
Refreshes are faster. Only data that has changed needs to be refreshed. For example, refresh only the last 5 days of a 10-year dataset.
Refreshes are more reliable. For example, it is not necessary to maintain long-running connections to volatile source systems.
Resource consumption is reduced. Less data to refresh reduces overall consumption of memory and other resources.

Sync Framework: Map single table into multiple tables

I have two tables like the following:
On server:
| Orders Table | OrderDetails Table |
|--------------|--------------------|
| Id           | Id                 |
| OrderDate    | OrderId            |
| ServerName   | Product            |
|              | Quantity           |
On client:
| Orders Table | OrderDetails Table |
|--------------|--------------------|
| Id           | Id                 |
| OrderDate    | OrderId            |
|              | Product            |
|              | Quantity           |
|              | ClientName         |
I need to sync the [Server].[Orders Table].[ServerName] to [Client].[OrderDetails Table].[ClientName]
The Question:
What is the correct and efficient way of doing this?
I know deprovisioning and re-provisioning with a different configuration is one way of doing it.
So I just want to know the correct way.
Thanks.
EDIT :
Other columns of each table should sync normally ([Server].[Orders Table].[Id] to [Client].[Orders Table].[Id] ...).
And the mapping strategy sometimes changes based on the row of data (which side is sending/receiving).
Sync Framework is not an ETL tool; simply put, its DB sync is per table.
If you really want to force it to do what you want, you can intercept the ChangesSelected event for the OrderDetails table, look up the extra column from the other table, and then dynamically add the column to the change dataset before it gets applied on the other side.
See this link on how to manipulate the change dataset.