DLP data scan from BigQuery table showing start byte as null - google-cloud-platform

I have scanned a BigQuery table from the Google DLP console. The scan results are saved back into a BigQuery table. DLP has identified sensitive information, but the start byte is shown as null. Can anyone help me understand why?
The source data looks as follows:
2,james#example.org ,858-333-0333,333-33-3333,8
3,mallory#example.org,858-222-0222,222-22-2222,8
4,maria#example.org ,858-444-0444,444-44-4444,1
------------------------------
If I put the same data in a Cloud Storage bucket and then perform a scan using DLP, I do get the start and end bytes for the sensitive data.

Thanks folks, the product team is investigating. What's happening is that "0" is mapping to null "by accident" due to a proto to BQ schema conversion bug on our end. We'll address this.

Unfortunately, this looks like a bug.
I was able to reproduce your issue completely; I followed these steps:
created a source CSV file:
1,mail1#test.com,858-333-0333,333-33-3333,8
2,epaweda-8101#yopmail.com,858-333-0334,333-33-3334,3
3,petersko#live.com,858-333-0335,333-33-3335,5
4,danneng#gmail.com,858-333-0336,333-33-3336,1
5,chance#icloud.com,858-333-0337,333-33-3337,4
imported it into a BQ table;
ran DLP on it and got the same result, with a null start-byte column.
In my opinion this is a bug (it certainly looks like one), so my recommendation would be to report it on Google's Issue Tracker (with as much detail as possible) and wait for an answer.
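In the meantime, you can at least confirm which field is affected by querying the findings table directly. Here is a minimal sketch, assuming the default schema DLP uses when saving findings to BigQuery (an info_type record plus a location record with byte_range.start and byte_range.end); the table name below is a placeholder, so adjust the names to match your output table:
SELECT
  info_type.name AS info_type,
  likelihood,
  location.byte_range.start AS start_byte,   -- the field that comes back as NULL instead of 0
  location.byte_range.`end` AS end_byte
FROM `your_project.your_dataset.dlp_findings`
LIMIT 100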

Related

Trying to find how a BigQuery table was deleted by searching the audit log

A BigQuery table was accidentally deleted. Fortunately, we sink all our BQ audit logs into a dataset.
But I'm seeing some unexpected results. I was not seeing any delete operations for the table, so I broadened the scope of the query and found I could not see any operations on the table at all in the last 90 days.
I want to confirm my query is doing what I think it does. If this returns nothing, does it really mean this table has not been touched in the last 90 days?
WHERE DATE(timestamp) > DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY) AND
  resource.labels.project_id = "myproject" AND
  resource.type = 'bigquery_resource' AND
  protopayload_auditlog.resourceName LIKE '%MyTable%'
LIMIT 10
I should add that if I swap MyTable for another table in the above query I do get results, so I don't think it's a syntax issue.
Thinking about this more: could it be that the table was truncated in a way that was not considered an "admin" operation?
We sink the following logs into the dataset I'm searching:
cloudaudit_googleapis_com_activity
cloudaudit_googleapis_com_data_access
cloudaudit_googleapis_com_system_event
The syntax looks OK. I recommend trying larger intervals to confirm the query behaves as expected. Assuming the table is not empty, you should eventually see something as you increase the number of days.
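For example, something along these lines (a sketch only - the dataset path below is a placeholder, so point it at your own activity sink table); grouping by method name over a much wider window gives a quick overview of every operation that has touched the table:
SELECT
  protopayload_auditlog.methodName AS method_name,
  COUNT(*) AS operations
FROM `myproject.MyAuditLogs.cloudaudit_googleapis_com_activity`
WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 365 DAY)
  AND resource.type = 'bigquery_resource'
  AND protopayload_auditlog.resourceName LIKE '%MyTable%'
GROUP BY method_name
ORDER BY operations DESC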
I was right: it was a different operation that removed the table. I found it in the system event log; the table was removed due to an InternalTableExpired event.
SELECT
resource.labels.project_id,
protopayload_auditlog.resourceName,
protopayload_auditlog.methodName,
protopayload_auditlog.authenticationInfo.principalEmail,
protopayload_auditlog.requestMetadata.callerIp
FROM
`bombora-bi-prod.BomboraAuditLogs.cloudaudit_googleapis_com_system_event`
WHERE
protopayload_auditlog.resourceName LIKE '%datasets/MyDataset/tables/MyTable%'
LIMIT 100

AWS Glue Studio 'datediff'

Has anyone tried using AWS Glue Studio and the custom SQL queries? I am currently trying to find the difference in days between two dates like so:
select
datediff(currentDate, expire_date) as days_since_expire
But in the data preview window I get an
AnalysisException: cannot resolve 'currentDate' given input columns: []; line 3 pos 9; 'Project ['datediff('nz_eventdate, 'install_date) AS days_since_install#613] +- OneRowRelation
Does anyone know how to fix this or what causes it?
You don't write PostgreSQL, T-SQL, PL/SQL (or any other flavor of) SQL; instead, "you enter the Apache SparkSQL query". Read the following carefully:
Using a SQL query to transform data (in AWS Glue "SQL Query" transform task)
https://docs.aws.amazon.com/glue/latest/ug/transforms-sql.html
The functions you can use in the AWS Glue "SQL Query" transform task to achieve the desired transformation are listed here (follow the correct syntax):
https://spark.apache.org/docs/latest/sql-ref-functions-builtin.html
BTW: the error you posted does not correspond to your select statement (it references different columns), but I am writing this answer anyway for the sake of your question heading and others who may come here.
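For what it's worth, a minimal Spark SQL version of what you are after would look like the following. current_date() and datediff() are Spark SQL built-ins; myDataSource is assumed to be the alias of the input node in your Glue Studio SQL transform, so adjust it to whatever alias your transform actually uses:
select
  datediff(current_date(), expire_date) as days_since_expire
from myDataSource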

Group by existing attribute present in JSON string line in Apache Beam Java

I am reading JSON files from GCS and I have to load the data into different BigQuery tables. These files may have multiple records for the same customer with different timestamps. I have to pick the latest one for each customer. I am planning to achieve this as below:
Read files
Group by customer id
Apply a DoFn to compare the timestamps of the records in each group and keep only the latest one
Flatten it, convert to table rows, and insert into BQ.
But I am unable to proceed with step 1. I see GroupByKey.create(), but I am unable to make it use the customer id as the key.
I am implementing this in Java. Any suggestions would be of great help. Thank you.
Before you GroupByKey you need to have your dataset in key-value pairs. It would be good if you had shown some of your code, but without knowing much, you'd do the following:
PCollection<JsonObject> objects = p.apply(FileIO.read(....)).apply(FormatData...);
// Once we have the data in JsonObjects, we key by customer ID and group:
PCollection<KV<String, Iterable<JsonObject>>> groupedData =
    objects
        .apply(MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptor.of(JsonObject.class)))
            .via((JsonObject elm) -> KV.of(elm.getString("customerId"), elm)))
        .apply(GroupByKey.create());
Once that's done, you can check the timestamps and discard all but the most recent, as you were thinking.
Note that you will need to set coders, etc. - if you get stuck with that, we can iterate.
As a hint / tip, you can consider this example of a Json Coder.

Google BigQuery: Join of two external tables fails if one of them is empty

I have 2 external tables in BigQuery, created on top of JSON files on Google Cloud Storage. The first one is a fact table, the second holds error data - and it might or might not be empty.
I can query each table separately just fine, even the empty one - here is an empty table query result example.
I'm also able to left join them if both of them are not empty.
However, if the errors table is empty, my query fails with the following error:
The query specified one or more federated data sources but not all of them were scanned. It usually indicates incorrect uri specification or a 'limit' clause over a union of federated data sources that was satisfied without having to read all sources.
This situation isn't covered anywhere in the docs, and it's not related to this versioning issue - Reading BigQuery federated table as source in Dataflow throws an error
I'd rather avoid converting either of these tables to native tables, since they are used in just one step of the ETL process and the data is dropped afterwards. One of them being empty doesn't look like an exceptional situation, since a plain select works just fine.
Is some workaround possible?
UPD: raised an issue with Google, waiting for response - https://issuetracker.google.com/issues/145230326
It feels like a bug. One workaround is to use scripting to avoid querying the empty table:
DECLARE is_external_table_empty BOOL DEFAULT
  (SELECT 0 = (SELECT COUNT(*) FROM your_external_table));
-- do things differently when is_external_table_empty is true
IF is_external_table_empty = true THEN
  ...
ELSE
  ...
END IF;
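For instance, here is a hedged sketch of how the two branches might look, using hypothetical table and column names (your_fact_table, your_external_table, id, error_message) that you would replace with your own; the first branch reads the fact table alone when the error table is empty, and the second runs the usual LEFT JOIN:
DECLARE is_external_table_empty BOOL DEFAULT
  (SELECT 0 = (SELECT COUNT(*) FROM your_external_table));

IF is_external_table_empty = true THEN
  -- the error table is empty: read the fact table alone, padding the error column with NULL
  SELECT f.*, CAST(NULL AS STRING) AS error_message
  FROM your_fact_table AS f;
ELSE
  -- both tables have data, so the LEFT JOIN is safe
  SELECT f.*, e.error_message
  FROM your_fact_table AS f
  LEFT JOIN your_external_table AS e ON f.id = e.id;
END IF;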

DataPrep: access to source filename

Is there a way to create a column with the filename of the source file that created each row?
Use case: I would like to track which file in a GCS bucket resulted in the creation of which row in the resulting dataset. I would like a scheduled transformation of the files contained in a specific GCS bucket.
I've looked at the "metadata article" on GCP, but it is pretty useless for my use case.
UPDATED: I have opened a feature request with Google.
While they haven't closed that issue yet, this was part of the update last week.
There's now a source metadata reference called $filepath, which, as you would expect, stores the local path to the file in Cloud Storage (starting at the top-level bucket). You can use it in formulas or add it to a new formula column and then do anything you want in additional recipe steps.
There are some caveats, such as it not returning a value for BigQuery sources and not persisting through pivot, join, or unnest, but it covers the vast majority of use cases handily; in other cases you just need to materialize it before one of those destructive transforms.
NOTE: If your data source sample was created before this feature, you'll need to generate a new sample in order to see it in the interface (instead of just NULL values).
Full notes for these metadata fields are available here: https://cloud.google.com/dataprep/docs/html/Source-Metadata-References_136155148