BigQuery CSV file load with an additional column with a default value - google-cloud-platform

From the example given by Google, I have managed to load CSV files into a BigQuery (BQ) table by following the guide (link and code below).
Now I want to load several files into BQ and add a new column, filename, which contains the file name.
Is there a way to add a column with default data?
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv
// Import the Google Cloud client libraries
const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

// Instantiate clients
const bigquery = new BigQuery();
const storage = new Storage();

/**
 * This sample loads the CSV file at
 * https://storage.googleapis.com/cloud-samples-data/bigquery/us-states/us-states.csv
 *
 * TODO(developer): Replace the following lines with the path to your file.
 */
const bucketName = 'cloud-samples-data';
const filename = 'bigquery/us-states/us-states.csv';

async function loadCSVFromGCS() {
  // Imports a GCS file into a table with manually defined schema.

  /**
   * TODO(developer): Uncomment the following lines before running the sample.
   */
  // const datasetId = 'my_dataset';
  // const tableId = 'my_table';

  // Configure the load job. For full list of options, see:
  // https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad
  const metadata = {
    sourceFormat: 'CSV',
    skipLeadingRows: 1,
    schema: {
      fields: [
        {name: 'name', type: 'STRING'},
        {name: 'post_abbr', type: 'STRING'},
        // {name: 'filename', type: 'STRING', value=filename} // I WANT TO ADD A COLUMN WITH THE FILE NAME HERE
      ],
    },
    location: 'US',
  };

  // Load data from a Google Cloud Storage file into the table
  const [job] = await bigquery
    .dataset(datasetId)
    .table(tableId)
    .load(storage.bucket(bucketName).file(filename), metadata);

  // load() waits for the job to finish
  console.log(`Job ${job.id} completed.`);

  // Check the job's status for errors
  const errors = job.status.errors;
  if (errors && errors.length > 0) {
    throw errors;
  }
}

I would say you have a few choices.
Add a column to the CSV before uploading, e.g. with awk or preprocessing in JS.
Add the individual CSV files to separate tables. You can easily query across many tables as one in BigQuery. This way you can easily see which data comes from which file, and you can access the table metadata for the file name.
Post-process the data by adding the column after the data is loaded, with normal SQL/API calls (see the sketch after this list).
See also this possible duplicate: How to add new column with metadata value to csv when loading it to bigquery
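For the post-processing option, here is a minimal sketch, assuming the Node.js client from the sample above, standard SQL, and that the load job has already finished; the column name filename and the dataset/table IDs are placeholders, not something prescribed by the sample:

// Post-process: add a nullable column and backfill it with the source file name.
// Assumes `bigquery`, `datasetId`, `tableId`, and `filename` from the sample above are defined.
async function addFilenameColumn() {
  // Add the column if it is not there yet (columns added this way are NULLABLE).
  await bigquery.query({
    query: `ALTER TABLE \`${datasetId}.${tableId}\` ADD COLUMN IF NOT EXISTS filename STRING`,
  });

  // Backfill only the freshly loaded rows (their filename is still NULL).
  await bigquery.query({
    query: `UPDATE \`${datasetId}.${tableId}\` SET filename = @filename WHERE filename IS NULL`,
    params: {filename},
  });
}

Because the new column is nullable, rows loaded from earlier files keep whatever filename value they already have; only the NULL rows from the latest load are backfilled.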

According to BigQuery’s documentation [1], there is no option to set a default value for columns. The closest option without any post-processing would be to use a NULL value for nullable columns.
However, a possible post-processing workaround would be to create a view of the raw table and add a script that maps the NULL value to any default value. Here’s some information about scripting in BigQuery [2].
If pre-processing is possible, adding the value to the source file would be easy to achieve with any scripting language.
I think static and function-based default values would be a good feature for BigQuery in the future.
[1] - https://cloud.google.com/bigquery/docs
[2] - https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting
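As a concrete illustration of the view workaround described above, here is a minimal sketch with the Node.js client; the dataset, table, and view names are placeholders, and it assumes the raw table already has a nullable filename column that IFNULL maps to a default value:

// Create a view over the raw table that replaces NULL with a default value.
const {BigQuery} = require('@google-cloud/bigquery');
const bigquery = new BigQuery();

async function createDefaultingView() {
  await bigquery.dataset('my_dataset').createTable('my_table_with_defaults', {
    view: {
      query: `SELECT name, post_abbr, IFNULL(filename, 'unknown.csv') AS filename
              FROM \`my_dataset.my_table\``,
      useLegacySql: false,
    },
  });
}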

You have multiple options:
you could rebuild your CSV with the file name as column data
you can load the data into a temporary table, then move it to the final table in a second step, filling in the missing file name column
convert the example to use an external table, where _FILE_NAME is available as a pseudocolumn, and later query it and move the rows to a final table (see the sketch below). See more about this here.
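A minimal sketch of the external-table option with the Node.js client; the bucket, dataset, and table names are placeholders. _FILE_NAME is a pseudocolumn BigQuery exposes only on external tables, so it has to be selected into a real column when materializing:

// Define an external table over the CSV files in GCS, then materialize it
// into a final table together with each row's source file name.
const {BigQuery} = require('@google-cloud/bigquery');
const bigquery = new BigQuery();

async function loadWithFileName() {
  // 1) External table pointing at the CSV files.
  await bigquery.dataset('my_dataset').createTable('states_ext', {
    externalDataConfiguration: {
      sourceFormat: 'CSV',
      sourceUris: ['gs://my-bucket/states/*.csv'],
      csvOptions: {skipLeadingRows: 1},
      schema: {
        fields: [
          {name: 'name', type: 'STRING'},
          {name: 'post_abbr', type: 'STRING'},
        ],
      },
    },
  });

  // 2) Query the pseudocolumn and write the result to the final table.
  await bigquery.query({
    query: `CREATE OR REPLACE TABLE \`my_dataset.my_table\` AS
            SELECT name, post_abbr, _FILE_NAME AS filename
            FROM \`my_dataset.states_ext\``,
  });
}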


How to copy files from gdrive to s3 bucket using google scripts?

I created a Google Form with a linked Google Spreadsheet. I would like the spreadsheet to be copied to an S3 bucket in AWS every time someone submits the form. To do so, I just got started with Google Apps Script. I managed to get the trigger part working on form submit, but I am struggling to understand the readme of this GitHub project for uploading to S3.
function setUpTrigger() {
  ScriptApp.newTrigger('copyDataS3')
    .forForm('1SK-2Ow63vs_TaoF54UjSgn35FL7F8_ANHDTOOiTabMM')
    .onFormSubmit()
    .create();
}
function copyDataS3() {
  // https://github.com/viuinsight/google-apps-script-for-aws
  // I do not understand where I should place aws.js and util.js.
  // Should I do File -> New -> Script file and copy-paste the contents? Should the file be .js or .gs?
  S3.init("MY_ACCESS_KEY", "MY_SECRET_KEY");
  // If I want to copy a spreadsheet with the following id, what should go into "object" below?
  var ssID = "SPREADSHEET_ID";
  S3.putObject(bucketName, objectName, object, region)
}
I believe your goal is as follows.
You want to send a Google Spreadsheet to an S3 bucket as CSV data using Google Apps Script.
Modification points:
Looking at the google-apps-script-for-aws library you are using, I noticed that the data is sent as a string. In that case, your CSV data can probably be sent directly. However, when you want to send binary data, an error will occur. So in this answer, I would like to propose modified scripts for 2 patterns.
I thought the situation might be similar to this thread, but I noticed you are using a different library from that thread, so I am posting this answer.
Pattern 1:
In this pattern, it is supposed that only text data is sent, such as the CSV data you mentioned in your reply. In this case, I think the library does not need to be modified.
Modified script:
S3.init("MY_ACCESS_KEY", "MY_SECRET_KEY"); // Please set this.
var spreadsheetId = "###"; // Please set the Spreadsheet ID.
var sheetName = "Sheet1"; // Please set the sheet name.
var region = "###"; // Please set this.
var csv = SpreadsheetApp
.openById(spreadsheetId)
.getSheetByName(sheetName)
.getDataRange()
.getValues() // or .getDisplayValues()
.map(r => r.join(","))
.join("\n");
var blob = Utilities.newBlob(csv, MimeType.CSV, sheetName + ".csv");
S3.putObject("bucketName", "test.csv", blob, region);
Pattern 2:
In this pattern, it is supposed that both text data and binary data can be sent. In this case, the library itself also needs to be modified.
For google-apps-script-for-aws
Please modify line 110 in s3.js as follows.
From:
var content = object.getDataAsString();
To:
var content = object.getBytes();
And please modify line 146 in s3.js as follows.
From:
Utilities.DigestAlgorithm.MD5, content, Utilities.Charset.UTF_8));
To:
Utilities.DigestAlgorithm.MD5, content));
For Google Apps Script:
In this case, please give the blob to S3.putObject as follows.
Script:
S3.init("MY_ACCESS_KEY", "MY_SECRET_KEY"); // Please set this.
var fileId = "###"; // Please set the file ID.
var region = "###"; // Please set this.
var blob = DriveApp.getFileById(fileId).getBlob();
S3.putObject("bucketName", blob.getName(), blob, region);
References:
viuinsight/google-apps-script-for-aws
Class UrlFetchApp
computeDigest(algorithm, value)
PutObject

Cannot create index on non-empty table

I'm currently using AWS Lambda (NodeJS) with AWS QLDB.
The scenario is like this.
I had the first table and its indexes when I deployed the service, so that table and its indexes were created. My problem is that once I need to add a new table and its indexes, the index can't be created because there's already an existing table.
My workaround for creating a new table even if there's an existing table in my ledger is to query the list of tables I have.
const getTables = async (transactionExecutor: TransactionExecutor) => {
  const statement = `SELECT name FROM information_schema.user_tables`;
  return await transactionExecutor.execute(statement);
};
Then I have this condition to check whether the table already exists:
const tables = JSON.stringify(result.getResultList());
if (
  !JSON.parse(tables).some((object): boolean => object.name === process.env.TABLE_NAME)
) {
  console.log('TABLE A NOT EXISTING');
  await createTable(transactionExecutor, process.env.TABLE_NAME);
}
if (
  !JSON.parse(tables).some(
    (object): boolean => object.name === process.env.TABLE_NAME_1,
  )
) {
  console.log('TABLE B NOT EXISTING');
  await createTable(transactionExecutor, process.env.TABLE_NAME_1);
}
I don't know how to do the same for indexes; I tried using SQL commands in QLDB, but it's not working.
I hope you can help me.
Thank you
I'm not quite sure what your question is (the post title and body hint at different things), but I'm going to do my best to answer.
First, QLDB stores data in Ion, not JSON. So, please use the Ion APIs to parse data and not the JSON ones. The reason your code works at all is because Ion is a superset of JSON and the result set doesn't include types that are unknown to JSON. So, for example, if the result set was changed to include an Ion Timestamp, then your code would break.
Next, actually getting a list of tables has first class support in the driver. Simply use driver.getTableNames.
Third, I think you have a question "can I add an index to a non-empty table?". The answer is "no". This is planned functionality and I will update this answer when it is available. UPDATE: Now you can! https://aws.amazon.com/about-aws/whats-new/2020/09/amazon-qldb-launches-index-improvements/
Finally, I think you're also asking if there is a way to list indexes on a table in the same way as you can list tables in a ledger. The answer to that is 'yes'. The documents returned in information_schema.user_tables look like this:
{
  tableId: "...",
  name: "THE_TABLE_NAME",
  indexes: [
    {
      expr: "[THE_FIELD_BEING_INDEXED]"
    }
  ],
  status: "ACTIVE"
}
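Building on that document shape, here is a sketch of an index-existence check in the same style as your table check. It assumes the Node.js QLDB driver and its Ion dom values (get, stringValue, elements); the helper name ensureIndex and the table/field names are placeholders, not part of the driver:

// Create an index only if information_schema.user_tables does not already list it.
// `transactionExecutor` is the same executor used in the getTables helper above.
const ensureIndex = async (transactionExecutor, tableName, fieldName) => {
  const result = await transactionExecutor.execute(
    'SELECT * FROM information_schema.user_tables',
  );
  // Find the table document and inspect its `indexes` list using the Ion dom API.
  const table = result.getResultList().find(
    (doc) => doc.get('name') !== null && doc.get('name').stringValue() === tableName,
  );
  const indexes = table ? table.get('indexes') : null;
  const alreadyIndexed =
    indexes !== null &&
    indexes.elements().some((idx) => {
      const expr = idx.get('expr');
      return expr !== null && String(expr.stringValue()).includes(fieldName);
    });
  if (!alreadyIndexed) {
    await transactionExecutor.execute(`CREATE INDEX ON ${tableName} (${fieldName})`);
  }
};

You would call it inside the same transaction style as your createTable helper, for example from driver.executeLambda.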

Can we change location from US to other region while reading data from Bigquery using Bigquery java library?

I am trying to read data from BigQuery using the BigQuery Java library.
My dataset is not in the US location, so when I give my dataset name to the library, it throws an error saying the dataset was not found in the US location, because it searches the US location by default.
I have also tried setting the location with setLocation("asia-southeast1"), but it still looks in the US location.
This is my code snippet:
val bigquery: BigQuery = BigQueryOptions.newBuilder().setLocation("asia-southeast1").build().getService
val query = "SELECT TO_JSON_STRING(t, true) AS json_row FROM "+dbName+"."+tableName+" AS t"
logger.info("Query is " + query)
val queryResult: QueryJobConfiguration = QueryJobConfiguration.newBuilder(query).build
val result: TableResult = bigquery.query(queryResult)
I am writing the code in Scala. Since it uses the same libraries as Java and Java is more popular, I am asking this for Java.
Please help me understand how I can change the location from US to asia-southeast1.
Can I change something inside QueryJobConfiguration? I have searched a lot, but I am unable to find anything.
My only requirement is that I want the final result as a TableResult.
This is the exception being thrown
com.google.cloud.bigquery.BigQueryException: Not found: Dataset XXXXXXXX was not found in location US
at com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.translate(HttpBigQueryRpc.java:106)
at com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.getQueryResults(HttpBigQueryRpc.java:584)
at com.google.cloud.bigquery.BigQueryImpl$34.call(BigQueryImpl.java:1203)
at com.google.cloud.bigquery.BigQueryImpl$34.call(BigQueryImpl.java:1198)
at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:105)
at com.google.cloud.RetryHelper.run(RetryHelper.java:76)
at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
at com.google.cloud.bigquery.BigQueryImpl.getQueryResults(BigQueryImpl.java:1197)
at com.google.cloud.bigquery.BigQueryImpl.getQueryResults(BigQueryImpl.java:1181)
at com.google.cloud.bigquery.Job$1.call(Job.java:329)
at com.google.cloud.bigquery.Job$1.call(Job.java:326)
at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:105)
at com.google.cloud.RetryHelper.run(RetryHelper.java:76)
at com.google.cloud.RetryHelper.poll(RetryHelper.java:64)
at com.google.cloud.bigquery.Job.waitForQueryResults(Job.java:325)
at com.google.cloud.bigquery.Job.getQueryResults(Job.java:291)
at com.google.cloud.bigquery.BigQueryImpl.query(BigQueryImpl.java:1168)
...
Thanks in advance.
You shouldn't actually need to specify the location because BigQuery will infer it from the dataset being referenced in your query. See here.
When loading data, querying data, or exporting data, BigQuery determines the location to run the job based on the datasets referenced in the request. For example, if a query references a table in a dataset stored in the asia-northeast1 region, the query job will run in that region.
I just tested using the Java SDK on a dataset/table I created in asia-southeast1, and it worked without needing to explicitly specify the location.
If it's still not working for you by default (check that the table you're referencing actually exists), then you can specify the location by setting it on the JobId and passing that to the overloaded query method:
String query = "SELECT * FROM `grey-sort-challenge.asia_southeast1.a_table`;";
QueryJobConfiguration queryConfig = QueryJobConfiguration.newBuilder(query)
.setUseLegacySql(Boolean.FALSE)
.build();
JobId id = JobId.newBuilder().setLocation("asia-southeast1")
.setRandomJob()
.build();
try {
for (FieldValueList row : BIGQUERY.query(queryConfig, id).iterateAll()) {
for (FieldValue val : row) {
System.out.printf("%s,", val.toString());
}
System.out.printf("\n");
}
} catch (InterruptedException e) {
e.printStackTrace();
}

Drop column in Dynamo DB table

I've been looking through the AWS DynamoDB documentation and the Amazon DynamoDB interface, and it seems like there's no way to remove a column from a table, outside of deleting the entire table with its contents and starting over. Is that true?
If so, why would Amazon not support this?
Try removing all data from that column; it will automatically remove that column.
Using the document client with JavaScript, we can do this:
const paramsUpdate = {
  TableName: tableName,
  Key: { HashKey: 'hashKey' },
  UpdateExpression: 'remove #c',
  ExpressionAttributeNames: { '#c': 'columnName' }
};
documentClient.update(paramsUpdate, (errUpdate) => {
  if (errUpdate) log.error(errUpdate);
});
Here we set the UpdateExpression with a REMOVE clause.
There is a REMOVE action in the DynamoDB API.
DynamoDB does not have a schema definition, and so there is no such thing as a "column". It also means there is no way to delete all attributes with the same name without iterating over each record (a sketch of such an iteration follows below).
A solution I recommend is to keep these attributes, and to make your code refer to that same data using a fresh attribute name.
For example, attribute content could become content_v2. It might not look so clean, but it's cheap, quick and your old data would be backed up.
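If you do decide to physically remove the attribute from every item, here is a sketch of that iteration with the JavaScript DocumentClient (AWS SDK v2 promise style is assumed; the table name, key attribute, and attribute name are placeholders):

// Scan the table page by page and issue a REMOVE update for each item.
const AWS = require('aws-sdk');
const documentClient = new AWS.DynamoDB.DocumentClient();

async function removeAttributeFromAllItems(tableName, attributeName) {
  let lastEvaluatedKey;
  do {
    const page = await documentClient
      .scan({ TableName: tableName, ExclusiveStartKey: lastEvaluatedKey })
      .promise();
    for (const item of page.Items) {
      await documentClient
        .update({
          TableName: tableName,
          Key: { HashKey: item.HashKey }, // replace with your table's real key attribute(s)
          UpdateExpression: 'REMOVE #c',
          ExpressionAttributeNames: { '#c': attributeName },
        })
        .promise();
    }
    lastEvaluatedKey = page.LastEvaluatedKey;
  } while (lastEvaluatedKey);
}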
Setting all instances of the column value to null clears the column.
In C#, this method does the trick using the persistence framework:
static void RemoveColumn()
{
var myItems = context.ScanAsync<MyObjectType>(null).GetRemainingAsync().Result;
// Foreach item, update
myItems.ForEach(myObject =>
{
myObject.UnwantedColumn = null;
context.Save(myObject);
});
}
Just remove all the data for that one column. On my end, it automatically refreshed; you might have to refresh the page.

Search Informatica for text in SQL override

Is there a way to search all the mappings, sessions, etc. in Informatica for a text string contained within a SQL override?
For example, suppose I know a certain stored procedure (SP_FOO) is being called somewhere in an INFA process, but I don't know where exactly. Somewhere, I think, there is a Post SQL on a source or target calling it. Could I search all the sessions for Post SQL containing SP_FOO? (Similar to what I could do with grep on source code.)
You can use repository queries against the REPO tables (if you have enough access) to get data related to all the mappings, transformations, sessions, etc.
Please use the link below; it covers almost every kind of repository query, and your answer can be found there.
https://uisapp2.iu.edu/confluence-prd/display/EDW/Querying+PowerCenter+data
select *--distinct sbj.SUBJECT_AREA,m.PARENT_MAPPING_NAME
from REP_SUBJECT sbj,REP_ALL_MAPPINGS m,REP_WIDGET_INST w,REP_WIDGET_ATTR wa
where sbj.SUBJECT_ID = m.SUBJECT_ID AND
m.MAPPING_ID = w.MAPPING_ID AND
w.WIDGET_ID = wa.WIDGET_ID
and sbj.SUBJECT_AREA in ('TLR','PPM_PNLST_WEB','PPM_CURRENCY','OLA','ODS','MMS','IT_METRIC','E_CONSENT','EDW','EDD','EDC','ABS')
and (UPPER(ATTR_VALUE) like '%PSA_CONTACT_EVENT%'
-- or UPPER(ATTR_VALUE) like '%PSA_MEMBER_CHARACTERISTIC%'
-- or UPPER(ATTR_VALUE) like '%PSA_REPORTING_HH_CHRSTC%'
-- or UPPER(ATTR_VALUE) like '%PSA_REPORTING_MEMBER_CHRSTC%'
)
--and m.PARENT_MAPPING_NAME like '%ARM%'
order by 1
Please let me know if you have any issues.
Another less scientific way to do this is to export the workflow(s) as XML and use a text editor to search through them for the stored procedure name.
If you have read access to the schema where the Informatica repository resides, try this.
SELECT DISTINCT f.subj_name folder, e.mapping_name, object_type_name,
b.instance_name, a.attr_value
FROM opb_widget_attr a,
opb_widget_inst b,
opb_object_type c,
opb_attr d,
opb_mapping e,
opb_subject f
WHERE a.widget_id = b.widget_id
AND b.widget_type = c.object_type_id
AND ( object_type_name = 'Source Qualifier'
OR object_type_name LIKE '%Lookup%'
)
AND a.widget_id = b.widget_id
AND a.attr_id = d.attr_id
AND c.object_type_id = d.object_type_id
AND attr_name IN ('Sql Query')--, 'Lookup Sql Override')
AND b.mapping_id = e.mapping_id
AND e.subject_id = f.subj_id
AND a.attr_value is not null
--AND UPPER (a.attr_value) LIKE UPPER ('%currency%')
Yes. There is a small Java-based tool called Informatica Meta Query.
Using that tool, you can search for any information that is present in the Informatica metadata tables.
If you cannot find that tool, you can write queries directly against the Informatica metadata tables to get the required information.
Adding a few more lines to the solutions provided by Data Origin and Sandeep.
It is highly advised not to query the repository tables directly. Rather, you can create synonyms or views and then query those objects to avoid any damage to the repository tables.
In our dev/prod environments, application programmers are not granted any direct access to the repository tables.
As querying the Informatica database isn't the best idea, I would suggest exporting all the workflows in your folder to XML using Repository Manager. From Repository Manager you can select them all and export them at once. Then write a Java program to search for the pattern in the XMLs you have.
I have written a sample program here; please modify it as per your requirement:
Make a spec file with the workflow names (specFileName).
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

public class SearchWorkflowXml {
    public static void main(String[] args) {
        String specFileName = "specfile.txt";   // the spec/exported XML file to search
        String textToSearch = "<YourString>";   // e.g. the stored procedure name, SP_FOO
        try {
            File inFile = new File(specFileName);
            BufferedReader reader = new BufferedReader(new FileReader(inFile));
            String currentLine;
            while ((currentLine = reader.readLine()) != null) {
                // trim the line before comparing with the search string
                String trimmedLine = currentLine.trim();
                if (trimmedLine.contains(textToSearch)) {
                    System.out.println(specFileName); // print the matching file's name
                }
            }
            reader.close();
        } catch (IOException ex) {
            System.out.println("Error reading file '" + specFileName + "'");
        }
    }
}