How to export a Firestore data collection filtered by a field value - google-cloud-platform

I am trying to export Google Firestore data to a Cloud Storage location using the command below.
gcloud firestore export gs://<bucketName>/<folderName> --collection-ids=<collectionName>
Furthermore, I would like to filter the collection by a field value or a date range. Is it possible to do so? A Cloud Function solution is also welcome.
This is my JS solution:
const firestore = require('@google-cloud/firestore');
const client = new firestore.v1.FirestoreAdminClient();

async function exportToGCS(bucket, tables) {
  console.log('data export bucket', bucket);
  const databaseName = client.databasePath(
    process.env.GCLOUD_PROJECT,
    '(default)' // the database ID; usually '(default)'
  );
  const [response] = await client.exportDocuments({
    name: databaseName,
    outputUriPrefix: bucket,
    // Leave collectionIds empty to export all collections
    // or define a list of collection IDs:
    // collectionIds: ['users', 'posts']
    collectionIds: tables,
  });
  console.log('successfully exportToGCS', response);
}

At the time of writing, there is no such possibility within the Firestore export feature. These options come to my mind:
1) Since exports and imports are charged at the same rate as read/write operations, I would suggest running a query in Firestore instead of using the export functionality. You can run such a query with a regular Cloud Function, as you are already proposing (see the sketch after this list). Refer to this guide for getting multiple documents from a collection.
2) In case you really need the Firestore/Datastore export format, there's an option to copy all filtered data to a new collection and export from there, but you'll probably pay far more than you'd want to.
Depending on what you want to do with the data next, you can create a Cloud Function that listens for changes and copies your filtered data to a new collection, from which you can then export.
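For option 1), here is a minimal sketch of what such a Cloud Function body could look like in Node.js: it runs a filtered query and dumps the matching documents to a Cloud Storage bucket as plain JSON (not the managed export format). The field names status and createdAt, the output file name, and the helper function itself are placeholders for illustration; combining an equality filter with a range filter like this may also require a composite index.

const {Firestore} = require('@google-cloud/firestore');
const {Storage} = require('@google-cloud/storage');

const firestore = new Firestore();
const storage = new Storage();

// Hypothetical helper: query a collection with a field/date filter and
// write the matching documents to GCS as one JSON file.
async function exportFilteredCollection(bucketName, collectionName) {
  const snapshot = await firestore
    .collection(collectionName)
    .where('status', '==', 'active')                  // field value filter (placeholder field)
    .where('createdAt', '>=', new Date('2021-01-01')) // date range filter (placeholder field)
    .where('createdAt', '<', new Date('2021-02-01'))
    .get();

  const docs = snapshot.docs.map(doc => ({id: doc.id, ...doc.data()}));

  // Plain JSON dump, not the gcloud firestore export format.
  await storage
    .bucket(bucketName)
    .file(`${collectionName}-filtered.json`)
    .save(JSON.stringify(docs));

  return docs.length;
}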

Related

GCP Dataflow UDF input

I'm trying to remove Datastore entities in bulk with Dataflow and to use a JS UDF to filter the entities, as described in the docs. But this code:
function func(inJson) {
  var row = JSON.parse(inJson);
  var currentDate = new Date();
  var date = row.modifiedAt.split(' ')[0];
  return some code
}
causes
TypeError: Cannot read property "split" from undefined
The input should be a JSON string of the entity, and the entities do have a modifiedAt property.
What exactly does Dataflow pass to the UDF, and how can I log it in the Dataflow console?
Assuming that modifiedAt is a property you added, I would expect the JSON in Dataflow to match the Datastore REST API (https://cloud.google.com/datastore/docs/reference/data/rest/v1/Entity), which means you probably want row.properties.modifiedAt. You also probably want to pull out the timestampValue of the property (https://cloud.google.com/datastore/docs/reference/data/rest/v1/projects/runQuery#Value).
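Putting that together, a corrected UDF might look roughly like the sketch below. It assumes modifiedAt is stored as a timestamp property; the 30-day cutoff is just an illustrative filter, and the exact return-value semantics (what to return to keep or drop an entity) should be double-checked against the template's documentation.

function func(inJson) {
  var row = JSON.parse(inJson);

  // Entity properties are nested under row.properties, and each value is
  // wrapped in a typed field such as timestampValue (an RFC 3339 string).
  var modifiedAt = row.properties.modifiedAt.timestampValue;
  var modifiedDate = new Date(modifiedAt);

  // Example filter: only pass through entities older than 30 days.
  var cutoff = new Date();
  cutoff.setDate(cutoff.getDate() - 30);

  if (modifiedDate < cutoff) {
    return JSON.stringify(row);
  }
  // Returning nothing omits the entity from the template's output.
}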
Take a look at this newly published Google Cloud blog on how to get started with Dataflow UDFs.
With the Datastore Bulk Delete Dataflow template, the UDF input is the JSON string of the Datastore Entity. As noted by Jim, the properties are nested under the properties field per this schema.
With respect to your other question about logging, you can add the following print statement to your UDF; its output will show up in the Dataflow console under Worker Logs:
print(JSON.stringify(row, null, 2))
Alternatively, you can view the logs in Cloud Logging using the following log query:
log_id("dataflow.googleapis.com/worker")
resource.type="dataflow_step"

How to schedule an export from a BigQuery table to Cloud Storage?

I have successfully scheduled my query in BigQuery, and the result is saved as a table in my dataset. I see a lot of information about scheduling data transfers into BigQuery or Cloud Storage, but I haven't found anything yet about scheduling an export from a BigQuery table to Cloud Storage.
Is it possible to schedule an export of a BigQuery table to Cloud Storage so that I can further schedule having it SFTP-ed to me via Google BigQuery Data Transfer Services?
There isn't a managed service for scheduling BigQuery table exports, but one viable approach is to use Cloud Functions in conjunction with Cloud Scheduler.
The Cloud Function would contain the necessary code to export to Cloud Storage from the BigQuery table. There are multiple programming languages to choose from for that, such as Python, Node.js, and Go.
Cloud Scheduler would periodically send an HTTP call, on a cron schedule, to the Cloud Function, which would in turn be triggered and run the export programmatically.
As an example and more specifically, you can follow these steps:
Create a Cloud Function using Python with an HTTP trigger. To interact with BigQuery from within the code you need to use the BigQuery client library. Import it with from google.cloud import bigquery. Then, you can use the following code in main.py to create an export job from BigQuery to Cloud Storage:
# Imports the BigQuery client library
from google.cloud import bigquery

def hello_world(request):
    # Replace these values according to your project
    project_name = "YOUR_PROJECT_ID"
    bucket_name = "YOUR_BUCKET"
    dataset_name = "YOUR_DATASET"
    table_name = "YOUR_TABLE"

    destination_uri = "gs://{}/{}".format(bucket_name, "bq_export.csv.gz")
    bq_client = bigquery.Client(project=project_name)

    dataset = bq_client.dataset(dataset_name, project=project_name)
    table_to_export = dataset.table(table_name)

    job_config = bigquery.job.ExtractJobConfig()
    job_config.compression = bigquery.Compression.GZIP

    extract_job = bq_client.extract_table(
        table_to_export,
        destination_uri,
        # Location must match that of the source table.
        location="US",
        job_config=job_config,
    )

    return "Job with ID {} started exporting data from {}.{} to {}".format(
        extract_job.job_id, dataset_name, table_name, destination_uri
    )
Specify the client library dependency in the requirements.txt file by adding this line:
google-cloud-bigquery
Create a Cloud Scheduler job. Set the Frequency you wish the job to be executed with. For instance, setting it to 0 1 * * 0 would run the job once a week, at 1 AM every Sunday morning. The crontab tool is pretty useful when it comes to experimenting with cron scheduling.
Choose HTTP as the Target, set the URL to the Cloud Function's URL (it can be found by selecting the Cloud Function and navigating to the Trigger tab), and choose GET as the HTTP method.
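If you prefer the command line to the console, a roughly equivalent job can be created with gcloud (a sketch; bq-export-job is a placeholder name, the URL must be replaced with your function's trigger URL, and depending on your setup you may also need to specify a location and authentication options):

gcloud scheduler jobs create http bq-export-job \
  --schedule="0 1 * * 0" \
  --uri="https://REGION-PROJECT_ID.cloudfunctions.net/hello_world" \
  --http-method=GET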
Once it is created, you can press the RUN NOW button to test how the export behaves. However, before doing so, make sure the default App Engine service account has at least the Cloud IAM roles/storage.objectCreator role, or otherwise the operation might fail with a permission error. The default App Engine service account has the form YOUR_PROJECT_ID@appspot.gserviceaccount.com.
If you wish to execute exports on different tables, datasets, and buckets for each execution, while essentially employing the same Cloud Function, you can use the HTTP POST method instead and configure a Body containing said parameters as data, which would be passed on to the Cloud Function - although that would imply making some small changes to its code.
Lastly, when the job is created, you can use the Cloud Function's returned job ID and the bq CLI to view the status of the export job with bq show -j <job_id>.
Not sure if this was in GA when this question was asked, but at least now there is an option to run an export to Cloud Storage via a regular SQL query. See the SQL tab in Exporting table data.
Example:
EXPORT DATA
OPTIONS (
uri = 'gs://bucket/folder/*.csv',
format = 'CSV',
overwrite = true,
header = true,
field_delimiter = ';')
AS (
SELECT field1, field2
FROM mydataset.table1
ORDER BY field1
);
This could as well be trivially set up via a Scheduled Query if you need a periodic export. And, of course, you need to make sure the user or service account running this has permissions to read the source datasets and tables and to write to the destination bucket.
Hopefully this is useful for other peeps visiting this question if not for OP :)
You have an alternative to the second part of Maxim's answer. The code for extracting the table and storing it in Cloud Storage should work.
But when you schedule a query, you can also define a Pub/Sub topic where the BigQuery scheduler will post a message when the job is over. That way, the Cloud Scheduler setup described by Maxim is optional, and you can simply plug the function into the Pub/Sub notification.
Before performing the extraction, don't forget to check the error status in the Pub/Sub notification. You also get a lot of information about the scheduled query, which is useful if you want to perform more checks or generalize the function.
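As a rough sketch of that wiring in Node.js (assumptions: the Pub/Sub message carries the scheduled query's TransferRun resource as JSON, with state and errorStatus fields as documented in the Data Transfer API, and the dataset, table, and bucket names below are placeholders):

const {BigQuery} = require('@google-cloud/bigquery');
const {Storage} = require('@google-cloud/storage');

const bigquery = new BigQuery();
const storage = new Storage();

// Pub/Sub-triggered Cloud Function: runs the extract once the scheduled
// query has finished.
exports.exportOnScheduledQueryDone = async (message) => {
  const run = JSON.parse(Buffer.from(message.data, 'base64').toString());

  // Check the error status of the notification before extracting.
  if (run.errorStatus && run.errorStatus.code) {
    console.error('Scheduled query failed, skipping export:', run.errorStatus);
    return;
  }

  const destination = storage.bucket('YOUR_BUCKET').file('bq_export.csv.gz');
  await bigquery
    .dataset('YOUR_DATASET')
    .table('YOUR_TABLE')
    .extract(destination, {format: 'CSV', gzip: true});

  console.log('Export of the scheduled query output to Cloud Storage finished');
};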
One more point, about the SFTP transfer: I open-sourced a project that queries BigQuery, builds a CSV file, and transfers this file to an FTP server (SFTP and FTPS aren't supported, because my previous company only used the FTP protocol!). If your file is smaller than 1.5 GB, I can update my project to add SFTP support if you want to use this. Let me know.

Is GCP Firestore Native Mode export to BQ import supported?

I was exploring options to load Firestore Native mode data (collections and documents) into BQ, but it's not working out for me.
Question: Does BigQuery support importing an extract from a Firestore Native mode export?
Setup: 1 collection with multiple documents (no sub-collections).
Steps:
- Export to Cloud Bucket: https://firebase.google.com/docs/firestore/manage-data/export-import
- Import in BQ: https://cloud.google.com/bigquery/docs/loading-data-cloud-firestore
Error While loading in BQ: 'Does not contain valid backup metadata'
Analysis: The link mentions that the URI should have [KIND_COLLECTION_ID] and that the file should end with [KIND_COLLECTION_ID].export_metadata. But neither of these is true for a Firestore Native mode export file; they apply to a Firestore Datastore mode export.
Verify [KIND_COLLECTION_ID] is specified in your Cloud Storage URI. If you specify the URI without [KIND_COLLECTION_ID], you receive the following error: does not contain valid backup metadata. (error code: invalid)
The URI for your Cloud Firestore export file should end with [KIND_COLLECTION_ID].export_metadata. For example: default_namespace_kind_Book.export_metadata. In this example, Book is the collection ID, and default_namespace_kind_Book is the file name generated by Cloud Firestore.
When one creates an export of Firestore collections to GCS, a directory structure is created that looks like:
[Bucket]
- [Date/Time]
- [Date/Time].overall_export_metadata
- all_namespaces
- kind_[collection]
- all_namespaces_kind_[collection].export_metadata
When one imports an export into BigQuery, use the file:
[Bucket]/[Date/Time]/all_namespaces/kind_[collection]/all_namespaces_kind_[collection].export_metadata
Specifically, if one uses [Bucket]/[Date/Time]/[Date/Time].overall_export_metadata you will get the error you described. See also the note here under Console > Bullet 3 which reads:
Note: Do not use the file ending in overall_export_metadata. This file
is not usable by BigQuery.
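In other words, a load such as the following (with placeholder names, pointing at the per-kind metadata file rather than the overall one) is what works:

bq load \
  --source_format=DATASTORE_BACKUP \
  my_dataset.my_collection \
  gs://my-bucket/2021-01-01T00:00:00_12345/all_namespaces/kind_my_collection/all_namespaces_kind_my_collection.export_metadata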
If you want to create a pipeline from Firestore to BigQuery, you have to manually map the Firestore collection to a BigQuery table. I have used Cloud Scheduler, Cloud Functions, and Firestore batched operations to migrate the data from Firestore to BigQuery. I created example code here.
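As a rough illustration of that kind of manual pipeline (a sketch only, in Node.js; the collection, dataset, and table names are placeholders, the document-to-row mapping must match your own BigQuery schema, and the linked example code should be treated as the reference):

const {Firestore} = require('@google-cloud/firestore');
const {BigQuery} = require('@google-cloud/bigquery');

const firestore = new Firestore();
const bigquery = new BigQuery();

// HTTP-triggered Cloud Function (invoked by Cloud Scheduler): reads a
// collection and streams the documents into a BigQuery table.
exports.firestoreToBigQuery = async (req, res) => {
  const snapshot = await firestore.collection('my_collection').get();

  // Map each document to a flat row matching the BigQuery table schema;
  // nested fields would need to be flattened or declared as RECORD columns.
  const rows = snapshot.docs.map(doc => ({id: doc.id, ...doc.data()}));

  if (rows.length > 0) {
    await bigquery.dataset('my_dataset').table('my_table').insert(rows);
  }

  res.send(`Loaded ${rows.length} rows into BigQuery`);
};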

AWS Amplify filter for #searchable annotation

Currently I am using a DynamoDB instance for my social media application. While designing the schema I stuck to the "one table" rule, so I am putting all data in the same table: posts, users, comments, etc. Now I want to make flexible queries for my data. Here I found out that I could use the #searchable annotation to create an Elasticsearch instance for a table which is annotated with #model.
In my GraphQL schema I only have one #model, since I only have one table. My problem now is that I don't want to make everything in the table searchable, since that would most likely be very expensive. There is some data which doesn't have to be added to the Elasticsearch instance (for example, comment-related data). How could I handle this? Do I really have to split my schema into multiple tables to be able to manage the #searchable annotation? Couldn't I decide whether a row should be stored in Elasticsearch based on the partition key / primary key, acting as a filter?
The current implementation of the amplify-cli uses a predefined Python Lambda that is added once you add the #searchable directive to one of your models.
The Lambda code cannot be edited and, currently, there is no option to define a custom Lambda; you can read about it here:
https://github.com/aws-amplify/amplify-cli/issues/1113
https://github.com/aws-amplify/amplify-cli/issues/1022
If you want a custom Lambda where you can filter what goes to the Elasticsearch Instance, you can follow the steps described here https://github.com/aws-amplify/amplify-cli/issues/1113#issuecomment-476193632
The closest you can get is by creating a template in amplify\backend\api\myapiname\stacks\ where you can manage all the resources related to Elasticsearch. A good starting point is to:
Add #searchable to one of your models in schema.graphql
Run amplify api gql-compile
Copy the generated template in the build folder, \amplify\backend\api\myapiname\build\stacks\SearchableStack.json to amplify\backend\api\myapiname\stacks\
Remove the #searchable directive from the model added in step 1
Start editing your new template copied in step 3
Add a Lambda and use it in the template as the resolver for the DynamoDB Stream
Using this approach will give you total control of the resources related to the Elasticsearch service, but it will also require you to do it all on your own.
Or just create a table for each model.
Hope it helps
It is now possible to override the generated streaming function code as well.
Thanks to AWS Support for the information provided.
I left a message on the related GitHub issue as well: https://github.com/aws-amplify/amplify-category-api/issues/437#issuecomment-1351556948
All you need to do is run
amplify override api
then edit the corresponding override.ts
and change the code via resources.opensearch.OpenSearchStreamingLambdaFunction.code:
resources.opensearch.OpenSearchStreamingLambdaFunction.functionName = 'python_streaming_function';
resources.opensearch.OpenSearchStreamingLambdaFunction.handler = 'index.lambda_handler';
resources.opensearch.OpenSearchStreamingLambdaFunction.code = {
zipFile: `
# python streaming function customized code goes here
`
}
Resources:
[1] https://docs.amplify.aws/cli/graphql/override/#customize-amplify-generated-resources-for-searchable-opensearch-directive
[2]AWS::Lambda::Function Code - Properties - https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-lambda-function-code.html#aws-properties-lambda-function-code-properties

Serialize Json / Web Services to Observable Collection Model

I want to ask: I've consumed a web service API and then deserialized the result into an ObservableCollection of my Model.
My question is: how can I use this ObservableCollection everywhere, so I don't have to call/get/consume the web service every time?
So I would call the API one time and then be able to use the data every time without calling the API again?
Thanks
As @thang mentioned above, there are many ways to store the data in the app to avoid calling the web service each time.
I will suggest the way I am doing it:
1. When I retrieve the JSON data from the Web API, I parse it into an ObservableCollection:
// If the API wraps the list in a "Users" property, deserialize a wrapper type and read its Users property instead.
ObservableCollection<User> usersList = JsonConvert.DeserializeObject<ObservableCollection<User>>(responseJson);
2. Once I have my list, I can also save the serialized objects (in JSON format) to a text file (remember that JSON is nothing more than a string):
private async void saveUsersToFile(string serializedUsersListAsJson)
{
    StorageFolder storageFolder = ApplicationData.Current.LocalFolder;
    StorageFile usersFile = await storageFolder.CreateFileAsync("users.txt", CreationCollisionOption.OpenIfExists);
    await FileIO.WriteTextAsync(usersFile, serializedUsersListAsJson);
}
This step allows you to store the data even if the app is closed and relaunched.
3. When you launch the app, you can invoke the method below to read the data from the file:
private async void retrieveNotes()
{
    StorageFolder storageFolder = ApplicationData.Current.LocalFolder;
    StorageFile usersFile = await storageFolder.CreateFileAsync("users.txt", CreationCollisionOption.OpenIfExists);
    string serializedUsersList = await FileIO.ReadTextAsync(usersFile);

    // Deserialize the JSON list to the ObservableCollection:
    if (serializedUsersList != null)
    {
        var usersList = JsonConvert.DeserializeObject<ObservableCollection<User>>(serializedUsersList);
    }
}
4. The last step is to declare an ObservableCollection field in the Pages where you need to use it. For instance, if you need to pass this list between Pages, you can just use:
Frame.Navigate(typeof(MainPage), usersList);
Remember to read the data from the file once the app is launched. After that, you can just use it while the app is running. My suggestion is to cache the data each time you connect to the Web API to retrieve new data.
Hope this will help. If you want to read more about data storage, please read the post below on my blog:
https://mobileprogrammerblog.wordpress.com/2016/05/23/universal-windows-10-apps-data-storage/
To save the data for the next time the user opens the app:
Store the data in a local SQLite database, or serialize the collection to a local file to use later.
To use the data in the same session:
Store the data in a common object, and retrieve it whenever you need to initialize the ViewModel.
To use the data across Windows 10 devices:
Store the file / database in OneDrive and sync when needed.
If the data size is small and you don't have a critical need for 100% synced data, store it inside the roaming folder.