I have been trying to import a sample NoSQL database into GCP Datastore. When the data is stored in GCS, Datastore asks for it in a specific format/extension, i.e.
.overall_export_metadata.
I don't believe there are any existing tools that can just import a CSV into Datastore. You could write a Google Dataflow job to do this.
https://beam.apache.org/documentation/programming-guide/
https://cloud.google.com/dataflow/docs/quickstarts
It does look like they provide a template-based job that takes in a JSON file and writes it to Datastore:
https://cloud.google.com/dataflow/docs/guides/templates/provided-batch#gcstexttodatastore
JSON format:
https://cloud.google.com/datastore/docs/reference/data/rest/v1/Entity
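For reference, a provided template like this can also be launched programmatically. Below is a rough Python sketch using the Dataflow templates API via the Google API discovery client; the template path and parameter names (textReadPattern, datastoreWriteProjectId, errorWritePath) reflect my reading of the linked template docs, so treat them as assumptions and verify them there. All project, bucket, and job names are placeholders.

from googleapiclient.discovery import build

# Launch the provided "GCS Text to Datastore" template (hypothetical values).
dataflow = build('dataflow', 'v1b3')
request = dataflow.projects().templates().launch(
    projectId='my-project',
    gcsPath='gs://dataflow-templates/latest/GCS_Text_to_Datastore',
    body={
        'jobName': 'json-to-datastore-import',
        'parameters': {
            'textReadPattern': 'gs://my-bucket/entities/*.json',       # input JSON files
            'datastoreWriteProjectId': 'my-project',                   # target Datastore project
            'errorWritePath': 'gs://my-bucket/errors/',                # where bad records go
        },
    },
)
response = request.execute()
print(response['job']['id'])  # the launched Dataflow job ID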
I have built a Django web app. It has an SQL database. I would like to analyze this data and share the analysis on an online Jupyter notebook platform such as Kaggle.
I have already deployed to Google App Engine with an SQL instance, but I don't know how to view this SQL instance's tables in Kaggle. There is an option to view BigQuery databases in Kaggle, but I don't know how to get the data from my SQL instance into BigQuery.
To be able to access the data with Kaggle, you would need to import the data from the Cloud SQL instance into BigQuery.
There are currently several options for importing data into BigQuery; the best choice depends on what type of analysis you want to do with it.
If you just want to import the data from the Cloud SQL instance into BigQuery, the easiest way is to first export the data in CSV format and then import the CSV file into BigQuery.
If you are working with a large database, you can also do it programmatically using the Client Libraries.
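As a minimal sketch of the client-library route (assuming the CSV has already been exported from the Cloud SQL instance into a GCS bucket), the google-cloud-bigquery library can load it directly; the project, bucket, dataset, and table names below are placeholders.

from google.cloud import bigquery

client = bigquery.Client(project='my-project')
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row of the exported CSV
    autodetect=True,       # let BigQuery infer the schema
)
load_job = client.load_table_from_uri(
    'gs://my-bucket/cloudsql-export.csv',   # CSV exported from Cloud SQL
    'my-project.my_dataset.my_table',       # destination table
    job_config=job_config,
)
load_job.result()  # block until the load job finishes

A script like this can then be scheduled (cron, Cloud Scheduler, etc.) to keep the BigQuery copy refreshed.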
I am trying to figure out how to import Google Analytics data into AWS Redshift. Until now I have been able to set up an export job so the data makes it to Google's BigQuery, and then to export the tables to Google Cloud Storage.
BigQuery stores data in a particular way, so when you export it to a file, it gives you a multilevel nested JSON structure. So, in order to import it into Redshift, I would have to "explode" that JSON into a table or CSV file.
I haven't been able to find a simple solution to do this.
Does anyone know how I can do this in an elegant and efficient way, instead of having to write a long function that will go through the whole JSON object?
Here's Google's documentation on how to export data: https://cloud.google.com/bigquery/docs/exporting-data
You can try the following:
Export your BigQuery data as JSON into an S3 bucket
Create a JSONPaths file according to the specification
Include the JSONPaths file in your COPY command to import the data into Redshift
You may also try exporting your BigQuery table as Avro (one of the supported export file formats in BigQuery) instead of JSON. This link has an example of how to write the JSONPaths file for nested Avro objects.
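As a rough illustration of the JSONPaths/COPY steps above, here is a hypothetical sketch that runs the COPY from Python with psycopg2. The table, bucket, IAM role, cluster host, and the JSON path expressions (e.g. $.visitId, $.device.browser) are placeholders, not values from your actual export; check the Redshift COPY documentation for the exact options you need.

import psycopg2

# Example contents of s3://my-bucket/jsonpaths/sessions_jsonpaths.json:
# {"jsonpaths": ["$.visitId", "$.device.browser", "$.totals.pageviews"]}

copy_sql = """
    COPY analytics.sessions
    FROM 's3://my-bucket/bq-export/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    JSON 's3://my-bucket/jsonpaths/sessions_jsonpaths.json'
    GZIP;                                -- drop GZIP if the export is uncompressed
"""

conn = psycopg2.connect(
    host='my-cluster.abc123.us-east-1.redshift.amazonaws.com',
    port=5439, dbname='analytics', user='etl_user', password='...')
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)   # the connection context manager commits on success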
Can anyone let us know whether, using Dataflow in Google Cloud Platform, we can load a CSV file from GCS to BigQuery without any transformations?
Just a simple load from GCS to BigQuery using Dataflow in Python. If yes, can you provide us...
Unfortunately, this is not possible without at least one transform. The bare minimum transform would be to convert a single row of the CSV (a string) into a Python dictionary or a TableRow (from the BigQuery API), which is required for writing to BigQuery via BigQuerySink.
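To make that concrete, here is a minimal sketch of such a pipeline using the Beam Python SDK with the newer WriteToBigQuery transform (rather than BigQuerySink); the bucket, table, schema, and CSV column layout are made-up placeholders.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def csv_line_to_dict(line):
    # Assumes a fixed column order and no quoted commas in the data.
    user_id, name, score = line.split(',')
    return {'user_id': int(user_id), 'name': name, 'score': float(score)}

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | 'ReadCSV' >> beam.io.ReadFromText('gs://my-bucket/input.csv',
                                         skip_header_lines=1)
     | 'ToDict' >> beam.Map(csv_line_to_dict)      # the one required transform
     | 'WriteBQ' >> beam.io.WriteToBigQuery(
           'my-project:my_dataset.my_table',
           schema='user_id:INTEGER,name:STRING,score:FLOAT',
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))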
Alternatively, you can simply use the bq command line tool to upload a CSV to BigQuery. This is much simpler and can be scheduled by any cron-like application. Note: This solution has different billing implications.
bq load --source_format=CSV <destination_table> <data_source_uri> [<table_schema>]
bq command line tool reference: https://cloud.google.com/bigquery/bq-command-line-tool
There's an Excel file, testFile.xlsx, which looks like this:
ID  ENTITY                                      STATE
1   Montgomery County Muni Utility Dist No.39   TX
2   State of Washington                         WA
3   Waterloo CUSD 5                             IL
4   Staunton CUSD 6                             IL
5   Berea City SD                               OH
6   City of Coshocton                           OH
Now I want to import the data into the AWS Glue database. A crawler has been created in AWS Glue, but there's nothing in the table in the AWS Glue database after running it. I guess it's an issue with the classifier in AWS Glue, but I have no idea how to create a proper classifier to successfully import the data in the Excel file into the AWS Glue database. Thanks for any answers or advice.
I'm afraid Glue crawlers have no classifier for MS Excel files (.xlsx or .xls). Here you can find the list of supported formats and built-in classifiers. It would probably be better to convert the files to CSV or some other supported format before exporting them to the AWS Glue Catalog.
Glue crawlers don't support MS Excel files.
If you want to create a table for the Excel file, you have to convert it from Excel to CSV/JSON/Parquet first and then run the crawler on the newly created file.
You can convert it easily using pandas.
Create a normal Python job and read the Excel file:
import pandas as pd
df = pd.read_excel('yourFile.xlsx', 'SheetName', dtype=str, index_col=None)
df.to_csv('yourFile.csv', encoding='utf-8', index=False)
This will convert your file to CSV. Then run the crawler over this file and your table will be created.
Hope it helps.
When you say that "there's nothing in the table in AWS Glue database after running the crawler" are you saying that in the Glue UI, you are clicking on Databases, then the database name, then on "Tables in xxx", and nothing is showing up?
The second part of your question seems to indicate that you are looking for Glue to import the actual data rows of your file into the Glue database. Is that correct? The Glue database does not store data rows, just the schema information about the files. You will need to use a Glue ETL job, or Athena, or Hive to actually move the data from the data file into something like MySQL.
You should write a script (most likely a Python shell job in Glue) to convert the Excel file to CSV and then run the crawler over it.
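A hypothetical sketch of such a job, assuming the Excel file lives in S3 and the converted CSV should be written back to S3 for the crawler to pick up (bucket and key names are placeholders; pandas needs an Excel engine such as openpyxl available in the job):

import io
import boto3
import pandas as pd

s3 = boto3.client('s3')

# Read the Excel file from S3 into a DataFrame.
obj = s3.get_object(Bucket='my-bucket', Key='raw/testFile.xlsx')
df = pd.read_excel(io.BytesIO(obj['Body'].read()), dtype=str)

# Write it back to S3 as CSV so a crawler can catalog it.
csv_buffer = io.StringIO()
df.to_csv(csv_buffer, index=False)
s3.put_object(Bucket='my-bucket', Key='converted/testFile.csv',
              Body=csv_buffer.getvalue().encode('utf-8'))

Point the crawler at the converted/ prefix rather than the folder containing the original .xlsx file.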
I have several Django (Python) based back-end web applications from which I would like to start piping data into Google BigQuery in an automated fashion. The relational database on the back end is MySQL; these applications are not public facing and are not in Google App Engine.
We already have Google Apps for Business along with a Google BigQuery project set up. With that said, I can manually dump tables to CSV and import them into BigQuery, but are there any best practices for automating this kind of data delivery into Google? I've pored over the documentation and don't really see any definitive writing on this matter.
Any advice would be appreciated.
Thanks for reading
Recently WePay started a series of articles on how they use BigQuery to run their analytics. Their second article highlights how they use Apache Airflow to move data from MySQL to BigQuery:
https://wecode.wepay.com/posts/airflow-wepay
As they mention "We have only a single config-driven ETL DAG file. It dynamically generates over 200 DAGs", and "The most important part is the select block. This defines which columns we pull from MySQL and load into BigQuery".
See the article for more details.
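For a rough idea of what that pattern looks like, here is a hypothetical minimal DAG using the transfer operators from the Airflow Google provider. This is not WePay's actual DAG: the connection IDs, bucket, table, and query are placeholders, and the operator parameters should be checked against the provider documentation for your Airflow version.

from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.transfers.mysql_to_gcs import MySQLToGCSOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG('mysql_to_bigquery', start_date=datetime(2024, 1, 1),
         schedule_interval='@daily', catchup=False) as dag:

    # Dump a MySQL query result to GCS as newline-delimited JSON.
    extract = MySQLToGCSOperator(
        task_id='mysql_to_gcs',
        mysql_conn_id='my_mysql',
        sql='SELECT id, name, updated_at FROM users',
        bucket='my-bucket',
        filename='exports/users.json',
        export_format='json',
    )

    # Load the exported files from GCS into a BigQuery table.
    load = GCSToBigQueryOperator(
        task_id='gcs_to_bq',
        bucket='my-bucket',
        source_objects=['exports/users*.json'],
        destination_project_dataset_table='my-project.my_dataset.users',
        source_format='NEWLINE_DELIMITED_JSON',
        write_disposition='WRITE_TRUNCATE',
        autodetect=True,
    )

    extract >> load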
You can use Python scripts ("robots") that run on Linux and are scheduled with crontab.
For loading into Google Cloud Platform BigQuery, I use the pandas_gbq.to_gbq function:
Create your dataframe (df) according to this or this
In order to get the token.json file:
Create a Google Cloud Platform BigQuery service account.
Load the JSON file:
from google.oauth2 import service_account
import os
import pandas as pd
import pandas_gbq

# Path to the service account key file, stored next to this script
DIR = os.path.dirname(os.path.realpath(__file__))
TOKEN_AUTH = DIR + '/token.json'
CREDENTIALS = service_account.Credentials.from_service_account_file(TOKEN_AUTH)

# df is a pandas dataframe
pandas_gbq.to_gbq(df, '<dataset>.<table_name>', project_id='<project_id>',
                  if_exists='replace',  # or 'append'
                  credentials=CREDENTIALS)
Once you have created your token, set up crontab on Linux and schedule your load-robot task:
Using crontab to execute script every minute and another every 24 hours
Finally, you can also use Apache Airflow (for advanced users with Docker skills)