Important concepts of BigQuery DWH - google-cloud-platform

I wanted to know which concepts/topics I need to learn in order to work on a BigQuery DWH project. Along with BigQuery, what other programming languages do I need to get acquainted with or become proficient in (like Python)? I am currently working as a data engineer with SSIS, Informatica, and Power BI skills, along with strong SQL. Please give your valuable suggestions.
Thanks,
Ven.

BigQuery has an SQL interface so if you don't already know SQL, learn it.
See the query reference.
Also, you can interact with BigQuery from Bash using the bq CLI, provided as a Google Cloud component of the gcloud CLI, or with Python, Go, Java, Node.js... (choose your favorite).
Actually, if you are not planning a long-term project or aiming to become a BigQuery expert, the more complex concepts are not needed. In case you want to know more about it, I'll link a pretty interesting blog.
To sum up:
Learn SQL
Take into account that BigQuery is optimized for reading and analysis; it is not a conventional transactional database (don't overdo writes)
Most common languages have a BigQuery client library, so you don't need to learn any new language (see the short sketch below).
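For example, here is a minimal sketch with the Python client library; the project ID is a placeholder and the query runs against a public dataset:

```python
# Minimal sketch: running a query with the google-cloud-bigquery client.
# Install with: pip install google-cloud-bigquery
# The project ID below is a placeholder.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

sql = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""

# Submit the query job and wait for the results.
for row in client.query(sql).result():
    print(row.name, row.total)
```

The same query could also be run from Bash with `bq query`, since both just submit a query job to the service.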

Related

Connecting data from BigQuery to Cloud Functions to perform NLP

I wish to perform sentiment analysis using the Google Natural Language API.
I found documentation that performs sentiment analysis directly on a file located in Cloud Storage: https://cloud.google.com/natural-language/docs/analyzing-sentiment#language-sentiment-string-python.
However, the data I am working on is located in BigQuery instead. I am wondering how I can read the data directly from a BigQuery table to do the sentiment analysis.
An example of the BigQuery table schema:
I wish to do NLP on the tweet column of the table.
I tried to search for documentation on this but couldn't seem to find anything.
I would appreciate any help or references. Thank you.
You can take a look at BigQuery Remote Functions, which provide a direct integration with Cloud Functions and Cloud Run. The columns returned from BigQuery SQL can be passed to the remote function, and custom code can be executed as per your requirements. Please note that Remote Functions are still in preview and might not be suitable for production systems.
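As a rough sketch, a remote function endpoint is just an HTTP service that receives a JSON payload with a `calls` array and must return a `replies` array of the same length. The function name below is hypothetical, and you should verify the request/response contract against the Remote Functions documentation:

```python
# Hypothetical Cloud Function acting as a BigQuery remote function endpoint.
# BigQuery sends a JSON body whose "calls" field is a list of argument lists;
# the response must be JSON with a "replies" list of the same length.
import json
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()

def analyze_sentiment(request):  # HTTP Cloud Function entry point
    calls = request.get_json()["calls"]
    replies = []
    for (tweet,) in calls:  # one argument per call: the tweet text
        document = language_v1.Document(
            content=tweet, type_=language_v1.Document.Type.PLAIN_TEXT
        )
        sentiment = client.analyze_sentiment(
            request={"document": document}
        ).document_sentiment
        replies.append(sentiment.score)
    return json.dumps({"replies": replies})
```

On the BigQuery side, you would then create the remote function with a CREATE FUNCTION ... REMOTE WITH CONNECTION ... statement pointing at this endpoint and call it over the tweet column in regular SQL.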
This should be fairly straightforward to do with Dataflow - you could write a pipeline that reads from BigQuery followed by a DoFn that uses Google's NLP Libraries, and then writes the results to BigQuery.
Some wrappers are already provided for you in https://github.com/apache/beam/blob/master/sdks/python/apache_beam/ml/gcp/naturallanguageml.py
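If you go the hand-rolled route instead of those wrappers, a minimal sketch of such a pipeline could look like the following; the project, dataset, and table names are placeholders, and on Dataflow you would also need the usual pipeline options (project, region, temporary GCS location):

```python
# Hypothetical Beam pipeline: read tweets from BigQuery, score sentiment,
# write the results back to BigQuery. Table and project names are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class ScoreSentiment(beam.DoFn):
    def setup(self):
        from google.cloud import language_v1
        self._language = language_v1
        self._client = language_v1.LanguageServiceClient()

    def process(self, row):
        doc = self._language.Document(
            content=row["tweet"],
            type_=self._language.Document.Type.PLAIN_TEXT,
        )
        sentiment = self._client.analyze_sentiment(
            request={"document": doc}
        ).document_sentiment
        yield {"tweet": row["tweet"], "score": sentiment.score}

with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | beam.io.ReadFromBigQuery(
            query="SELECT tweet FROM `my-project.my_dataset.tweets`",
            use_standard_sql=True,
        )
        | beam.ParDo(ScoreSentiment())
        | beam.io.WriteToBigQuery(
            "my-project:my_dataset.tweet_sentiment",
            schema="tweet:STRING,score:FLOAT",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```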

How are you coping with BigQuery, especially when you come from a traditional RDBMS background like Oracle/MySQL?

I am new to BQ. I have a table with around 200 columns; when I wanted to get the DDL of this table, there was no ready-made option available. CTAS is not always desirable: sometimes we don't have a reference table to create with CTAS, and sometimes we just want a simple DDL statement to recreate a table.
I wanted to edit the schema of a BigQuery table with a change to a column's mode: the previous mode is NULLABLE and now it should be REQUIRED (the column has only been loaded with non-null values so far).
Looking at all these scenarios and the lengthy solutions in the Google documentation, with no direct solution in terms of SQL statements but rather API calls/UI/scripts etc., I am not impressed with BigQuery and its many limitations. And the BigQuery web UI editor is so small that you need to scroll many times to see the query as a whole, plus many other web UI issues, as you know.
Just wanted to know how you are all handling/coping with BQ.
I would like to elaborate a little bit more on @Pentium10's and @guillaume blaquiere's comments.
BigQuery is a serverless, highly scalable data warehouse that comes with a built-in query engine, which is capable of running SQL queries on terabytes of data in a matter of seconds, and petabytes in only minutes. You get this performance without having to manage any infrastructure.
BigQuery is based on Google's column-based data processing technology called Dremel and is able to run queries against up to 20 different data sources and 200 GB of data concurrently. The Prediction API allows users to create and train a model hosted within Google's system; the API recognizes historical patterns to make predictions about patterns in new data.
BigQuery is unlike anything that has been used as a big data tool before. Nothing seems to compare to the speed and the amount of data that can fit into BigQuery. Data views are possible and are recommended together with basic data visualization tools.
This product typically comes at the end of the Big Data pipeline. It is not a replacement for existing technologies; it complements them. Real-time streams representing sensor data, web server logs, or social media graphs can be ingested into BigQuery and queried in near real time. After running ETL jobs on a traditional RDBMS, the resulting data set can be stored in BigQuery. Data can also be ingested from datasets stored in Google Cloud Storage, through direct file import or through streaming inserts.
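As an illustration, here is a minimal sketch of a batch load from Cloud Storage with the Python client; the bucket, file, and table names are placeholders:

```python
# Minimal sketch: batch-loading a CSV from Cloud Storage into BigQuery.
# Bucket, file, and table names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # assume the file has a header row
    autodetect=True,       # let BigQuery infer the schema
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/events/2024-01-01.csv",
    "my-project.my_dataset.events",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
print(client.get_table("my-project.my_dataset.events").num_rows, "rows loaded")
```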
I recommend having a look at the book Google BigQuery: The Definitive Guide: Data Warehousing, Analytics, and Machine Learning at Scale, which includes a walkthrough of how to use the service and a deep dive into how it works.
On top of that, I found a really interesting Medium article for data engineers who are new to BigQuery, where you can find considerations regarding DDL, the UI, and best practices.
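On the specific DDL point, a generated CREATE TABLE statement can be pulled from the ddl column of the INFORMATION_SCHEMA.TABLES view; here is a rough sketch with the Python client, where the project, dataset, and table names are placeholders:

```python
# Rough sketch: fetching the generated DDL for a table from INFORMATION_SCHEMA.
# Project, dataset, and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

sql = """
    SELECT ddl
    FROM `my-project.my_dataset.INFORMATION_SCHEMA.TABLES`
    WHERE table_name = 'my_table'
"""

for row in client.query(sql).result():
    print(row.ddl)  # a CREATE TABLE statement you can save or edit
```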
I hope you find the above pieces of information useful.

How to import a SQL Dump File from Google Cloud Storage into Cloud SQL as a Daily Job?

I'm trying to import a SQL dump file from Google Cloud Storage into Cloud SQL (Postgres database) as a daily job.
I saw in the Google documentation for the Cloud API that there is a way to programmatically import a SQL dump file (URL: https://cloud.google.com/sql/docs/postgres/admin-api/v1beta4/instances/import#examples), but quite honestly, I'm a bit lost here. I haven't programmed against APIs before, and I think that is a major factor here.
In the documentation, I see that there's an area for an HTTP POST request, as well as code, but I'm not sure where this would go. Ideally, I'd like to use other Cloud products to make this daily job happen. Any help would be much appreciated.
(Side note:
I was looking into creating a cron job in Compute Engine for this, but I'm worried about ease of maintenance, especially since I have other jobs I want to build that are dependent on this one.
I'd read that Dataflow could help with this, but I haven't seen anything (tutorials) that suggests it can yet. I'm also fairly new to Dataflow, so that could be a factor as well. )
I would suggest using google-cloud-composer, which is essentially managed Airflow, for this. There are a lot of operators for moving files between various locations. You can find more information here.
I must warn you, though, that it is still in beta, and unlike what you would expect from a Google beta, this one is rather flaky (at least in my experience).
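As a rough illustration of what such a Composer DAG could look like, here is a sketch that triggers the Cloud SQL Admin import of a dump file from Cloud Storage. The operator comes from the Google provider package, and the instance, bucket, and database names are placeholders; check the exact operator name and importContext fields against the current provider documentation:

```python
# Hypothetical Composer/Airflow DAG: import a SQL dump from GCS into Cloud SQL daily.
# Instance, bucket, and database names are placeholders; verify the operator and
# importContext fields against the apache-airflow-providers-google documentation.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.cloud_sql import (
    CloudSQLImportInstanceOperator,
)

import_body = {
    "importContext": {
        "fileType": "SQL",
        "uri": "gs://my-bucket/dumps/daily_dump.sql",
        "database": "my_database",
    }
}

with DAG(
    dag_id="daily_cloudsql_import",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    import_dump = CloudSQLImportInstanceOperator(
        task_id="import_sql_dump",
        project_id="my-project",
        instance="my-postgres-instance",
        body=import_body,
    )
```

Other jobs that depend on this one can then be expressed as downstream tasks in the same DAG, which addresses the maintenance concern about a hand-rolled cron job.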

looking for a hosted back-end business data storage for analytics

I want a simple hosted data store licensed for business applications. I want the following features:
REST-like access for CRUD operations (primarily adding records)
private and authenticated
makes for easy integration with a front end charting client like Google Visualization Apis
easy to use and set up
what about:
* Google Fusion Tables
* Google Cloud Services
* Google BigQuery
* Google Cloud SQL
Or other non-Google products. But I am imagining a cleaner integration between Google Charts and one of their back-end data services.
Pros, Cons, Advice?
First, since this is Stack Overflow, I won't attempt to provide a judgement about "easy to use and set up" - that can be done by you reading the documentation for each product.
That being said, overall, the "right" answer really depends on what you are trying to do, and how much data you have. It also depends on what type of application you are building (this is Stack Overflow, so I am assuming you are a developer).
Relational Databases (like Google Cloud SQL) are great for maintaining transactional consistency but once your data grows massive it becomes difficult, expensive, or impossible to run analysis queries in a reasonable timeframe.
Google BigQuery is an analysis tool that allows developers to ask questions about really, really big datasets using an SQL-like language. It is 100% cloud-based and is accessed via a RESTful API - but it only allows for appending data, not changing individual records.
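To make the append-only pattern concrete, here is a small sketch of adding records from an application via the Python client's streaming inserts; the table name and row fields are placeholders:

```python
# Small sketch: appending records to a BigQuery table via streaming inserts.
# The table name and row fields are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

rows = [
    {"user_id": "u-123", "event": "signup", "ts": "2024-01-01T12:00:00Z"},
    {"user_id": "u-456", "event": "purchase", "ts": "2024-01-01T12:05:00Z"},
]

# insert_rows_json appends rows; it returns a list of per-row errors, if any.
errors = client.insert_rows_json("my-project.my_dataset.events", rows)
if errors:
    print("Insert errors:", errors)
```

The appended data can then be queried with SQL and fed to a charting front end such as the Google Visualization APIs.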

ad hoc query tool patterns

I'm looking for common patterns for implementing ad-hoc querying capabilities graphically. I've looked at SQL query builders in Access and TOAD, but I'm interested in whether anyone is aware of products that have built such a tool against a domain-specific data warehouse (e.g. clinical databases).
Thanks,
Beyond Tableau (mentioned by Arthur), I would suggest either QlikView or Spotfire, both of which allow for ad-hoc graphical querying against in-memory databases. These applications are much more powerful than something like Crystal or Jasper reports.
I have no specific answer, but there are reporting tools that do the kind of thing I think you might be interested in.
One paid one that I tried out myself and liked quite a bit was Tableau. It is paid software and the server can be expensive, but I liked the desktop app. You will have to know enough about your database to figure out how to pull the data out of it, but once that is done you can reuse it. Once you have pulled out the dataset, you can 'play' with it graphically.
You can get into more complicated reporting tools like Crystal Reports and JasperReports, and I think IBM has something that deals with 'cubes' or whatever. You can look all of that up under Business Intelligence software. (I hate that name.)
The problem with domain-specific stuff is that databases can be different. And even if you use a common vendor tool, the query tool would have to be built specifically for the db.
So maybe this doesn't answer your direct question, but I hope it is a little helpful.