I am not sure if this is the correct forum, but any help on this would be really appreciated.
Can someone please share some links/references which could help me in analyzing the feasibility of database migration from IBM Netezza to Amazon Redshift?
Kamlesh,
There are a lot of similarities between the two technologies: IBM PureData (Netezza) and AWS Redshift.
Some developers who worked on the first version of Netezza also worked on the first version of ParAccel DB. AWS Redshift utilizes the same core engine as ParAccel DB. ParAccel has been sold and the product has been re-branded as Actian Matrix. Still, the core engine is the same.
Both databases are MPP implementations with a shared-nothing architecture. Both share a PostgreSQL "heritage". AWS Redshift is a truly "columnar" database, while Netezza is not.
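The columnar point is worth a concrete illustration. Below is a minimal, purely illustrative Python sketch (not Redshift internals; all data invented) of why a column-oriented layout helps analytic aggregates: summing one column touches only that column's data instead of every full record.

```python
# Illustrative sketch (not Redshift internals): row-oriented vs
# column-oriented storage for an analytic aggregate.

rows = [  # row store: each record is kept together
    {"id": 1, "region": "EU", "sales": 100.0},
    {"id": 2, "region": "US", "sales": 250.0},
    {"id": 3, "region": "EU", "sales": 75.0},
]

# column store: each column is kept together, so an aggregate over one
# column reads only that column's values
columns = {
    "id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "sales": [100.0, 250.0, 75.0],
}

row_total = sum(r["sales"] for r in rows)  # must scan whole records
col_total = sum(columns["sales"])          # scans one column only
assert row_total == col_total == 425.0
```

On disk the difference is even larger: a columnar engine can skip and compress the untouched columns entirely.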
There are a few differences in SQL syntax and also some differences in functionality. There are several features/capabilities that AWS Redshift does not yet have. Among the most noteworthy differences: Redshift does not support stored procedures, user-defined functions, or sequences.
Amazon lists the differences between AWS Redshift and PostgreSQL in this document. While this is not a comparison between Netezza and Redshift, it will give you a good idea of what to expect in terms of differences, since both Netezza and Redshift were originally based on PostgreSQL.
Recently I cleared the Google Cloud PCA exam, but I want to clarify one question that I have a doubt about.
" You are tasked with building online analytical processing (OLAP) marketing analytics and reporting tools. This requires a relational database that can operate on hundreds of terabytes of data. What is the Google-recommended tool for such applications?"
What is the answer? Is it BigQuery or Cloud Spanner? There are two parts to the question: if we consider the OLAP part, it should be BigQuery, and for the second part, the relational database, it should be Cloud Spanner.
Appreciate it if I can have some clarification.
Thanks
For Online Analytical Processing (OLAP) databases, consider using BigQuery.
When performing OLAP operations on normalized tables, multiple tables have to be JOINed to perform the required aggregations. JOINs are possible with BigQuery and sometimes recommended on small tables.
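As a rough illustration of the join-then-aggregate pattern described above, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in. The table and column names are invented, and BigQuery's own SQL dialect differs in details, but the shape of the query is the same.

```python
import sqlite3

# Toy example of joining normalized tables and aggregating, the core
# OLAP pattern; sqlite3 stands in for BigQuery here.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (customer_id INTEGER, region TEXT);
    INSERT INTO orders VALUES (1, 10, 100.0), (2, 11, 250.0), (3, 10, 75.0);
    INSERT INTO customers VALUES (10, 'EU'), (11, 'US');
""")

# JOIN the normalized tables, then aggregate per region
result = con.execute("""
    SELECT c.region, SUM(o.amount)
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()
print(result)  # [('EU', 175.0), ('US', 250.0)]
```

At BigQuery scale, the usual advice is to denormalize large fact tables and reserve joins like this for smaller dimension tables.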
You can check this documentation for further information.
BigQuery for OLAP and Google Cloud Spanner for OLTP.
Please check this other page for more information about it.
I agree that the question is confusing.
But according to the official documentation:
Other storage and database options
If you need interactive querying in an online analytical processing
(OLAP) system, consider BigQuery.
However, BigQuery is not considered a relational database.
BigQuery does not enforce relationships between tables, but you can join them freely.
If join performance falls off, cluster or partition the tables on the joining fields.
Some more literature, if anyone wants to go into the details.
By using MapReduce, enterprises can cost-effectively apply parallel
data processing on their Big Data in a highly scalable manner, without
bearing the burden of designing a large distributed computing cluster
from scratch or purchasing expensive high-end relational database
solutions or appliances.
https://cloud.google.com/files/BigQueryTechnicalWP.pdf
Hence, BigQuery.
I am a novice in the GCP stack, so I am confused about the number of GCP technologies for storing data:
https://cloud.google.com/products/storage
Although Google Cloud Spanner is not mentioned in the article above, I know that it exists and is used for data storage: https://cloud.google.com/spanner
From my current view I don't see any significant difference between Cloud SQL (with Postgres under the hood) and Cloud Spanner. I found that Spanner has slightly different syntax, but that doesn't tell me when I should prefer this technology to Cloud SQL.
Could you please explain it ?
P.S.
I consider Cloud SQL a traditional database with automatic replication and horizontal scalability managed by Google.
There is not a big difference between them in terms of what they do (storing data in tables). The difference is in how they handle the data at small and large scale.
Cloud Spanner is used when you need to handle massive amounts of data with a high level of consistency and a large volume of reads and writes (100,000+ reads/writes per second). Spanner gives much better scalability and better SLOs.
On the other hand, Spanner is also much more expensive than Cloud SQL.
If you just want to store some data of your customer in a cheap way but still don't want to face server configuration Cloud SQL is the right choice.
If you are planning to create a big product or if you want to be ready for a huge increase in users for your application (viral games/applications) Spanner is the right product.
You can find detailed information about Cloud Spanner in this official paper
The main difference between Cloud Spanner and Cloud SQL is horizontal scalability plus global availability of data beyond 10 TB.
Spanner isn't for generic SQL needs; Spanner is best used for massive-scale opportunities: thousands of writes per second and tens to hundreds of thousands of reads per second, globally.
Volume like that is extremely difficult to achieve with a normal SQL database such as MySQL without doing complex sharding of the database. Spanner deals with all of this and still allows ACID updates (which is basically impossible with sharded databases). It accomplishes this with super-accurate clocks to manage conflicts.
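To make the sharding pain concrete, here is a toy Python sketch (invented names; in-memory dicts standing in for separate MySQL servers) of the routing layer an application must maintain once a database is manually sharded:

```python
# Toy sketch of manual sharding: the application routes every read and
# write to the right shard; cross-shard transactions are where ACID
# guarantees break down.
NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]  # stand-ins for 4 DB servers

def shard_for(key: str) -> int:
    # Hash-based routing; note that changing NUM_SHARDS (resharding)
    # changes the mapping for almost every key.
    return hash(key) % NUM_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)].get(key)

put("user:alice", {"balance": 100})
put("user:bob", {"balance": 50})
# A transfer between alice and bob may span two shards; making that
# update atomic and consistent is exactly the hard part Spanner solves.
assert get("user:alice")["balance"] == 100
assert get("user:bob")["balance"] == 50
```

The routing itself is the easy part; resharding and multi-shard transactions are what make hand-rolled sharding so painful.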
In short, Spanner is not for CRM databases, it is more for supermassive global data within an organisation. And since Spanner is a bit expensive (compared to cloud SQL), the project should be large enough to justify the additional cost of Spanner.
You can also follow this discussion on Reddit (a good one!): https://www.reddit.com/r/googlecloud/comments/93bxf6/cloud_spanner_vs_cloud_sql/e3cof2r/
Previous answers are correct, the main advantages of Spanner are scalability and availability. While you can scale with Cloud SQL, there is an upper bound to write throughput unless you shard -- which, depending on your use case, can be a major challenge. Dealing with sharded SQL was the big problem that Spanner solved within Google.
I would add to the previous answers that Cloud SQL provides managed instances of MySQL or PostgreSQL or SQL Server, with the corresponding support for SQL. If you're migrating from a MySQL database in a different location, not having to change your queries can be a huge plus.
Spanner has its own SQL dialect, although recently support for a subset of the PostgreSQL dialect was added.
I am just curious to know: when you run a query in Redshift Spectrum, what's happening under the hood?
Is it running a Spark job, a MapReduce job, Presto, or something totally different?
I'm not positive, but I believe Spectrum was based on Apache Impala. At this point I'd imagine the code is hardly recognizable as Impala, though; Spectrum has been under active development for quite a while now.
I'm guessing Impala based on Spectrum's capabilities and behavior as they relate to timestamps in Parquet files.
I have a choice between HBase and Cassandra. I will be writing MapReduce tasks to process the data.
So which will be the better choice, HBase or Cassandra? And which will be better to use with Hive and Pig?
I have used both. I am not sure what @Tariq means by "modified without cluster restart", as I don't restart the cluster when I modify Cassandra schemas. I have not used Pig and Hive, but from what I understand they just sit on map/reduce, and I have used the map/reduce Cassandra adapter, which works great. We also know people who have used PlayOrm with map/reduce a bit as well; PlayOrm does not yet have the HBase provider written. They have Cassandra and MongoDB right now, so you can write your one client and it works on either database. Of course, for specific features of each NoSQL store, you can get the driver and talk directly to the store instead of going through PlayOrm, but many features are very similar between NoSQL stores.
I would suggest HBase, as it has native MapReduce support and runs seamlessly on top of your existing Hadoop cluster. Also, a simpler schema that can be modified without a cluster restart is a big plus. It provides easy integration with Pig and Hive as well.
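Since both answers lean on map/reduce, a toy sketch of the model may help. This is plain Python, not Hadoop; it only shows the map, shuffle, and reduce phases that Pig and Hive scripts ultimately compile down to, using invented sample records.

```python
from collections import defaultdict
from itertools import chain

# Toy map/reduce: count word occurrences across records, the classic
# shape of the jobs Pig and Hive generate over HBase/Cassandra rows.
records = [("row1", "web click"), ("row2", "web buy"), ("row3", "app click")]

def map_phase(key, value):
    for word in value.split():
        yield (word, 1)  # emit intermediate key/value pairs

def reduce_phase(word, counts):
    return (word, sum(counts))

# shuffle: group the intermediate pairs by key
grouped = defaultdict(list)
for k, v in chain.from_iterable(map_phase(k, v) for k, v in records):
    grouped[k].append(v)

result = dict(reduce_phase(k, vs) for k, vs in grouped.items())
print(result)  # {'web': 2, 'click': 2, 'buy': 1, 'app': 1}
```

"Native MR support" in HBase means its rows can feed the map phase directly, with the shuffle and reduce distributed across the Hadoop cluster.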
I am trying to understand how Django and App Engine work together.
First question: are they a good team?
Reports of experience, and of what is possible and what is not, would be great.
I also read that some modules like auth and admin won't work.
But the article is rather old, so maybe there is an update.
And in that tutorial one has to import bforms.
What is that?
A Django module? App Engine? Python? Bigtable?
How is Bigtable different from regular SQL databases like MySQL?
Thanks
Regular SQL databases like MySQL are designed to run on a single computer and struggle in cloud computing, where you may need 1,000 computers for one database. Thus the next generation of databases, like Bigtable, were created to distribute the data over many database servers. They are called NoSQL databases, for "Not Only SQL." See http://nosql-database.org/ for a list of NoSQL databases. Google App Engine lets you use the Bigtable structure, so your data is distributed over many database servers in the cloud. So does Amazon's SimpleDB.
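As a rough sketch of the distribution idea, here is a toy Python example (invented split points; dicts standing in for servers) of how a Bigtable-style store assigns sorted row-key ranges, its "tablets", to different machines:

```python
import bisect

# Toy sketch of Bigtable-style distribution: rows are sorted by key and
# split into contiguous ranges ("tablets"), each served elsewhere.
split_points = ["g", "n", "t"]  # 4 tablets: [..g), [g..n), [n..t), [t..]
tablets = [dict() for _ in range(len(split_points) + 1)]

def tablet_for(row_key: str) -> int:
    # Binary search over split points picks the serving tablet.
    return bisect.bisect_right(split_points, row_key)

def write(row_key, value):
    tablets[tablet_for(row_key)][row_key] = value

for user in ["alice", "bob", "mallory", "zoe"]:
    write(user, {"user": user})

assert "alice" in tablets[0]    # 'alice' < 'g'
assert "mallory" in tablets[1]  # 'g' <= 'mallory' < 'n'
assert "zoe" in tablets[3]      # 'zoe' >= 't'
```

Because rows stay sorted by key, range scans hit only a few tablets, while the overall dataset can grow by just adding servers and splitting tablets.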