We have a dataset of ~10 million entities of a certain Kind in Datastore. We want to change the product functionality, so we would like to change the fields on all entities of that Kind.
Is there a smart/quick way to do it, that does not involve iterating over all of the entities in series?
You can probably use Dataflow to solve your problem.
Dataflow is a stream and batch data processing service, fully managed by GCP.
Its programming model was open sourced as the Apache Beam project, and Dataflow is fully compatible with the Beam SDK. This allows you to test your pipelines locally before running them on GCP.
It exposes two main concepts: the PCollection, which is basically the data being handled by the tool, and the pipeline, which describes the steps needed to read the data, the transformations that must be performed, and how and where the results should be written.
It supports Java, Python and Go, and provides a rich feature set and a variety of possible data sources and transformations.
In the specific case of Datastore, Dataflow provides support for reading, writing and deleting data. See for instance the relevant documentation for Python.
You can see a good example of how to interact with Datastore in the Apache Beam GitHub repository.
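For a rough idea, here is a minimal, untested sketch of what such a migration pipeline could look like with the Beam Python SDK. The project ID, Kind name and field names are placeholders, and the exact module paths can vary between Beam versions:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.io.gcp.datastore.v1new.datastoreio import (
        ReadFromDatastore, WriteToDatastore)
    from apache_beam.io.gcp.datastore.v1new.types import Query

    PROJECT = 'my-gcp-project'          # placeholder project ID

    def migrate(entity):
        # entity.properties is a plain dict of the Datastore properties
        props = entity.properties
        props['new_field'] = props.pop('old_field', None)   # hypothetical rename
        return entity

    with beam.Pipeline(options=PipelineOptions()) as p:
        (p
         | 'Read'   >> ReadFromDatastore(Query(kind='Product', project=PROJECT))
         | 'Update' >> beam.Map(migrate)
         | 'Write'  >> WriteToDatastore(PROJECT))

When run with the Dataflow runner, the reads and writes are parallelized across workers, so you are not iterating over the 10 million entities serially.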
These two other articles could also be interesting: 1 2.
I would presume that you have to loop through each one and update it, as it's a NoSQL data store like Mongo from what I can see. We have a system that uses SQL and Mongo, and the denormalised data is a pain; we had to write migrations that would loop through everything and update it.
Would it be wise to replace MR completely with Spark? Here are the areas where we still use MR and need your input before going ahead with the Apache Spark option:
ETL : Data validation and transformation. Sqoop and custom MR programs using MR API.
Machine Learning : Mahout algorithms to arrive at recommendations, classification and clustering
NoSQL Integration : Interfacing with NoSQL Databases using MR API
Stream Processing : We are using Apache Storm for doing stream processing in batches.
Hive Query : We are already using Tez engine for speeding up Hive queries and see 10X performance improvement when compared with MR engine
ETL - Spark needs much less boilerplate code than MR. Plus you can code in Scala, Java and Python (not to mention R, but probably not for ETL). Scala especially makes ETL easy to implement; there is less code to write.
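To give a flavour of how compact this can be, here is a small, illustrative PySpark sketch (the paths and field positions are made up) that validates and reshapes a CSV-like input in a handful of lines, where an equivalent MapReduce job would need separate Mapper, Reducer and driver classes:

    from pyspark import SparkContext

    sc = SparkContext(appName='etl-sketch')
    (sc.textFile('hdfs:///input/orders/*.csv')            # hypothetical input path
       .map(lambda line: line.split(','))
       .filter(lambda f: f[2] == 'COMPLETED')             # simple validation step
       .map(lambda f: ','.join([f[0], f[1], f[3]]))       # keep only some columns
       .saveAsTextFile('hdfs:///output/orders_clean'))    # hypothetical output path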
Machine Learning - ML is one of the reasons Spark came about. With MapReduce, the HDFS interaction makes many ML programs very slow (unless you have some HDFS caching, but I don't know much about that). Spark can run in-memory, so you can have programs build ML models with different parameters iteratively against a dataset that is kept in memory, with no file system interaction (except for the initial load).
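As a rough, untested sketch of that pattern with MLlib (the dataset path, label position and parameter values are all placeholders), the dataset is cached once and then reused for every parameter setting:

    from pyspark import SparkContext
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.classification import LogisticRegressionWithSGD

    sc = SparkContext(appName='ml-sketch')
    points = (sc.textFile('hdfs:///input/training.csv')    # hypothetical path
                .map(lambda line: [float(x) for x in line.split(',')])
                .map(lambda v: LabeledPoint(v[0], v[1:]))   # label first, then features
                .cache())                                   # keep the dataset in memory

    for step in (0.1, 0.5, 1.0):                            # parameter sweep, no re-reads
        model = LogisticRegressionWithSGD.train(points, iterations=50, step=step)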
NoSQL - There are many NoSQL data sources which can easily be plugged into Spark using SparkSQL. Just google the one you are interested in; it's probably very easy to connect.
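For example, with the spark-cassandra-connector the integration looks roughly like the sketch below (the connector package has to be on the classpath, and the keyspace and table names are placeholders); other stores follow the same read/format/options pattern:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName='nosql-sketch')
    sqlContext = SQLContext(sc)

    df = (sqlContext.read
          .format('org.apache.spark.sql.cassandra')   # data source provided by the connector
          .options(keyspace='shop', table='orders')   # placeholder keyspace/table
          .load())

    df.registerTempTable('orders')
    sqlContext.sql('SELECT status, count(*) FROM orders GROUP BY status').show()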
Stream Processing - Spark Streaming works in micro-batches, and one of the main selling points of Storm over Spark Streaming is that it does true streaming rather than micro-batches. Since you are already processing in batches, Spark Streaming should be a good fit.
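A minimal Spark Streaming sketch (the socket source and the 5-second batch interval are just illustrative) shows the micro-batch model: records arrive continuously but are processed as small batches, which matches a workload that is already batch-oriented:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName='streaming-sketch')
    ssc = StreamingContext(sc, batchDuration=5)        # 5-second micro-batches

    lines = ssc.socketTextStream('localhost', 9999)    # placeholder source
    counts = (lines.flatMap(lambda l: l.split())
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()                                    # print each micro-batch result

    ssc.start()
    ssc.awaitTermination()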
Hive Query - There is a Hive on Spark project in progress. Check the status here. It will allow Hive to execute queries on your Spark cluster and should be comparable to Hive on Tez.
If SAS is installed on a Windows OS, how can I troubleshoot or measure the performance of SAS DI jobs on Unix? Using any tools or commands, or by using nmon?
Thank you...
The performance of each job in SAS Data Integration Studio can be measured using a few techniques:
ARM logging - Enabling/utilizing the Audit and Performance Measurement capabilities with SAS (commonly referred to as ARM logging).
FULLSTIMER - You can enable the FULLSTIMER system option (for example, options fullstimer; in the autoexec.sas) for each session, which adds detailed performance statistics to the log; there are many code snippets and tools out there which can parse these and give you fancy performance stats on the jobs.
Environment Manager - Lately I have been exploring something provided with SAS 9.4 called Environment Manager. It is a daemon which can be configured to give a lot of performance stats and other information. It is web based and a very handy tool for admins.
Hope this information helps!
From what I understood, Hadoop is a distributed storage system of sorts. However, what I don't really get is: can we replace a normal RDBMS (MySQL, PostgreSQL, Oracle) with Hadoop? Or is Hadoop just another type of filesystem on which we CAN run an RDBMS?
Also, can Django be integrated with Hadoop? Usually, how do web frameworks (ASP.NET, PHP, Java (JSP, JSF, etc.)) integrate with Hadoop?
I am a bit confused with the Hadoop vs RDBMS and I would appreciate any explanation.
(Sorry, I read the documentation many times, but maybe due to my lack of knowledge in English, I find the documentation is a bit confusing most of the time)
What is Hadoop?
Imagine the following challenge: you have a lot of data, and by a lot I mean at least terabytes. You want to transform this data or extract some information and process it into a format which is indexed, compressed or "digested" in a way that lets you work with it.
Hadoop is able to parallelize such a processing job and, here comes the best part, takes care of things like redundant storage of the files, distribution of the tasks over the different machines in the cluster, etc. (Yes, you need a cluster; otherwise Hadoop is not able to compensate for the overhead of the framework.)
If you take a first look at the Hadoop ecosystem you will find three big terms: HDFS (the Hadoop filesystem), Hadoop itself (with MapReduce) and HBase (the "database", sometimes called a column store, although neither label fits exactly).
HDFS is the filesystem used by both Hadoop and HBase. It is an extra layer on top of the regular filesystem on your hosts. HDFS slices the uploaded files into chunks (usually 64 MB), keeps them available across the cluster and takes care of their replication.
When Hadoop gets a task to execute, it gets the path of the input files on HDFS, the desired output path, a Mapper and a Reducer class. The Mapper and Reducer are usually Java classes passed in a JAR file (but with Hadoop Streaming you can use any command-line tool you want). The mapper is called to process every entry (usually line by line, e.g. "return 1 if the line contains a bad F* word") of the input files; the output gets passed to the reducer, which merges the single outputs into another desired format (e.g. a sum of numbers). This is an easy way to get a "bad word" counter.
The cool thing: the computation of the mapping is done on the node that holds the data. You process the chunks locally and move just the semi-digested (usually smaller) data over the network to the reducers.
And if one of the nodes dies: there is another one with the same data.
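To make that concrete, here is a rough, untested sketch of the "bad word" counter as a Hadoop Streaming job written in Python; the word and the script names are placeholders:

    # mapper.py: emit a count of 1 for every input line containing the word
    import sys
    for line in sys.stdin:
        if 'badword' in line.lower():        # hypothetical "bad" word
            print('badword\t1')

    # reducer.py: sum up the counts emitted by the mappers
    import sys
    total = 0
    for line in sys.stdin:
        _, count = line.rstrip('\n').split('\t')
        total += int(count)
    print('badword\t%d' % total)

The two scripts would then be handed to the hadoop-streaming JAR (its exact path depends on your distribution) together with the HDFS input and output paths.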
HBase takes advantage of the distributed storage of the files and stores its tables, split up into chunks, on the cluster. Contrary to Hadoop, HBase gives you random access to the data.
As you see, HBase and Hadoop are quite different from an RDBMS. HBase also lacks a lot of RDBMS concepts. Modeling data with triggers, prepared statements, foreign keys, etc. is not what HBase was designed to do (I'm not 100% sure about this, so correct me ;-) )
Can Django be integrated with Hadoop?
For Java it's easy: Hadoop is written in Java and all the API's are there, ready to use.
For Python/Django I don't know (yet), but I'm sure you can do something with Hadoop streaming/Jython as a last resort.
I've found the following: Hadoopy and Python in Mappers and Reducers.
Hue, the web UI for Hadoop, is based on Django!
Django can connect to most RDBMSs, so you can use it with a Hadoop-based solution.
Keep in mind that Hadoop is many things, so specifically, if you want something with low latency use something like HBase; don't try to do it with Hive or Impala.
Python has a Thrift-based binding, happybase, that lets you query HBase.
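A small, illustrative happybase sketch (an HBase Thrift server must be running, and the host, table, row key and column names are placeholders):

    import happybase

    connection = happybase.Connection('hbase-thrift-host')    # placeholder host
    table = connection.table('products')                      # placeholder table

    row = table.row(b'item-42')                               # random access by row key
    print(row.get(b'info:name'))

    for key, data in table.scan(row_prefix=b'item-'):         # prefix scan
        print(key, data)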
Basic (!) example of Django integration with Hadoop
[REMOVED LINK]
I use the Oozie REST API for job execution, and 'hadoop fs -cat' for grabbing job results (due to HDFS's distributed nature). A better approach is to use something like Hoop for getting HDFS data. Anyway, this is not a simple solution.
P.S. I've refactored this code and placed it into https://github.com/Obie-Wan/django_hadoop.
Now it's a separate django app.
How does this work exactly... if I have a data mining system built in PHP, how would it work differently on MapReduce than it would on a simple server? Is it the mere fact that there's more than one server doing the processing?
If your code is made to partition work between multiple processes already, then MapReduce only adds the ability to split work among additional servers.