I intend to setup spark cluster on EC2. How much resources spark master instance actually needs? Since master is not involved in processing any of the tasks can it be the smallest EC2 instance?
This obviously depends on what kinds of jobs you're planning to run, how big is the cluster etc, so in that sense the advice to simply try different configurations is good. However, in my purely personal experience the driver instance should be at least at the level of the slave instances. This is mainly due to two reasons.
First of all, there are times when you need the result of the job in a single place. Maybe you just don't want to spend time combining files, maybe you need the results in some specific order which would be hard to achieve in a distributed way etc. but this means the driver should be able to hold all the data (as rdd.collect gathers the results to the driver instance).
Second of all, many of the shuffle-based operations seem to require a lot of memory from the driver. I'm not exactly sure about the details of why this happens (if anyone knows, please do share) but I can't count the number of times I've seen reduceyKey causing an out of memory error from the driver.
Edit: I have assumed you were using Spark's spark-ec2 script, which I believe does install the NameNode in the master instance. If the NameNode is not installed at the master intance, however, my answer has no validity as correctly pointed by #DemetriKots in the comments.
Although the master instance is not involved in data processing, it plays a major role during the management of the workload and resource allocation, e.g (all info is taken from the sources):
NameNode
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds the successful requests by returning a list of relevant DataNode servers where the data lives.
This (look for Hardware Recommendations for Hadoop on the left index) Hortonworks document specifies some recommendations for the master instance in a Hadoop cluster. While it might not be adequate for the slave instances (due to Spark's memory usage), I would say it can be useful in the case of the master instance in a Spark cluster.
Related
Recently in a system design interview I was asked a question where cities were divided into zones and data of around 100 zones was available. An api took the zoneid as input and returned all the restaurants for that zone in response. The response time for the api was 50ms so the zone data was kept in memory to avoid delays.
If the zone data is approximately 25GB, then if the service is scaled to say 5 instances, it would need 125GB ram.
Now the requirement is to run 5 instances but use only 25 GB ram with the data split between instances.
I believe to achieve this we would need a second application which would act as a config manager to manage which instance holds which zone data. The instances can get which zones to track on startup from the config manager service. But the thing I am not able to figure out is how we redirect the request for a zone to the correct instance which holds its data especially if we use kubernetes. Also if the instance holding partial data restarts then how do we track which zone data it was holding
Splitting dataset over several nodes: sounds like sharding.
In-memory: the interviewer might be asking about redis or something similar.
Maybe this: https://redis.io/topics/partitioning#different-implementations-of-partitioning
Redis cluster might fit -- keep in mind that when the docs mention "client-side partitioning": the client is some redis client library, loaded by your backends, responding to HTTP client/end-user requests
Answering your comment: then, I'm not sure what they were looking for.
Comparing Java hashmaps to a redis cluster isn't completely fair, considering one is bound to your JVM, while the other is actually distributed / sharded, implying at least inter-process communications and most likely network/non-local queries.
Then again, if the question is to scale an ever-growing JVM: at some point, we need to address the elephant in the room: how do you guarantee data consistency, proper replication/sharding, what do you do when a member goes down, ...?
Distributed hashmap, using Hazelcast, may be more relevant. Some (hazelcast) would make the argument it is safer under heavy write load. Others that migrating from Hazelcast to Redis helped them improve service reliability. I don't have enough background in Java myself, I wouldn't know.
As a general rule: when asked about Java, you could argue that speed and reliability very much rely on your developers understanding of what they're doing. Which, in Java, implies a large margin of error. While we could suppose: if they're asking such questions, they probably have some good devs on their payroll.
Whereas distributed databases (in-memory, on disk, SQL or noSQL), ... is quite a complicated topic, that you would need to master (on top of java), to get it right.
The broad approach they're describing was described by Adya in 2019 as a LInK store. Linked In-memory Key-value stores allow for application objects supporting rich operations to be sharded across a cluster of instances.
I would tend to approach this by implementing a stateful application using Akka (disclaimer: I am at this writing employed by Lightbend, which employs the majority of the developers of Akka and offers support and consulting services to clients using Akka; as my SO history indicates, I would have the same approach even multiple years before I was employed by Lightbend) along these lines.
Akka Cluster to allow a set of JVMs running an application to form a cluster in a peer-to-peer manner and manage/track changes in the membership (including detecting instances which have crashed or are isolated by a network partition)
Akka Cluster Sharding to allow stateful objects keyed by ID to be distributed approximately evenly across a cluster and rebalanced in response to membership changes
These stateful objects are implemented as actors: they can update their state in response to messages and (since they process messages one at a time) without needing elaborate synchronization.
Cluster sharding implies that the actor responsible for an ID might exist on different instances, so that implies some persistence of the state of the zone outside of the cluster. For simplicity*, when an actor responsible for a given zone starts, it initializes itself from datastore (could be S3, could be Dynamo or Cassandra or whatever): after this its state is in memory so reads can be served directly from the actor's state instead of going to an underlying datastore.
By directing all writes through cluster sharding, the in-memory representation is, by definition, kept in sync with the writes. To some extent, we can say that the application is the cache: the backing datastore only exists to allow the cache to survive operational issues (and because it's only in response to issues of that sort that the datastore needs to be read, we can optimize the data store for writes vs. reads).
Cluster sharding relies on a conflict-free replicated data type (CRDT) to broadcast changes in the shard allocation to the nodes of the cluster. This allows, for instance, any instance to handle an HTTP request for any shard: it simply forwards a representation of the important parts of the request as a message to the shard which will distribute it to the correct actor.
From Kubernetes' perspective, the instances are stateless: no StatefulSet or similar is needed. The pods can query the Kubernetes API to find the other pods and attempt to join the cluster.
*: I have a fairly strong prior that event sourcing would be a better persistence approach, but I'll set that aside for now.
My use case is as follow:
We have about 500 servers running in an autoscaling EC2 cluster that need to access the same configuration data (layed out in a key/value fashion) several million times per second.
The configuration data isn't very large (1 or 2 GBs) and doesn't change much (a few dozen updates/deletes/inserts per minute during peak time).
Latency is critical for us, so the data needs to be replicated and kept in memory on every single instance running our application.
Eventual consistency is fine. However we need to make sure that every update will be propagated at some point. (knowing that the servers can be shutdown at any time)
The update propagation across the servers should be reliable and easy to setup (we can't have static IPs for our servers, or we don't wanna go the route of "faking" multicast on AWS etc...)
Here are the solutions we've explored in the past:
Using regular java maps and use our custom built system to propagate updates across the cluster. (obviously, it doesn't scale that well)
Using EhCache and its replication feature. But setting it up on EC2 is very painful and somehow unreliable.
Here are the solutions we're thinking of trying out:
Apache Ignite (https://ignite.apache.org/) with a REPLICATED strategy.
Hazelcast's Replicated Map feature. (http://docs.hazelcast.org/docs/latest/manual/html-single/index.html#replicated-map)
Apache Geode on every application node. (http://geode.apache.org/)
I would like to know if each of those solutions would work for our use case. And eventually, what issues I'm likely to face with each of them.
Here is what I found so far:
Hazelcast's Replicated Map is somehow recent and still a bit unreliable (async updates can be lost in case of scaling down)
It seems like Geode became "stable" fairly recently (even though it's supposedly in development since the early 2000s)
Ignite looks like it could be a good fit, but I'm not too sure how their S3 discovery based system will work out if we keep adding / removing node regularly.
Thanks!
Geode should work for your use case. You should be able to use a Geode Replicated region on each node. You can choose to do synchronous OR asynchronous replication. In case of failures, the replicated region gets an initial copy of the data from an existing member in the system, while making sure that no in-flight operations are lost.
In terms of configuration, you will have to start a couple/few member discovery processes (Geode locators) and point each member to these locators. (We recommend that you start one locator/AZ and use 3 AZs to protect against network partitioning).
Geode/GemFire has been stable for a while; powering low latency high scalability requirements for reservation systems at Indian and Chinese railways among other users for a very long time.
Disclosure: I am a committer on Geode.
Ignite provides native AWS integration for discovery over S3 storage: https://apacheignite-mix.readme.io/docs/amazon-aws. It solves main issue - you don't need to change configuration when instances are restarted. In a nutshell, any nodes that successfully joins topology writes its coordinates to a bucket (and removes them when fails or leaves). When you start a new node, it reads this bucket and connects to one the listed addresses.
Hazelcast's Replicated Map will not work for your use-case. Note that it is a map that is replicated across all it's nodes not on the client nodes/servers. Also, as you said, it is not fully reliable yet.
Here is the Hazelcast solution:
Create a Hazelcast cluster with a set of nodes depending upon the size of data.
Create a Distributed map(IMap) and tweak the count & eviction configurations based on size/number of key/value pairs. The data gets partitioned across all the nodes.
Setup Backup count based on how critical the data is and how much time it takes to pull the data from the actual source(DB/Files). Distributed maps have 1 backup by default.
In the client side, setup a NearCache and attach it to the Distributed map. This NearCache will hold the Key/Value pair in the local/client side itself. So the get operations would end up in milliseconds.
Things to consider with NearCache solution:
The first get operation would be slower as it has to go through network to get the data from cluster.
Cache invalidation is not fully reliable as there will be a delay in synchronization with the cluster and may end reading stale data. Again, this is same case across all the cache solutions.
It is client's responsibility to setup timeout and invalidation of Nearcache entries. So that the future pulls would get fresh data from cluster. This depends on how often the data gets refreshed or value is replaced for a key.
Friends, I came to know that in hadoop2 when we configure high availability there is no need to configure a secondary-name-node/checkpoint-node/backup-node. With a new kind of mechanism the availability is given by edits shared among the active and standby namenodes.
My question is, secondary-name-node functionality is to merge the edits file with fsimage file periodically, thus gives 2 benefits in hadoop1 world 1) limits the size of edits file and 2) reduces the time of restart by keeping the fsimage nearly up to date.
Therefore, if High Availability is enabled and if secondary-name-node is not required. Then who will do the stiching of edits with fsimage? or is that step not required now due to some architectural/process changes.
Help me to understand it.
There are two modes of deploying HDFS HA (N.B. this is the current 2.7.1 state, if you land on this post sometime post 2016 things may had changed):
shared NFS, where the Active and Standby NameNode are actually working on the same files (image and log). See HDFS HighAvailability using NFS.
Quorum Journal Manager, where the active and passive NameNode both rely on a new service, a set of minimum 3 JournalNodes that provide a quorum for log edits. See HDFS High Availability Using the Quorum Journal Manager.
For both of these configurations, the documentation explicitly calls out the answer to your question:
Note that, in an HA cluster, the Standby NameNode also performs checkpoints of the namespace state, and thus it is not necessary to run a Secondary NameNode, CheckpointNode, or BackupNode in an HA cluster. In fact, to do so would be an error. This also allows one who is reconfiguring a non-HA-enabled HDFS cluster to be HA-enabled to reuse the hardware which they had previously dedicated to the Secondary NameNode.
I'm designing a system where a cluster of EC2 instances do some computing and then update a large file continually. What would be ideal is if I could have the file in S3, and have all the instances take turns writing to it one at a time, performing calculations while they wait.
As it stands if 2 instances PUT to S3 at the same time, 1 will simply override the other.
How can I solve this concurrency issue?
AWS has a preview service called EFS (http://aws.amazon.com/documentation/efs/) that is an NFS4 that can be shared among EC2 instances. But such service alone does not solve your problem as you may still have concurrency issues. Consider having something more sophisticated such as exploiting "embarrassingly parallel processing" such as having N processes creating N file chunks and finally having a single file joining all pieces together when everything is done.
As it is Amazon states that if you receive a success code then your S3 object is committed. Amazon also adds that there wouldn't be any dirty writes or overlapping inconsistency - you would read either of a fully committed write.
If you need more control you might be able to do it application like implementing a critical section.
It certainly makes sense to enable versioning the bucket so that you get to maintain all the writes and later you can specify which version as the latest.
You can also leverage the life cycle rules delete ( keep deleting ) the last n version to save cost.
Using HDFS for Nuodb as storage. Would this have a performance impact?
If I understand correctly, HDFS is better suited for batch mode or write once and read many times, types of application. Would it not increase the latency for record to be fetch in case it needs to read from storage?
On top of that HDFS block size concept, keep the file size small that would increase the network traffic while data is being fetch. Am I missing something here? Please point out the same.
How would Nuodb manage these kind of latency gotchas?
Good afternoon,
My name is Elisabete and I am the Technical Support Engineer over at NuoDB. I believe that I may have just answered this via your post on our own forum, but I'm responding here as well for anyone else who's curious.
First... a mini lesson on NuoDB architecture/layout:
The most basic NuoDB set-up includes:
Broker Agent
Transaction Engine (TE)
Storage Manager (SM) connected to an Archive Directory
Broker Agents keep track of all the moving parts in the domain (collection of machines hosting NuoDB processes) and provide client applications with connection information for the next available Transaction Engine.
Transaction Engines process incoming SQL requests and manage transactions.
Storage Managers read and write data to and from "disk" (Archive Directory)
All of these components can reside on a single machine, but an optimal set up would have them spread across multiple host machines (allowing each process to take full advantage of the host's available CPU/RAM). Also, while it's possible to run with just one of each component, this is a case where more is definitely more. Additional Brokers provide resiliency, additional TE's increase performance/speed and additional SM's ensure durability.
Ok, so now lets talk about Storage:
This is the "Archive Directory" that your storage manager is writing to. Currently, we support three modes of storage:
Local Files System
Amazon Web Services: Simple Storage volume (S3), Elastic Block Storage (EBS)
Hadoop Distributed Files System (HDFS)
So, to elaborate on how NuoDB works with HDFS... it doesn't know about the multiple machines that the HDFS layer is writing to. As far as the SM is concerned, it is reading and writing data atoms to a single directory. The HDFS layer decides how to then distribute and retrieve data to and from the cluster of machines it resides over.
And now to finally address the question of latency:
Here's the thing, whenever we introduce a remote storage device, we inevitably introduce some amount of additional latency because the SM now has further to go when reading/writing atoms to/from memory. HDFS likely adds a bit more, because now it needs to do it's magic divvying up, distributing, retrieving and reassembling data. Add to that discrepancy in network speed, etc.
I imagine that the gained disk space outweighs the cost in travel time, but this is something you'd have to decide on a case by case basis.
Now, all of that said... I haven't mentioned that TE and SM's both have the ability to cache data to local memory. The size of this cache is something you can set, when starting up each process. NuoDB uses a combination of Multi-Version Concurrency Control (MVCC) and a near constant stream of communication between all of the processes, to ensure that data held in cache is kept up to date with all of the changes happening within the system. Garbage Collection also kicks in and clears out atoms in a Least Recently Used order, when the cache grows close to hitting its limit.
All of this helps reduce latency, because the TE's can hold onto the data they reference most often and grab copies of data they don't have from sibling TE's. When they do resort to asking the SM's for data, there's a chance that the SM (or one of its sibling SM's) has a copy of the requested data in local cache, saving itself the trip out to the Archive Directory.
Whew.. that was a lot and I absolutely glossed over more than a few concepts. These topics are covered in greater depth via the new suite of white papers (and the new "green book") available on our main website. I'm currently also working on some visual guides, to help explain all of this.
If you'd like to know more about NuoDB or if I didn't quite answer your question.... please reach out to me directly via the NuoDB Community Forums (I respond to posts there, a bit faster).
Thank you,
Elisabete
Technical Support Engineer at NuoDB