What is a namespace in Aerospike - in-memory-database

I'm new to Aerospike.
What is a namespace and how to create a namespace in aerospike?

A namespace is a top-level container for data in Aerospike. The most important part of a namespace configuration is the storage definition (RAM only, RAM plus persistence on disk, or disk only - usually flash storage). You can also configure other things at the namespace level, like the data retention policy (default TTL and high-water marks to protect against running out of disk or memory). I would recommend reading this page for details.
You need to have at least one namespace defined in your cluster. Also, you cannot dynamically add or remove namespaces in a cluster. In order to add or remove a namespace, you have to stop all the nodes (at this point), change the configuration on all the nodes (IMPORTANT - the configuration should match on all nodes) and then restart the nodes one by one.
For more details on configuration of a namespace, you should go through this page. (Already mentioned in another response to your question).
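To make the configuration step concrete, here is a minimal sketch of what a namespace stanza in aerospike.conf might look like. The namespace name, sizes and file path are placeholders, and exact parameter names vary between server versions, so check the docs for your release rather than copying this verbatim:

    namespace test {
        replication-factor 2
        memory-size 4G              # index/data memory budget (older server versions)
        default-ttl 30d             # records expire after 30 days unless updated
        high-water-memory-pct 60    # start evicting when memory use crosses 60%

        # RAM only would be: storage-engine memory
        # RAM + persistence on disk:
        storage-engine device {
            file /opt/aerospike/data/test.dat
            filesize 16G
            data-in-memory true
        }
    }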

A namespace is like the database name in Aerospike, a set is the table name, and bins are the columns.

In simple terms,
namespaces are semantically similar to databases in an RDBMS. Within a namespace, data is subdivided into sets (similar to tables) and records (similar to rows).
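As a rough illustration of that mapping, here is a minimal sketch using the Aerospike Python client. The namespace test, the set users and the bin names are placeholders, and it assumes a local server with a test namespace already configured:

    import aerospike

    # Connect to a local node (hypothetical host/port).
    config = {'hosts': [('127.0.0.1', 3000)]}
    client = aerospike.client(config).connect()

    # Key = (namespace, set, user key) -- roughly (database, table, primary key).
    key = ('test', 'users', 'user-1')

    # Bins are the "columns" of the record.
    client.put(key, {'name': 'Alice', 'age': 30})

    _, _, record = client.get(key)
    print(record)   # {'name': 'Alice', 'age': 30}

    client.close()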

If you want to add a new namespace, you can do something like the below:
- Modify the aerospike.conf file on the server side.
- Then restart the cluster, which should restart all the nodes.
For more, go to this link:
http://www.aerospike.com/docs/operations/configure/namespace/

Namespaces are the top level containers for data. A namespace can actually be a part of a database or it can be a group of databases as you would think of them in a standard RDBMS – the reason you collect data into a namespace relates to how the data is going to be stored and managed.
A namespace contains records, indexes and policies. A policy dictates the behavior of the namespace, including:
- How data is stored: DRAM or disk.
- How many replicas should exist for a record.
- When records should expire.
For a detailed study on the Data model and architecture of Aerospike read the following link: http://www.aerospike.com/docs/architecture/data-model.html

Related

What does "partitioned data" mean - S3

I want to use Netflix's outputCommitter (Using Spark with Amazon EMR).
In the README there are 2 options:
S3DirectoryOutputCommitter - for writing unpartitioned data to S3 with conflict resolution.
S3PartitionedOutputCommitter - for writing partitioned data to S3 with conflict resolution.
I tried to understand the differences, but without success. Can someone explain what "partitioned data" means in S3?
According to the Hadoop docs, "This committer is an extension of the “Directory” committer which has a special conflict resolution policy designed to support operations which insert new data into a directory tree structured using Hive’s partitioning strategy: different levels of the tree represent different columns."
Search the Hadoop docs for the full details.
Be aware that the EMR committers are not the ASF S3A ones, so they take different config options and have their own docs. But since they are a reimplementation of the Netflix work, they should do the same thing here.
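For reference, here is a minimal PySpark sketch of selecting the ASF S3A committers and their conflict mode. These are the s3a options, not the EMR committer's options; the bucket path and column are made up, and Spark additionally needs the hadoop-cloud committer bindings described in the S3A committer docs:

    from pyspark.sql import SparkSession, functions as F

    spark = (
        SparkSession.builder
        .appName("s3a-partitioned-committer")
        # ASF S3A committer choice: "directory" or "partitioned"
        .config("spark.hadoop.fs.s3a.committer.name", "partitioned")
        # Conflict resolution for partitions that already exist: fail, append or replace
        .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "replace")
        .getOrCreate()
    )

    # Inserting new data into a Hive-style partitioned tree is exactly the case
    # the partitioned committer's conflict policy is designed for.
    df = spark.range(100).withColumn("month", F.col("id") % 12)
    df.write.partitionBy("month").parquet("s3a://my-bucket/events/")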
I'm not familiar with outputCommitter, but partitioned data in Amazon S3 normally refers to splitting files amongst directories to reduce the amount of data that needs to be read from disk.
For example:
/data/month=1/
/data/month=2/
/data/month=3/
...
If a Hive-type query is run against the data with a clause like WHERE month=1, then it would only need to look in the month=1/ subdirectory, thereby saving 2/3rds of disk access.
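A rough PySpark sketch of how such a layout is written and then pruned at read time (the bucket name, column and values are made up for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partitioning-example").getOrCreate()

    df = spark.createDataFrame(
        [(1, "a"), (2, "b"), (3, "c")],
        ["month", "value"],
    )

    # Produces one subdirectory per distinct month, e.g. s3://my-bucket/data/month=1/
    df.write.partitionBy("month").parquet("s3://my-bucket/data/")

    # A filter on the partition column only lists/reads the matching subdirectories.
    spark.read.parquet("s3://my-bucket/data/").where("month = 1").show()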

GCP Data Catalog - ONE for all projects (of one or more orgs)

What is the best practice to get a company-wide setup (one or more organisations, each with multiple folders and projects)
into one central data catalog that contains all the metadata?
(If "multiple orgs" is too complex, then let's start with one.)
I've put together a sample showing how to work with one organization.
The main idea is to use a Tag Central Project, where you store the common resources, like Tag Templates, Policy Tags and Custom Entries that could be reused.
So you have:
- Tag Central Project
- List of Analytics Projects (where you have the data assets)
Then the next thing is the user personas you would use; I'd suggest starting with:
- Data Governors
- Data Curators
- Data Analysts
This google-datacatalog-governance-best-pratices GitHub repo contains the code, which uses Terraform to automatically set up the governance best practices I mentioned.
You can adapt those samples to work at folder or organization level by changing the terraform resources.
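To make the "Tag Central Project" idea more tangible, here is a rough Python sketch of creating a shared Tag Template in that central project with the Data Catalog client library. The project ID, location, template ID and field are all made-up placeholders:

    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()

    # Hypothetical central project and location.
    parent = "projects/tag-central-project/locations/us-central1"

    template = datacatalog_v1.TagTemplate()
    template.display_name = "Data Governance"

    field = datacatalog_v1.TagTemplateField()
    field.display_name = "Data owner"
    field.type_.primitive_type = datacatalog_v1.FieldType.PrimitiveType.STRING
    template.fields["owner"] = field

    # Analytics projects can then attach tags based on this shared template.
    created = client.create_tag_template(
        parent=parent,
        tag_template_id="data_governance",
        tag_template=template,
    )
    print(created.name)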

AWS S3 Block Size to calculate total number of mappers for Hive Workload

Does S3 store the data in the form of blocks? If yes, what is the default block size? Is there a way to alter the block size?
Block Size is not applicable to Amazon S3. It is an object storage system, not a virtual disk.
There is believed to be some partitioning of the uploaded data into the specific blocks in which it was uploaded - and if you knew those values then readers might get more bandwidth. But certainly the open source Hive/Spark/MapReduce applications don't know the API calls to find this information out or look at these details. Instead, the S3 connector takes a configuration option (for s3a: fs.s3a.block.size) to simulate blocks.
It's not so beneficial to work out that block size if it took an HTTP GET request against each file to determine the partitioning... that would slow down the (sequential) query planning before tasks on split files were farmed out to the worker nodes. HDFS lets you get the listing and block partitioning + location in one API call (listLocatedStatus(path)); S3 only has a list call to return the list of (objects, timestamps, etags) under a prefix (S3 List API v2), so that extra check would slow things down. If someone could fetch that data and show that there'd be benefits, maybe it'd be useful enough to implement. For now, calls to S3AFileSystem.listLocatedStatus() against S3 just get a made-up list of locations, splitting into blocks by the fs.s3a.block.size value and with the location set to localhost. All the apps know that location == localhost means "whatever".
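As a back-of-the-envelope illustration (not Hive's actual code) of how that simulated block size feeds into mapper counts, assuming the classic FileInputFormat split formula splitSize = max(minSize, min(maxSize, blockSize)):

    import math

    def estimate_mappers(object_size_bytes,
                         block_size=32 * 1024 * 1024,   # default fs.s3a.block.size is 32M
                         min_split=1,
                         max_split=2**63 - 1):
        # splitSize = max(minSize, min(maxSize, blockSize)), then one mapper per split.
        split_size = max(min_split, min(max_split, block_size))
        return math.ceil(object_size_bytes / split_size)

    # A 1 GiB object with the default simulated block size -> 32 splits/mappers.
    print(estimate_mappers(1024 ** 3))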

Organizing files in S3

I have a social media web application. Users upload pictures such as profile pictures, project pictures, etc. What's the best way to organize these files in an S3 bucket?
I thought of creating a folder with the user ID as its name inside the bucket, and then inside that multiple other folders, i.e. profile, projects, etc.
Not sure if that's the best approach to follow!
The names (keys) you assign to an object in Amazon S3 are frankly irrelevant.
What matters is that you have a database that tracks the objects, their ownership and their purpose.
You should not use the filename (Key) of an Amazon S3 object as a way of storing information about the object, because your application might have millions of objects in S3 and it is too slow to scan the list of objects to see which ones exist. Instead, consult a database to find them.
To answer your question: yes, create a prefix by username if you wish, but then just give each object a unique name (e.g. a universally unique identifier, UUID) that avoids name clashes.
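A minimal boto3 sketch of that approach, with a hypothetical bucket name and key scheme (the database write is only indicated as a comment):

    import uuid
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-app-uploads"  # hypothetical bucket name

    def upload_profile_picture(user_id: str, local_path: str) -> str:
        # Per-user prefix plus a UUID so names never clash.
        key = f"{user_id}/profile/{uuid.uuid4()}.jpg"
        s3.upload_file(local_path, BUCKET, key)
        # The application would also record (user_id, purpose, key) in its own
        # database instead of relying on listing S3 later to find objects.
        return key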
Earlier, there used to be a need to add random prefixes for better performance. More details here and here.
Following is an extract from one of those pages:
Pay Attention to Your Naming Scheme If:
Distributing the Key names
Don't save your objects with key names that start with a date or other standard prefix; it adds complexity to S3's indexing and will reduce performance, because based on that indexing the objects are saved in a single storage partition.
Amazon S3 maintains keys lexicographically in its internal indices.
However, as of the 17 Jul 2018 announcement, adding a random prefix to S3 keys isn't required for improving performance anymore.

How exactly does Spark on EMR read from S3?

Just a few simple questions on the actual mechanism behind reading a file on s3 into an EMR cluster with Spark:
Does spark.read.format("com.databricks.spark.csv").load("s3://my/dataset/").where($"state" === "WA") communicate the whole dataset into the EMR cluster's local HDFS and then perform the filter after? Or does it filter records when bringing the dataset into the cluster? Or does it do neither? If this is the case, what's actually happening?
The official documentation lacks an explanation of what's going on (or if it does have an explanation, I cannot find it). Can someone explain, or link to a resource with such an explanation?
I can't speak for the closed-source AWS one, but the ASF s3a: connector does its work in S3AInputStream.
Reading data is via HTTPS, which has a slow startup time, and if you need to stop the download before the GET is finished, it forces you to abort the TCP stream and create a new one.
To keep this cost down, the code has features like:
Lazy seek: when you do a seek(), it updates its internal pointer but doesn't issue a new GET until you actually do a read.
Chooses whether to abort() or read to the end of a GET based on how much is left.
Has 3 IO modes:
"sequential": GET content range is from (pos, EOF). Best bandwidth, worst performance on seek. For: CSV, .gz, ...
"random": small GETs, min(block-size, length(read)). Best for columnar data (ORC, Parquet) compressed in a seekable format (snappy).
"adaptive" (new last week, based on some work from Microsoft on the Azure WASB connector): starts off sequential, then as soon as you do a backwards seek it switches to random IO.
The code is all there, improvements welcome. The current perf work (especially random IO) is based on TPC-DS benchmarking of ORC data on Hive, BTW.
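If it helps, here is a small PySpark sketch of picking one of those IO modes explicitly via the s3a "fadvise" input policy option (the property name can vary by Hadoop version, and the path is hypothetical):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("s3a-input-policy")
        # "sequential", "random" or "normal" (the adaptive behaviour)
        .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
        .getOrCreate()
    )

    # Columnar formats such as ORC/Parquet benefit most from the random policy.
    df = spark.read.parquet("s3a://my-bucket/warehouse/orders/")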
Assuming you are reading CSV and filtering there, it'll be reading the entire CSV file and filtering. This is horribly inefficient for large files. Best to import into a columnar format and use predicate pushdown, letting the layers below seek around the file for filtering and reading columns.
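A rough PySpark sketch of that recommendation, with made-up paths and a hypothetical city column alongside the state column from the question:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("columnar-pushdown").getOrCreate()

    # One-off conversion: reading CSV pulls whole files before filtering.
    csv_df = spark.read.option("header", True).csv("s3a://my-bucket/raw/dataset/")
    csv_df.write.mode("overwrite").parquet("s3a://my-bucket/curated/dataset/")

    # Later queries on the Parquet copy can push the filter down and prune columns,
    # so only the matching row groups and selected columns are read from S3.
    result = (
        spark.read.parquet("s3a://my-bucket/curated/dataset/")
        .where(col("state") == "WA")
        .select("state", "city")
    )
    result.show()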
Loading data from S3 (s3://) usually goes via EMRFS in EMR.
EMRFS directly accesses S3 (not via HDFS).
When Spark loads data from S3, it is stored as a Dataset in the cluster according to the StorageLevel (memory or disk).
Finally, Spark filters the loaded data.
When you specify files located on S3, they are read into the cluster. The processing happens on the cluster nodes.
However, this may be changing with S3 Select, which is now in preview.