What are the differences between Object Storages for example S3 and a columnar based Technology

I was thinking about the difference between those two approaches.
Imagine you must handle information about pattern calls, which later should be
displayed to the user. A pattern call is a tuple consisting of a unique integer
identifier ("id"), a user-defined name ("name"), a project-relative path to the
so-called pattern file ("patternFile") and a convenience flag, which states whether
the pattern should be called or not. The number of tuples is not known in advance, and the tuples won't be modified after initialization.
I thought that in this case a column-based approach, with BigQuery for example, would be better in terms of I/O and performance as well as schema evolution. But I can't actually work out why. I would appreciate any help.

Amazon S3 is like a large key-value store. The Key is the filename (with full path) and the Value is the contents of the file. It's just a blob of data.
A columnar data store organizes data in such a way that specific data can be "jumped to", and only desired values need to be read from disk.
If you want to search the data, then some form of logic has to be applied to it. This could be done by storing the data in a database (typically in a proprietary format) or by using a columnar storage format such as Parquet or ORC plus a query engine that understands this format (e.g. Amazon Athena).
The difference between S3 and columnar data stores is like the difference between a disk drive and an Oracle database.
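
To make the contrast concrete, here is a minimal in-memory sketch. It is not S3 or Parquet code: the `object_store` dict, the key name and the sample records are all hypothetical stand-ins. It only illustrates that a blob must be fetched and parsed as a whole, while a columnar layout lets a reader touch a single column.

```python
import json

# Hypothetical pattern-call records, shaped like the tuples in the question.
rows = [
    {"id": 1, "name": "first", "patternFile": "patterns/a.pat", "called": True},
    {"id": 2, "name": "second", "patternFile": "patterns/b.pat", "called": False},
    {"id": 3, "name": "third", "patternFile": "patterns/c.pat", "called": True},
]

# Object-store view: one key, one opaque blob of bytes.
object_store = {"exports/pattern_calls.json": json.dumps(rows).encode("utf-8")}


def names_from_blob(store, key):
    """Fetch the whole value and parse every field just to get one of them."""
    records = json.loads(store[key].decode("utf-8"))
    return [record["name"] for record in records]


# Columnar view: the values of each field are stored together.
columns = {field: [row[field] for row in rows] for field in rows[0]}


def names_from_columns(cols):
    """Read only the column that the query needs."""
    return cols["name"]


blob_names = names_from_blob(object_store, "exports/pattern_calls.json")
column_names = names_from_columns(columns)
```

Both functions return the same names. The difference is how much data each one has to read to get there, which is what a columnar format plus a query engine exploits on real storage.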

Related

What is AWS S3 dataset?

Looking at documentation of awswrangler.s3.to_csv or awswrangler.s3.to_parquet, there is a dataset parameter.
From testing, it looks like setting dataset=True makes it possible, among other things, to append new data to an already existing set. It also looks like when dataset=True, I can't specify the file name, and the names of the files added to the specified path are autogenerated.
Apart from that, I can't find more information on what dataset means. Is it just referring to the general concept or is there a specific meaning within the context of AWS? What exactly is dataset and when should it be set to True?

The dataset=True option allows you to store the entire dataset, including all metadata, indexes, etc.
The dataset parameter documentation:
dataset (bool) – If True store as a dataset instead of ordinary file(s) If True, enable all follow arguments: partition_cols, mode, database, table, description, parameters, columns_comments, concurrent_partitioning, catalog_versioning, projection_enabled, projection_types, projection_ranges, projection_values, projection_intervals, projection_digits, catalog_id, schema_evolution.
Note all those extra things that get saved when you save a dataset. All that information, like columns_comments, concurrent_partitioning, projection_values, will be lost when you save to CSV or Parquet. But on the other hand, those values are probably only useful if you plan to do further manipulation of the data via awswrangler/pandas at some later date.
Also note that if you set dataset=True you have to give it a file name prefix instead of a single file name, because the output generated will be spread across multiple files.
If you want to use the data in any other tool besides Pandas, such as loading the CSV into Excel, then you most likely want to set dataset=False and output to a single file.
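
As a rough sketch of the append behaviour described in the question, the snippet below wraps `awswrangler.s3.to_parquet` with `dataset=True`. The helper names `dataset_write_options` and `append_batch`, and any bucket or prefix you pass in, are assumptions made for illustration. Actually running `append_batch` requires awswrangler to be installed and AWS credentials to be configured.

```python
def dataset_write_options(partition_cols=None, mode="append"):
    """Keyword arguments that only take effect when dataset=True."""
    options = {"dataset": True, "mode": mode}
    if partition_cols:
        options["partition_cols"] = list(partition_cols)
    return options


def append_batch(df, prefix, partition_cols=None):
    """Append a pandas DataFrame as new Parquet file(s) under an S3 prefix."""
    import awswrangler as wr  # imported lazily so the helper above has no dependency

    # `prefix` is a path prefix such as "s3://my-bucket/events/" (hypothetical),
    # not a file name: the object names under it are generated for you.
    return wr.s3.to_parquet(
        df=df, path=prefix, **dataset_write_options(partition_cols)
    )
```

Calling `append_batch` repeatedly with the same prefix should add files rather than replace them. Passing `mode="overwrite"` to `dataset_write_options` would instead replace what is already there.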

Why are tables segmented when exporting to parquet from AWS RDS

We use Python's boto3 library to execute start_export_task to trigger an RDS snapshot export (to S3). This successfully generates a directory in S3 that has a predictable, expected structure. Traversing down through that directory to any particular table directory (as in export_identifier/database_name/schema_name.table_name/) I see several .parquet files.
I download several of these files and convert them to pandas dataframes so I can look at them. They are all structured the same and seem to clearly be pieces of the same table. But they range in size from 100KB to 8MB in seemingly unpredictable size segments. Do these files/'pieces' of the table account for all its rows? Do they repeat/overlap at all? Why are they segmented so (seemingly) randomly? What parameters control this segmenting?
Ultimately I'm looking for documentation on this part of parquet folder/file structure. I've found plenty of information on how individual files are structured and partitioning. But I think this falls slightly outside of those topics.

You're not going to like this, but from AWS' perspective this is an implementation detail and according to the docs:
The file naming convention is subject to change. Therefore, when reading target tables we recommend that you read everything inside the base prefix for the table.
— docs
Most of the tools that work with Parquet don't really care about the number or file names of the parquet files. You just point something like Spark or Athena to the prefix of the table and it will read all the files and figure out how they fit together.
In the API there are also no parameters to influence this behavior. If you prefer a single file for aesthetic reasons or others, you could use something like a Glue Job to read the table prefixes, coalesce the data per table in a single file and write it to S3.
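
If you do want to walk the files yourself, a minimal sketch of "read everything inside the base prefix" could look like the following. The function name `list_parquet_keys` is hypothetical. It expects a boto3 S3 client and relies only on the `list_objects_v2` paginator.

```python
def list_parquet_keys(s3_client, bucket, table_prefix):
    """Return every .parquet object key under a table's base prefix."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=table_prefix):
        # A page with no matching objects carries no "Contents" entry.
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".parquet"):
                keys.append(obj["Key"])
    return sorted(keys)
```

Usage would be something like `list_parquet_keys(boto3.client("s3"), "my-bucket", "export_identifier/database_name/schema_name.table_name/")`, where the bucket name is a placeholder. Because the function never looks at the file names beyond the extension, it keeps working if the naming convention changes.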

Athena Query Results: Are they always strings?

I'm in the process of building new "ETL" pipelines with CTAS. Unfortunately, quite often the CTAS query is too intensive, which causes Athena to time out. As such, I use CTAS to create the initial table and populate it with a small sample. I then write a script that queries the same table the CTAS was generated from (which is in Parquet format) for the remaining days that the CTAS couldn't handle upfront. I write the output of these query results to the same directory that holds the results of the CTAS query before repairing the table (to pick up new data). However, it seems to be a pretty clunky process for a number of reasons:
1) Query results written out with a standard SQL statement all end up being strings. For example, when I write out the number of DAUs (which is a count and cast to an int) the CSV output is a string, i.e. wrapped in "".
Is it possible to write out Athena "query_results" (not the CTAS) as anything other than a string when in CSV format? The main problem with this is that the results can't be read back into the table produced by the CTAS, since these columns expect a bigint. This, of course, can be resolved with a Lambda function, but that seems like a big overhead for something that should be trivial.
2) Can you put query results (not from CTAS) directly into Parquet instead of CSV?
3) Is there any way to prevent metadata being generated with the query_results (not from CTAS)? Again, it can be cleaned up with a Lambda function, but it's just additional nonsense I need to handle.
Thanks in advance!

The data type of the result depends on the SQL used to create it and also on how you consume it. Based on your question I'm going to assume that you're creating a table using CTAS and that the output is CSV, and that you're then looking at the CSV data directly.
That CSV is going to have quotes in it, but that doesn't mean that it's not possible to read integer values as integers, and so on. Athena uses a schema-on-read approach, and as long as the serde can interpret a value as a particular type, that type will work as the type of the column.
If you query the table created by your CTAS operation you should get back integers for the integer columns.
Using CTAS you can also create output of different types, like JSON, Avro, Parquet, and ORC, that keep the type information. Just use the format property to select the output type.
I am a bit confused what you mean by your third question. With a normal query you get two files on S3, the data file and the metadata file, and they will be written to the output location given in the StartQueryExecution API call, but with a CTAS query you get the output data in a different location (given in the SQL) than the metadata file.
Are you actually using CTAS, or are you talking about the regular query result files?
Update after the question got clarified:
1) Athena is unfortunately unable to properly read its own output in many situations. It really surprises me that this was never considered before launch. You might be able to set up a table that uses the regex serde.
2) No, unfortunately the only output of a regular query is CSV at this time.
3) No, the metadata is always written to the same prefix as the output.
I think your best bet is running multiple CTAS queries that each select a subset of your source data. If there is a date column, for example, you could run one CTAS per month, or per some other time range that works. After the CTAS queries have completed you can move the result files into the same directory on S3 and create a final table that has that directory as its location.
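
A small sketch of the "one CTAS per month" idea follows. It only generates the SQL text. The table names, the S3 prefix and the assumption that the filter column has type DATE are all hypothetical, and each statement writes to its own empty location because a CTAS needs one.

```python
from datetime import date


def next_month(d):
    """First day of the month after the one containing `d`."""
    return date(d.year + 1, 1, 1) if d.month == 12 else date(d.year, d.month + 1, 1)


def monthly_ctas(source, target, s3_prefix, date_col, first, last):
    """Build one CTAS statement per calendar month from `first` through `last`."""
    statements = []
    lo = date(first.year, first.month, 1)
    while lo <= last:
        hi = next_month(lo)  # half-open range [lo, hi) covers exactly one month
        tag = lo.strftime("%Y_%m")
        statements.append(
            f"CREATE TABLE {target}_{tag}\n"
            f"WITH (format = 'PARQUET', external_location = '{s3_prefix}{tag}/')\n"
            f"AS SELECT * FROM {source}\n"
            f"WHERE {date_col} >= DATE '{lo.isoformat()}' "
            f"AND {date_col} < DATE '{hi.isoformat()}'"
        )
        lo = hi
    return statements


statements = monthly_ctas(
    "events", "events_part", "s3://my-bucket/ctas/", "event_date",
    date(2023, 11, 15), date(2024, 1, 10),
)
```

Each string in `statements` could then be submitted through StartQueryExecution, one at a time, before moving the result files under a common prefix as described above.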

C++ persistence in database

I would like to persist some of my objects in a database (this could be relational (postgresql or MariaDB) or MongoDB). I have found a number of libraries that seem potentially useful, but I am missing the overall picture.
I have used boost::serialization to serialize C++ objects to XML / binary, but it is not clear to me how to get this into the database (do I use the binary or the XML format?).
How do I get this into my mongoDB or postgresql?

You'd serialize to binary, as it is smaller and much faster. Also, the XML format isn't really pretty/easy to use outside of Boost Serialization anyways.
WARNING: Use Boost Portable Archive (EPA http://epa.codeplex.com/) if you need to use the format across different machines.
You'd usually store it in a column of one of these types:
- text or CLOB (character large object), by encoding in base64 and putting that in the database's native charset (base64 is safe even for ASCII)
- BLOB (binary large object), which removes the need to encode and could be more efficient storage-wise.
Note: if you need to index, store the index properties in normal database columns.
Finally, if you like, I have recently made a streambuffer that allows you to stream data directly into a Sqlite BLOB column. Perhaps you can glean some ideas from this you could use:
How to boost::serialize into a sqlite::blob?
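
The two storage options and the indexing note can be sketched independently of Boost. The snippet below is a Python analogue, not Boost.Serialization code: `pickle` stands in for a binary archive, an in-memory SQLite database stands in for PostgreSQL or MariaDB, and the table and column names are made up for illustration.

```python
import base64
import pickle
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE objects ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT,"          # searchable property kept in a normal column
    " payload BLOB,"       # raw serialized bytes
    " payload_b64 TEXT)"   # the same bytes, base64-encoded for a text column
)
conn.execute("CREATE INDEX idx_objects_name ON objects(name)")

obj = {"name": "widget", "size": [3, 4]}  # hypothetical object to persist
data = pickle.dumps(obj)                  # stand-in for a binary archive

conn.execute(
    "INSERT INTO objects (name, payload, payload_b64) VALUES (?, ?, ?)",
    (obj["name"], data, base64.b64encode(data).decode("ascii")),
)

# Look the row up through the indexed column, then deserialize either form.
blob, text = conn.execute(
    "SELECT payload, payload_b64 FROM objects WHERE name = ?", ("widget",)
).fetchone()
from_blob = pickle.loads(blob)
from_text = pickle.loads(base64.b64decode(text))
```

The base64 text is always longer than the raw bytes, which is the storage cost mentioned above for the text/CLOB route.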

How are documents retrieved after reduce produces the output?

So, after reduce completes its job we have data stored in the files something like this:
But what happens when the user types something? How is search performed when the data is stored just in files?

MapReduce is for processing. Once you have processed the data and generated your aggregate information, which is on HDFS, you either have to read the file in some program to display it to the user, or use one of several alternative options for reading the data from HDFS:
You could use Hive to create a table on top of this data and read it using SQL-like queries. A simple web application can connect to it through the Thrift server, which provides a JDBC interface to Hive.
Other options include loading the data into HBase, Shark, etc. It all depends on your use case in terms of the size of the aggregated data and the performance requirements.

What you have constructed after MapReduce is an inverted index, a nice little data structure. Now you have to use it.
For example, in the case of Google, this inverted index is sharded across many servers, and each server stores the entire list for the words assigned to it. So, for example, server 500 has the list for "be", and another has the list for "to". These are implementation details; you could theoretically store it on one box in a large hash if you could hold the index in memory.
When the customer types words into the engine, it retrieves the entire list for each word. If there are multiple words, it intersects those lists to show you the documents that contain all of them.
Here is the source for the full paper on how they did it http://infolab.stanford.edu/~backrub/google.html
See "Figure 4. Google Query Evaluation"
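
As a toy illustration of the lookup-and-intersect step, here is a single-machine sketch. The documents and function names are made up, the index lives in one in-memory dict rather than being sharded, and ranking is ignored entirely.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # One posting list per query word; an unknown word has an empty list.
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings)


docs = {
    1: "to be or not to be",
    2: "to do is to be",
    3: "do be do",
}
index = build_index(docs)
hits = search(index, "to be")
```

In a sharded setup the only change to `search` would be where each posting list is fetched from. The intersection itself stays the same.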
