How are documents retrieved after reduce produces the output? - mapreduce

So, after reduce completes its job, we have the output data stored in files (in this case, an inverted index mapping words to the documents that contain them).
But what happens when the user types something? How is search performed when the data is stored just in files?

MapReduce is for processing. So once you have processed the data and generated your aggregate information, which is on HDFS, you will either have to read the files in some program to display them to the user, or use one of several alternative options for reading the data from HDFS:
You could use Hive and create a table on top of this data, then read it using SQL-like queries. A simple web application can connect to this through the Thrift server, which provides a JDBC interface to Hive (a short example follows below).
Other options include loading the data into HBase, Shark, etc. It all depends on your use case in terms of the size of the aggregated data and your performance requirements.
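For the Hive option, here is a minimal sketch of what the web application's lookup code could look like, assuming HiveServer2/Thrift is running on hive-host:10000 and the aggregated output has been exposed as a Hive table named word_counts (host, table and column names are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class HiveLookup {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver (ships with the Hive JDBC jar).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive-host:10000/default", "user", "");
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT word, total FROM word_counts WHERE word = ?")) {
            stmt.setString(1, args[0]);          // the word the user typed
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
                }
            }
        }
    }
}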

What you have constructed after MapReduce is an inverted index, a nice little data structure. Now you have to use it.
For example, in Google's case, this inverted index is sharded across many servers, and the entire posting list for a given term lives on one of them. So, for example, server 500 has the list for "be", and another has the list for "to". These are implementation details; you could theoretically store it on one box in a large hash table if you could hold the index in memory.
When the user types words into the engine, it retrieves the entire list for each word. If there are multiple words, it intersects those lists to show you the documents that contain all of them.
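A toy sketch of that retrieval step, assuming the reduce output has been loaded into memory as a map from each word to a sorted list of document IDs (all class and method names here are invented for illustration):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class TinySearch {
    // word -> sorted list of document IDs (the inverted index produced by reduce)
    private final Map<String, List<Integer>> index;

    public TinySearch(Map<String, List<Integer>> index) {
        this.index = index;
    }

    /** Returns the IDs of documents that contain every query word. */
    public List<Integer> search(String... words) {
        if (words.length == 0) {
            return List.of();
        }
        List<Integer> result = index.getOrDefault(words[0], List.of());
        for (int i = 1; i < words.length; i++) {
            result = intersect(result, index.getOrDefault(words[i], List.of()));
        }
        return result;
    }

    // Linear-time intersection of two sorted posting lists.
    private static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = Integer.compare(a.get(i), b.get(j));
            if (cmp == 0) { out.add(a.get(i)); i++; j++; }
            else if (cmp < 0) { i++; } else { j++; }
        }
        return out;
    }
}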
Here is the full paper on how they did it: http://infolab.stanford.edu/~backrub/google.html
See "Figure 4. Google Query Evaluation"

Related

What are the differences between object storages (for example S3) and a columnar-based technology?

I was thinking about the difference between those two approaches.
Imagine you must handle information about pattern calls, which later should be displayed to the user. A pattern call is a tuple consisting of a unique integer identifier ("id"), a user-defined name ("name"), a project-relative path to the so-called pattern file ("patternFile") and a convenience flag, which states whether the pattern should be called or not. The number of tuples is not known beforehand, and they won't be modified after initialization.
I thought that in this case a column-based approach, with BigQuery for example, would be better in terms of I/O and performance, as well as schema evolution. But I can't actually explain why. I would appreciate any help.
Amazon S3 is like a large key-value store. The Key is the filename (with full path) and the Value is the contents of the file. It's just a blob of data.
A columnar data store organizes data in such a way that specific data can be "jumped to", and only desired values need to be read from disk.
If you want to perform a search on the data, then some form of query logic is required on top of it. This could be done by storing the data in a database (typically a proprietary format) or by using a columnar storage format such as Parquet or ORC plus a query engine that understands that format (e.g. Amazon Athena).
The difference between S3 and columnar data stores is like the difference between a disk drive and an Oracle database.
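To make the I/O point concrete, here is a toy sketch contrasting the two layouts for the pattern-call tuples from the question (the class and field names are invented for illustration; a real columnar store additionally compresses and encodes each column):

import java.util.List;

public class LayoutDemo {
    // Row-oriented: all four fields of a record are stored together, so reading
    // only the flag still drags the other fields along from storage.
    record PatternCall(int id, String name, String patternFile, boolean enabled) {}

    // Column-oriented: each attribute lives in its own array; a query that only
    // needs the flag scans just this one contiguous column.
    static class PatternCallColumns {
        int[] ids;
        String[] names;
        String[] patternFiles;
        boolean[] enabled;
    }

    static long countEnabledRowWise(List<PatternCall> rows) {
        return rows.stream().filter(PatternCall::enabled).count();
    }

    static long countEnabledColumnWise(PatternCallColumns cols) {
        long n = 0;
        for (boolean e : cols.enabled) if (e) n++;   // touches one column only
        return n;
    }
}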

Data Storage and Retrieval in Relational Database

I'm starting a project, a mini database system, basically a small database like MySQL. I'm planning to use C++. I read several articles and understood that tables will be stored and retrieved using files. Further, I need to use B+ trees for accessing and updating the data.
Can someone explain with an example how the data will actually be stored inside files?
For example, say I have a database "test" with a table "student" in it:
student(id, name, grade, class), with some student entries. How will the entries of this table be stored inside the file? Will it be stored in a single file or divided into several files, and if the latter, how?
A B+Tree on disk is a bunch of fixed-length blocks. Your program will read/write whole blocks.
Within a block, there are a variable number of records. Those are arranged by some mechanism of your choosing, and need to be ordered in some way.
"Leaf nodes" contain the actual data. In "non-leaf nodes", the "records" contain pointers to child nodes; this is the way BTrees work.
B+Trees have the additional links (and maintenance hassle) of chaining blocks at the same level.
Wikipedia has some good discussions.
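A compact sketch of one such fixed-length block (shown in Java purely for illustration; the same layout translates directly to C++, and the block size and header offsets are arbitrary choices for this sketch):

import java.nio.ByteBuffer;

/** One fixed-length B+Tree block, always read from and written to disk whole. */
public class BPlusTreeBlock {
    static final int BLOCK_SIZE = 4096;               // placeholder page size

    final ByteBuffer buf = ByteBuffer.allocate(BLOCK_SIZE);

    // Header layout used by this sketch:
    //   byte  0      : 1 if leaf, 0 if non-leaf
    //   bytes 1..4   : number of records currently stored in the block
    //   bytes 5..12  : block number of the next leaf (leaf chaining), -1 if none
    //   bytes 13..   : the records themselves, kept in key order

    boolean isLeaf()      { return buf.get(0) == 1; }
    int recordCount()     { return buf.getInt(1); }
    long nextLeafBlock()  { return buf.getLong(5); }

    void init(boolean leaf) {
        buf.put(0, (byte) (leaf ? 1 : 0));
        buf.putInt(1, 0);       // empty block
        buf.putLong(5, -1L);    // no next leaf yet
    }

    // In a leaf block each record holds a key plus the row data (or a pointer to
    // it); in a non-leaf block each record holds a key plus the block number of a
    // child, so a lookup descends from the root one block read at a time.
}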

BigQuery tabledata:list output into a BigQuery table

I know there is a way to place the results of a query into a table; there is a way to copy a whole table into another table; and there is a way to list a table piecemeal (tabledata:list using startIndex, maxResults and pageToken).
However, what I want to do is go over an existing table with tabledata:list and output the results piecemeal into other tables. I want to use this as an efficient way to shard a table.
I cannot find a reference to such a functionality, or any workaround to it for that matter.
Important to realize: the Tabledata.List API is not part of BQL (BigQuery SQL) but rather of the BigQuery API, which you can use from the client of your choice.
That said, the logic you outlined in your question can be implemented in many ways; below is one example (high-level steps):
Call Tabledata.List within a loop, using pageToken for the next iteration or for exiting the loop.
In each iteration, process the response from Tabledata.List, extract the actual data and insert it into the destination table by streaming it with the Tabledata.InsertAll API. You can also have an inner loop that goes through the rows extracted in a given iteration and decides which row goes to which table/shard.
This is very generic logic, and the particular implementation depends on the client you use.
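As a rough sketch of those steps using the Java API client, assuming a Bigquery service object has already been built and authorized, and with writeToShard standing in for whatever routing and Tabledata.InsertAll logic you choose (it is a placeholder, not part of the API):

import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.model.TableDataList;
import com.google.api.services.bigquery.model.TableRow;

import java.io.IOException;
import java.util.List;

public class TableSharder {
    static void shard(Bigquery bigquery, String project, String dataset, String table)
            throws IOException {
        String pageToken = null;
        int shard = 0;
        do {
            Bigquery.Tabledata.List request = bigquery.tabledata()
                    .list(project, dataset, table)
                    .setMaxResults(10000L);            // page size: tune as needed
            if (pageToken != null) {
                request.setPageToken(pageToken);
            }
            TableDataList response = request.execute();

            List<TableRow> rows = response.getRows();
            if (rows != null && !rows.isEmpty()) {
                // Placeholder: convert this page of rows and stream it into the
                // chosen destination table, e.g. via tabledata().insertAll(...).
                writeToShard(rows, shard++);
            }
            pageToken = response.getPageToken();       // null when no more pages
        } while (pageToken != null);
    }

    private static void writeToShard(List<TableRow> rows, int shard) {
        // hypothetical helper: route this page of rows to the desired shard table
    }
}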
Hope this helps
For what you describe, I'd suggest you use the batch version of Cloud Dataflow:
https://cloud.google.com/dataflow/
Dataflow already supports BigQuery tables as sources and sinks, and will keep all data within Google's network. This approach also scales to arbitrarily large tables.
TableData.list-ing your entire table might work fine for small tables, but network overhead aside, it is definitely not recommended for anything of moderate size.

Amazon DynamoDB Mapper - limits to batch operations

I am trying to write a huge number of records into DynamoDB and I would like to know the correct way of doing that. Currently, I am using the DynamoDBMapper to do the job in a single batchWrite operation, but after reading the documentation, I am not sure if this is the correct way (especially given that there are limits concerning the size and number of the written items).
Let's say that I have an ArrayList with 10000 records and I am saving it like this:
mapper.batchWrite(recordsToSave, new ArrayList<BillingRecord>());
The first argument is the list with records to be written and the second one contains items to be deleted (no such items in this case).
Does the mapper split this write into multiple writes and handle the limits or should it be handled explicitly?
I have only found examples with batchWrite done with the AmazonDynamoDB client directly (like THIS one). Is using the client directly for the batch operations the correct way? If so, what is the point of having a mapper?
Does the mapper split your list of objects into multiple batches and then write each batch separately? Yes, it does batching for you and you can see that it splits the items to be written into batches of up to 25 items here. It then tries writing each batch and some of the items in each batch can fail. An example of a failure is given in the mapper documentation:
This method fails to save the batch if the size of an individual object in the batch exceeds 400 KB. For more information on batch restrictions see, http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_BatchWriteItem.html
The example is talking about the size of one record (one BillingRecord instance in your case) exceeding 400 KB, which, at the time of writing this answer, is the maximum size of a record in DynamoDB.
In case a particular batch fails, it moves on to the next batch (sleeping the thread for a bit in case the failure was caused by throttling). In the end, all of the failed batches are returned in a List of FailedBatch instances. Each FailedBatch instance contains a list of unprocessed items that weren't written to DynamoDB.
Is the snippet that you provided the correct way for doing batch writes? I can think of two suggestions. The batchSave method is more appropriate if you have no items to delete. You might also want to think about what you want to do with the failed batches.
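For instance, since there is nothing to delete here, a sketch using batchSave and inspecting the failed batches might look like this (the error handling is only a placeholder; decide yourself whether to retry, log or drop the unprocessed items):

import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBMapper;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBMapper.FailedBatch;
import com.amazonaws.services.dynamodbv2.model.WriteRequest;

import java.util.List;
import java.util.Map;

public class BillingRecordWriter {
    static void writeAll(DynamoDBMapper mapper, List<BillingRecord> recordsToSave) {
        // The mapper internally splits the list into batches of up to 25 items.
        List<FailedBatch> failures = mapper.batchSave(recordsToSave);

        for (FailedBatch failed : failures) {
            // Items the mapper could not write, grouped by table name.
            Map<String, List<WriteRequest>> unprocessed = failed.getUnprocessedItems();
            System.err.println("Failed batch affecting " + unprocessed.size()
                    + " table(s): " + failed.getException());
        }
    }
}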
Is using the client directly the correct way? If so, what is the point of the mapper? The mapper is simply a wrapper around the client. It provides you an ORM layer that converts your BillingRecord instances into the sort of nested hash maps the low-level client works with. There is nothing wrong with using the client directly, and this does tend to happen in special cases where functionality beyond what the mapper offers has to be coded by hand.

Non-permanent huge external data storage in C++ application

I'm rewriting an application that handles a lot of data (about 100 GB) organized as a relational model.
The application is very complex; it is a kind of conversion tool for OpenStreetMap data of huge size (the whole world) and converts it into a map file for our own route-planning software. The converter, for example, holds the nodes from OpenStreetMap with their coordinates and all their tags (it holds a lot more than that, but this should serve as an example for this question).
Current situation:
Because this data is huge, I split it into several files: each file is a map from an ID to an atomic value (let's assume that the list of tags for a node is an atomic value; it is not, but the data storage can treat it as such). So for nodes, I have a file holding the nodes' coordinates, one holding their names and one holding their tags, where the nodes are identified by (non-continuous) IDs.
The application was once split into several applications, each processing one step of the conversion. Therefore, such an application only needs to handle some of the data stored in the files. For example, not all applications need the nodes' tags, but a lot of them need the nodes' coordinates. This is why I split the relations into files, one file for each "column".
Each processing step can read a whole file at once into a data structure within RAM. This ensures that lookups can be very efficient (if the data structure is a hash map).
I'm currently rewriting the converter. It should now be one single application, and it should no longer use separate files for each "column". It should rather use some well-known architecture to hold external data in a relational manner, like a database, but much faster.
=> Which library can provide the following features?
Requirements:
It needs to be very fast in iterating over the existing data (while not modifying the set of rows, but some values in the current row).
It needs to provide constant or near-constant lookup, similar to hash maps (while not modifying the whole relation at all).
Most of the column types have a constant size, but in general they do not.
It needs to be able to append new rows to a relation in constant or logarithmic time per row. Live-updating some kind of search index will not be required. Updating (rebuilding) the index can happen after a whole processing step is complete.
Some relations are key-value-based, while others are (contiguously indexed) arrays. Both of them should provide fast lookups.
It should NOT be a separate process, as a DBMS like MySQL would be. The number of queries will be enormous (around 10 billion) and would completely dominate performance. However, batching queries would be a possible workaround: iterating over a whole table can be done in a single query, and writing to a table (from which no data will be read in the same processing step) can happen in a batched query. But still, I guess that serializing, transmitting between processes and de-serializing SQL queries would be the bottleneck.
Nice-to-have: easy to use. It would be very nice if the relations could be used in a similar way to the C++ standard and Qt container classes.
Non-requirements (Why I don't need a DBMS):
Synchronizing writing and reading from/to the same relation. The application is split into multiple processing steps; every step has a set of "input relations" it reads from and "output relations" it writes to. However, some steps require reading some columns of a relation while writing to other columns of the same relation.
Joining relations. There are a few cross-references between different relations; however, they can be resolved within my application if lookups are fast enough.
Persistent storage. Once the conversion is done, all the data will not be required anymore.
The key-value-based relations will never be re-keyed; the array-based relations will never be re-indexed.
I can think of several possible solutions depending on lots of factors that you have not quantified in your question.
If you want a simple store to look things up and you have sufficient disk, SQLite is pretty efficient as a database. Note that there is no SQLite server; the 'server' is linked into your application.
Personally this job smacks of being embarrassingly parallel. I would think that a small Hadoop cluster would make quick work of the entire job. You could spin it up in AWS, process your data, and shut it down pretty inexpensively.