Storing and processing a lot of tiny files - mapreduce

Our servers hold a large number of files ranging in size from 1 KB to 5 MB, about 7 TB in total. The processing algorithm reads each file and makes some decisions about it. The files come in several formats (doc, txt, png, bmp, etc.), so I can't merge them into bigger files.
How can I store and process these files efficiently? Which technology fits this task well?

You can use various technologies to store and process these files. Some options are listed below.
1 Apache Kafka: You can create a separate topic for each format and push your data to these topics.
Advantage:
You can easily scale your consumption speed to match your load.
2 Hadoop: You can store your data in HDFS and design MR jobs to process it.
3 You can use any document-store NoSQL database to store your data.
Note: All of the above solutions store your data in a distributed fashion and can run on commodity machines.
4 Store your data in the cloud (AWS, Google, Azure) and use their APIs to fetch and process it (if you also want your data to be shared with other applications).

Start by segregating the files into different directories based on type. You can even partition within the individual directories, for example /data/images/YYYY-MM-DD and /data/text/YYYY-MM-DD.
Use MultipleInputs with an appropriate InputFormat for each path.
If needed, normalize the data into a generic format before sending it to the reducer.
There are several ways to ingest the data:
Use Kafka to store data under different topics based on type (image, text) and then copy it from Kafka to HDFS.
Use Flume.
Because you have a huge amount of data, roll it up in HDFS on a weekly basis. You can use Oozie or Falcon to automate the weekly rollup.
Use CombineFileInputFormat in your Spark or MR code.
Last but not least, map the data as a table using Hive to expose it to external clients.
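The directory layout from the first step can be sketched roughly as follows. This is only an illustration: the extension-to-directory mapping, the "documents" and "other" directory names, and the function name are assumptions, while /data/images and /data/text come from the example above.

```python
from datetime import date
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to a top-level type directory.
TYPE_DIRS = {
    ".png": "images",
    ".bmp": "images",
    ".txt": "text",
    ".doc": "documents",
}


def target_dir(filename: str, day: date, root: str = "/data") -> str:
    """Return the partition directory for a file, e.g. /data/images/2024-01-15."""
    ext = PurePosixPath(filename).suffix.lower()
    kind = TYPE_DIRS.get(ext, "other")  # unknown formats go to a catch-all directory
    return f"{root}/{kind}/{day.isoformat()}"


print(target_dir("scan_001.PNG", date(2024, 1, 15)))  # /data/images/2024-01-15
print(target_dir("notes.txt", date(2024, 1, 15)))     # /data/text/2024-01-15
```

Keeping the routing rule in one small function makes it easy to reuse the same layout from whichever ingestion tool you end up with.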

Hadoop Archives (HAR) are the usual way to address this.
More details are available at: https://hadoop.apache.org/docs/r2.7.0/hadoop-archives/HadoopArchives.html
You also have the option to use SequenceFile or HBase, as described in: https://blog.cloudera.com/blog/2009/02/the-small-files-problem/
But looking at your use case, HAR fits the bill.
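To show the idea these containers share (many small files stored as name/content records inside one large file), here is a minimal sketch in plain Python. It is not the HAR or SequenceFile on-disk format; the record layout and the function names are made up for illustration.

```python
import io
import struct
from typing import BinaryIO, Dict, Iterator, Tuple

# Illustrative record header: name length and payload length as big-endian uint32.
_HEADER = struct.Struct(">II")


def pack(files: Dict[str, bytes], out: BinaryIO) -> None:
    """Append every small file to one container as a (name, payload) record."""
    for name, payload in files.items():
        encoded = name.encode("utf-8")
        out.write(_HEADER.pack(len(encoded), len(payload)))
        out.write(encoded)
        out.write(payload)


def unpack(src: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Yield (name, payload) records back out of the container, in order."""
    while True:
        header = src.read(_HEADER.size)
        if not header:
            return  # clean end of container
        name_len, payload_len = _HEADER.unpack(header)
        name = src.read(name_len).decode("utf-8")
        yield name, src.read(payload_len)


buf = io.BytesIO()
pack({"a.txt": b"hello", "b.png": b"\x89PNG..."}, buf)
buf.seek(0)
for name, payload in unpack(buf):
    print(name, len(payload))
```

Because each record keeps its original file name and raw bytes, files of different formats can share one container without being merged into each other.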

Related

Architecture to process AWS S3 files

I am working on a POC where we have millions of existing compressed JSON files in S3 (3+ MB uncompressed, with nested objects and arrays), with more being added every few minutes. We need to perform computations on the uncompressed data (on a per-file basis) and store the results in a DB table where we can then perform some column operations. The most common solution I found online is
S3 (Add/update event notification) => SQS (main queue => dlq queue) <=> AWS lambda
We have a DB table for all S3 bucket key names that are being successfully loaded, so I can query this table and use the AWS SDK Node.js package to send messages to the SQS main queue. For newly added/updated files, S3 event notification will take care of it.
I think the above architecture will work in my case, but are there any other AWS services I should look at?
I looked at AWS Athena which can read my compressed files and can give me the raw output but since I have big nested objects and arrays on top of which I need to perform computation, I am not sure if it's ideal to write such complex logic in SQL.
I would really appreciate some guidance here.
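For the S3 => SQS => Lambda flow described above, the Lambda side first has to unwrap the S3 notification carried in each SQS message body. A minimal sketch of that step, written in Python rather than Node.js; the function name and the sample bucket and key are hypothetical, and it assumes S3 publishes directly to the queue (no SNS in between):

```python
import json
from typing import Iterator, Tuple
from urllib.parse import unquote_plus


def s3_objects(sqs_event: dict) -> Iterator[Tuple[str, str]]:
    """Yield (bucket, key) for every S3 record inside an SQS-triggered Lambda event."""
    for message in sqs_event.get("Records", []):
        body = json.loads(message["body"])
        # S3 test events and unrelated messages carry no "Records" list.
        for record in body.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded in S3 notifications.
            key = unquote_plus(record["s3"]["object"]["key"])
            yield bucket, key


sample_body = {
    "Records": [
        {"s3": {"bucket": {"name": "my-bucket"}, "object": {"key": "data/file+1.json.gz"}}}
    ]
}
sample_event = {"Records": [{"body": json.dumps(sample_body)}]}
print(list(s3_objects(sample_event)))  # [('my-bucket', 'data/file 1.json.gz')]
```

If the backfill messages you send from your key table use the same body shape, one handler can serve both existing and newly added files.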
If you plan to query the data in the future in ways you can't anticipate, I would strongly suggest you explore the Athena solution, since you would be plugging a very powerful SQL engine on top of your data. Athena can directly query compressed JSON and export it to other data formats that are a lot more efficient to query (like Parquet or ORC) and that support complex data structures.
The flow would be:
S3 (new file) => Athena ETL (json to, say, parquet)
see e.g. here.
For already existing data you can do a one-off query to convert it to the appropriate format (partitioning would be useful if your data volume is as big as it seems). Good partitioning is key to good performance on Athena, and you will need to think carefully about it in your ETL. More on partitioning, e.g., there.
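As a rough illustration of what a partitioned layout can look like, the sketch below builds Hive-style key=value object keys, which Athena can map to partitions. Partitioning by year/month/day is an assumption (the right columns depend on your queries), and the table prefix, file name, and function name are hypothetical.

```python
from datetime import datetime


def partitioned_key(table: str, ts: datetime, filename: str) -> str:
    """Build a Hive-style partitioned S3 key for one converted file."""
    return (
        f"{table}/year={ts.year:04d}/month={ts.month:02d}/day={ts.day:02d}/{filename}"
    )


print(partitioned_key("events", datetime(2024, 1, 15, 9, 30), "part-0001.parquet"))
# events/year=2024/month=01/day=15/part-0001.parquet
```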

Building Google Cloud Platform Data Catalog on unstructured data

I have unstructured data in the form of document images. We are converting these documents to JSON files. I now want to have technical metadata captured for this. Can someone please give me some tips/best practices for building a data catalog on unstructured data in Google Cloud Platform?
This answer assumes that you are not using a tool such as BigQuery, Hive, or Presto to create schemas around your unstructured data and query it, and that you simply want to catalog your files.
I had a similar use case. Google Data Catalog has an option to create custom entries.
Some tips on building a Data Catalog on unstructured files data:
1 Use meaningful file names for your JSON files. That way searching for them will be easier.
2 Since you are already using GCP, use its managed Data Catalog, and leverage the custom entries API to ingest the file metadata into it.
3 In case you also want to look for sensitive data in your JSON files, you could run DLP on them.
4 Use Data Catalog Tags to enrich the file metadata. The tutorial on the link shows how to do it on BigQuery tables, but you can do the same on custom entries.
5 I would add some information about the ETL jobs that convert these documents into JSON files as Tags, such as execution time, data quality score, user, business owner, etc.
In case you are wondering how to do step 2, I put together a script that does it automatically:
link for the GitHub.
Another option is to work with Data Catalog Filesets.
So between using custom entries and filesets, I'd ask you this: do you need information about your file names?
If not, then filesets might be easier. At the time of this writing they do not show any info about your file names, but they are good for managing file patterns in GCS buckets: a fileset is defined by one or more file patterns that specify a set of one or more Cloud Storage files.
The datacatalog-util tool also has an option to enrich your filesets, in case you just want to have statistics about them, like average file size, types, etc.
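Before anything is sent to Data Catalog, the technical metadata itself has to be collected from each JSON file. A minimal sketch of that collection step follows; it does not call the Data Catalog API, and the field names and function name are assumptions about what you might want to attach to a custom entry or tag.

```python
import hashlib
import json
from datetime import datetime, timezone


def technical_metadata(name: str, raw: bytes) -> dict:
    """Collect basic technical metadata for one JSON file (hypothetical field names)."""
    doc = json.loads(raw)
    return {
        "file_name": name,
        "size_bytes": len(raw),
        "sha256": hashlib.sha256(raw).hexdigest(),
        # Top-level keys give a cheap hint of the document's structure.
        "top_level_fields": sorted(doc) if isinstance(doc, dict) else [],
        "captured_at": datetime.now(timezone.utc).isoformat(),
    }


raw = json.dumps({"title": "Invoice 17", "pages": 3}).encode("utf-8")
print(technical_metadata("invoice_17.json", raw))
```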

Incremental updates of data in an S3 data lake

I'm new to AWS and come from a data warehousing ETL background. We are currently moving to the cloud with an AWS data lake, loading data from our external source RDBMS into an Amazon S3 landing layer (bucket) using Sqoop jobs, and then into different layers (buckets) in Amazon S3 using Informatica BDM.
We receive data from the external source system daily. I'm not sure how we should implement delta loads/SCD types in S3. Is it possible to change an object after creating it in an Amazon S3 bucket, or do we have to keep creating a copy of each day's load as a new object in the bucket?
I understand Amazon gives us database options but we are directed to load data into Amazon S3.
Amazon S3 is simply a storage system. It will store whatever data is provided.
It is not possible to 'update' an object in Amazon S3. An object can be overwritten (replaced), but it cannot be appended to.
Traditionally, information in a data lake is appended by adding additional files, such as a daily dump of information. Systems that process data out of the data lake normally process multiple files. In fact, this is a more efficient process since data can be processed in parallel rather than attempting to read a single, large file.
So, your system can either do a new, complete dump that replaces data or it can store additional files with the incremental data.
Another common practice is to partition data, which puts files into different directories such as a different directory per month or day or hour. This way, when a system processes data in the data lake, it only needs to read files in the directories that are known to contain data for a given time period. For example, if a query needs to process data for a given month, it only needs to read the directory with data for that month, which speeds up processing. (Partitions can also be hierarchical, such as having directories for hour inside day inside month.)
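To make the pruning idea concrete, here is a small sketch that lists only the day-level directories a query for a given date range would need to read. The year/month/day layout, the bucket name, and the function name are assumptions chosen for illustration.

```python
from datetime import date, timedelta
from typing import List


def day_prefixes(root: str, start: date, end: date) -> List[str]:
    """List the day-level directories covering start..end inclusive."""
    days = (end - start).days
    return [
        f"{root}/{d.year:04d}/{d.month:02d}/{d.day:02d}/"
        for d in (start + timedelta(days=i) for i in range(days + 1))
    ]


# Hypothetical bucket and table names.
for prefix in day_prefixes("s3://lake/orders", date(2024, 2, 28), date(2024, 3, 1)):
    print(prefix)
```

Anything outside the returned prefixes is never listed or read, which is where the speed-up comes from.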
To answer your question of "how do we have to implement Delta load/SCD Types in S3", it really depends on how you will use the data once it is in the data lake. It would be good to store the data in a manner that helps the system that will eventually consume it.
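As one possible reading pattern (an assumption, not the only option), a consumer could fold the daily incremental files into a current view by keeping the latest row per key, which is comparable to a Type 1 view while the files themselves retain the history. The `id` key column and the function name are hypothetical.

```python
from typing import Dict, Iterable, List


def current_state(loads: Iterable[List[dict]], key: str = "id") -> Dict[object, dict]:
    """Fold daily loads, oldest first, keeping the latest row per key."""
    state: Dict[object, dict] = {}
    for rows in loads:
        for row in rows:
            state[row[key]] = row  # a later load overwrites an earlier one
    return state


day1 = [{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Rome"}]
day2 = [{"id": 2, "city": "Milan"}, {"id": 3, "city": "Lima"}]
print(current_state([day1, day2]))
```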

Comparison of loading from different file formats in BigQuery

We currently load most of our data into BigQuery either via csv or directly via the streaming API. However, I was wondering if there were any benchmarks available (or maybe a Google engineer could just tell me in the answer) how loading different formats would compare in efficiency.
For example, if we have the same 100M rows of data, does BigQuery show any performance difference from loading it in:
parquet
csv
json
avro
I'm sure one of the answers will be "why don't you test it", but we're hoping that before architecting a converter or re-writing our application, an engineer could share with us what (if any) of the above formats would be the most performant in terms of loading data from a flat file into BQ.
Note: all of the above files would be stored in Google Cloud Storage: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage.
https://cloud.google.com/blog/big-data/2016/03/improve-bigquery-ingestion-times-10x-by-using-avro-source-format
"Improve BigQuery ingestion times 10x by using Avro source format"
The ingestion speed has, to this point, been dependent upon the file format that we export from BigQuery. In prior releases of the SDK, tables and queries were made available to Dataflow as JSON-encoded objects in Google Cloud Storage. Considering that every such entry has the same schema, this representation is extremely redundant, essentially duplicating the schema, in string form, for every record.
In the 1.5.0 release, Dataflow uses the Avro file format to binary-encode and decode BigQuery data according to a single shared schema. This reduces the size of each individual record to correspond to the actual field values.
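The redundancy described in the quote is easy to see with a small size comparison. This is not a BigQuery load benchmark and it does not use Avro: CSV with a single header row stands in for "schema stated once", and the sample rows are made up.

```python
import csv
import io
import json

# Made-up sample rows that all share one schema.
rows = [{"user_id": i, "country": "DE", "amount": i * 1.5} for i in range(1000)]

# Newline-delimited JSON repeats every field name in every record.
ndjson = "\n".join(json.dumps(r) for r in rows).encode("utf-8")

# CSV writes the field names once, in the header.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["user_id", "country", "amount"])
writer.writeheader()
writer.writerows(rows)
schema_once = buf.getvalue().encode("utf-8")

print(len(ndjson), len(schema_once))
```

Avro goes further than this stand-in by also encoding the values in binary, so treat the numbers only as a hint of why per-record field names are expensive.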
Take care not to limit your comparison to just benchmarks. Those formats also imply some limitations for the client that writes data into BigQuery, and you should also consider them. For instance:
The size limits for compressed files (https://cloud.google.com/bigquery/quotas#load_jobs)
CSV is quite "fragile" as a serialization format (no control over types, for instance)
Avro offers poor support for types like Timestamp, Date, Time.
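The point about CSV having no control over types can be seen with a quick round trip through Python's csv module (a generic illustration, independent of BigQuery):

```python
import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow([42, 3.5, True, None])  # int, float, bool, null
buf.seek(0)
row = next(csv.reader(buf))
print(row)  # ['42', '3.5', 'True', '']
```

Everything comes back as a string, and the null is indistinguishable from an empty string, so the loader has to be told (or guess) the schema.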

What big data tools or approach to be used

I have a central data store in AWS. I want to access multiple tables in that database and find patterns and make predictions from that data.
My tables hold transactional data such as call details, marketing campaign details, contact information of people, etc.
How can I integrate all this data for big data analysis, find the relationships, and store them efficiently?
I am confused about whether to use Hadoop or not, and which architecture would be best.
The easiest way for you to start is to export the tables you wish to analyze to CSV files and process them using Amazon Machine Learning.
The following guide describes the entire process:
http://docs.aws.amazon.com/machine-learning/latest/dg/tutorial.html
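The export step could look roughly like the sketch below. The question does not say which database engine is used, so an in-memory SQLite database stands in for it here; the table, its columns, and the function name are hypothetical.

```python
import csv
import io
import sqlite3
from typing import TextIO


def export_table(conn: sqlite3.Connection, table: str, out: TextIO) -> int:
    """Write one table to CSV with a header row and return the number of data rows."""
    # Table names cannot be bound as parameters, so only pass trusted names here.
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cursor.description])  # header row
    count = 0
    for record in cursor:
        writer.writerow(record)
        count += 1
    return count


# SQLite is only a stand-in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (id INTEGER, duration_s INTEGER)")
conn.executemany("INSERT INTO calls VALUES (?, ?)", [(1, 60), (2, 125)])
out = io.StringIO()
print(export_table(conn, "calls", out))
print(out.getvalue())
```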