Preprocessing data in EMR - mapreduce

I want to crunch 10 PB of data. The input data is in a proprietary format (stored in S3), and the first preprocessing step is to convert this proprietary data to CSV and move it back to S3. Due to some constraints, I can't couple the preprocessing step with the Map task. What would be the correct way to do that?
I'm planning to use AWS EMR for this. One way would be to run a separate EMR job with no reduce task and upload the data to S3 in the Map phase. Is there a better way, since running a MapReduce job without a reduce task just to preprocess data looks like a hacky solution?

It would seem you have at least two options:
Convert the data into a format you find easier to work with. You might want to look at formats such as Parquet or Avro. Using a map-only job for this is an appropriate method (see the sketch below); you would only add a reducer if you wanted to control the number of files produced, i.e. to combine lots of small files into larger ones.
Create a custom InputFormat and read the data directly. There are lots of resources on the net about how to do this. Depending on what this proprietary format looks like, you might need to do this anyway to achieve #1.
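If you use Hadoop Streaming for the map-only conversion, the mapper can be a short script like the sketch below. It assumes the proprietary format is line-oriented, and parse_proprietary_record is a hypothetical placeholder for your real decoding logic. You would run it with the reducer count set to zero (mapreduce.job.reduces=0) and -input/-output pointing at your S3 locations, so the CSV lands back in S3 straight out of the map phase.

#!/usr/bin/env python
# Map-only Hadoop Streaming mapper: proprietary record in, CSV line out.
import csv
import sys

def parse_proprietary_record(line):
    # Hypothetical placeholder: assumes one record per line, pipe-delimited.
    # Replace with whatever your proprietary format actually requires.
    return line.rstrip("\n").split("|")

writer = csv.writer(sys.stdout)
for line in sys.stdin:
    writer.writerow(parse_proprietary_record(line))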
A few things for you to think about are:
Is the proprietary format space efficient compared with other formats?
How easy is the format to work with? Would converting it to CSV make your processing jobs simpler?
Is the original data ever updated or added to? Would you continually need to convert it to another format or update already-converted data?

Related

Informatica Cloud Incremental Load

Hi guys,
I need to create a mapping to do incremental loads in Informatica Cloud. I know that I can do that with parameter files and $LastRunTime, but if I use flat files as parameter files, those files can be deleted, and using $LastRunTime I could end up with temporal gaps in the target.
Are there other ways to do incremental loads? Maybe using a lookup, or a way to use two sources in the same mapping, one reading the last written data and the second reading the source data, then comparing both and keeping only the newest records.
Any mechanism that reliably allows you to identify which records in your source need to be loaded into your target could be used to build an incremental ETL load - but without knowing your data it is impossible for anyone to tell you what would work for you.
You also need to distinguish between what would work in principle and what would work in practice. For example, comparing your source and target datasets might work with small datasets but would quickly become impractical as the size of either dataset grew.
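To make the general idea concrete outside of any particular tool, here is a minimal high-watermark sketch in Python (table and column names are invented for illustration): the target's highest last_modified value is used as the watermark, and only newer source rows are pulled on each run. This is the same principle $LastRunTime gives you inside Informatica Cloud, just with the watermark derived from the data itself rather than from the run time.

# Minimal high-watermark incremental load sketch (illustrative only).
# src_conn / tgt_conn are DB-API connections; table and column names are invented.
def load_incrementally(src_conn, tgt_conn):
    watermark = tgt_conn.execute(
        "SELECT COALESCE(MAX(last_modified), '1970-01-01') FROM target_table"
    ).fetchone()[0]
    rows = src_conn.execute(
        "SELECT id, payload, last_modified FROM source_table WHERE last_modified > ?",
        (watermark,),
    ).fetchall()
    tgt_conn.executemany(
        "INSERT INTO target_table (id, payload, last_modified) VALUES (?, ?, ?)", rows
    )
    tgt_conn.commit()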

AWS S3: distributed concatenation of tens of millions of json files in s3 bucket

I have an S3 bucket with tens of millions of relatively small JSON files, each less than 10 KB.
To analyze them, I would like to merge them into a small number of files, each having one JSON per line (or some other separator) and several thousands of such lines.
This would allow me to more easily (and performantly) use all kinds of big data tools out there.
Now, it is clear to me this cannot be done with one command or function call; rather, a distributed solution is needed, because of the number of files involved.
The question is whether there is something ready and packaged, or whether I must roll my own solution.
I don't know of anything out there that can do this out of the box, but you can pretty easily do it yourself. The solution also depends a lot on how fast you need to get this done.
2 suggestions:
1) list all the files, split the list, download each section, merge and re-upload.
2) list all the files, then go through them one at a time, read/download each and write it to a Kinesis stream. Configure Kinesis to dump the files to S3 via Kinesis Firehose.
In both scenarios the tricky bit is going to be handling failures and ensuring you don't get the data multiple times.
For completeness, if the files were larger (>5 MB) you could also leverage http://docs.aws.amazon.com/AmazonS3/latest/API/mpUploadUploadPartCopy.html which would allow you to merge files in S3 directly without having to download them.
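As a rough illustration of suggestion 1, the boto3 sketch below lists the keys, downloads them in batches, normalizes each document to a single line, and uploads the merged batches back to S3. Bucket names, prefixes and batch size are placeholders, and a real version would add retries, parallel workers, and bookkeeping of which keys have already been merged.

# Sketch of suggestion 1: list, download, merge to JSON-lines, re-upload.
import json
import boto3

s3 = boto3.client("s3")
SRC_BUCKET, SRC_PREFIX = "my-small-json-bucket", "events/"   # placeholders
DST_BUCKET, BATCH_SIZE = "my-merged-bucket", 5000

def iter_keys():
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SRC_BUCKET, Prefix=SRC_PREFIX):
        for obj in page.get("Contents", []):
            yield obj["Key"]

def flush(lines, part):
    s3.put_object(Bucket=DST_BUCKET,
                  Key="merged/part-%05d.jsonl" % part,
                  Body=("\n".join(lines) + "\n").encode("utf-8"))

batch, part = [], 0
for key in iter_keys():
    body = s3.get_object(Bucket=SRC_BUCKET, Key=key)["Body"].read()
    batch.append(json.dumps(json.loads(body), separators=(",", ":")))  # force one line
    if len(batch) >= BATCH_SIZE:
        flush(batch, part)
        batch, part = [], part + 1
if batch:
    flush(batch, part)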
Assuming each json file is one line only, then I would do:
cat * >> bigfile
This will concatenate all files in a directory into the new file bigfile.
You can now read bigfile one line at a time, json decode the line and do something interesting with it.
If your json files are formatted for readability, then you will first need to combine all the lines in the file into one line.
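If they are pretty-printed, a small script like this (file and directory names are placeholders) can do that collapsing for you by parsing each document and re-serializing it as a single line:

# Collapse pretty-printed JSON files into one-JSON-per-line output.
import glob
import json

with open("bigfile.jsonl", "w") as out:
    for path in glob.glob("downloaded/*.json"):    # local copies of the S3 objects
        with open(path) as f:
            doc = json.load(f)                     # parses regardless of formatting
        out.write(json.dumps(doc, separators=(",", ":")) + "\n")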

Hbase BulkLoad without mapreduce

I'm wondering if it is possible to write a Java program that does a bulk load into HBase. I'm on a Hadoop cluster, but for certain reasons I can't write a MapReduce job.
Thanks
BulkLoad works with HFiles. So if you already have HFiles, you can directly use LoadIncrementalHFiles to handle the bulk load.
Generally we use MapReduce to convert the data into that format and then perform the bulk load.
If you have a CSV file, you can use the ImportTsv utility to process your data into HFiles; use this link for more information.
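For reference, a typical two-step invocation looks roughly like this (the table name, column mapping and paths are placeholders, and the exact class names can vary between HBase versions): ImportTsv writes HFiles instead of doing puts, then the bulk-load tool moves them into the table.

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=, -Dimporttsv.columns=HBASE_ROW_KEY,cf:col1,cf:col2 -Dimporttsv.bulk.output=/tmp/hfiles mytable /user/me/data.csv
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/hfiles mytable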
Which route to take depends on which format your data is currently in.
One point to note: bulk loads do not use the write-ahead log (WAL). They skip that step and add data at a faster rate. If you have any other framework depending on the WAL, consider other options for adding data to HBase. Happy coding.

How to read/restore big data file (SEGY format) with C/C++?

I am working on a project which needs to deal with large seismic data in SEGY format (from several GB to TB). This data represents a 3D underground structure.
The data structure is like:
1st trace, 2,3,5,3,5,....,6
2nd trace, 5,6,5,3,2,....,3
3rd trace, 7,4,5,3,1,....,8
...
What I want to ask is: in order to read and process the data fast, do I have to convert it into another form, or is it better to read from the original SEGY file? And is there any existing C package for doing that?
If you need to access it multiple times and
if you need to access it randomly and
if you need to access it fast
then load it to a database once.
Do not reinvent the wheel.
When dealing with data of that size, you may not want to convert it into another form unless you have to - though some software does do just that. I found a list of free geophysics software on Wikipedia that looks promising; many are open source and read/write SEGY files.
Since you are a newbie to programming, you may want to consider if the Python library segpy suits your needs rather than a C/C++ option.
Several GB is rather medium-sized, if we are talking about post-stack data.
You may use SEG-Y and convert on the fly, or you may invent your own format; it depends on what you need to do. Without changing the SEG-Y format, it's enough to create an index of trace offsets (see the sketch after this answer). If the SEG-Y is stored inline-ordered, access along inlines is faster, although crossline access is not too bad either.
If it is 3D seismic, the best way to get the same quick access to all inlines and crosslines is to use your own format based on blocks of, e.g., 8x8 traces: loading whole blocks and picking traces out of them keeps access times very quick, 2-3 seconds. Or you may use an SSD, or have 2.5x your SEG-Y's size in RAM.
To quickly access time slices you have 2 ways: 3D blocks, or a second file stored as time slices (the quickest way). I did something like that 10 years ago; access times into a 12 GB SEG-Y were acceptable, 2-3 seconds in all 3 directions.
SEGY in database? Wow ... ;)
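Building on the trace-index idea above: for a fixed-trace-length post-stack SEG-Y you do not even need to store an index, because the byte offset of any trace can be computed from the standard layout (3600-byte file header, 240-byte trace headers). The sketch below assumes fixed-length traces with 4-byte IEEE float samples; IBM float data (format code 1) would need an extra conversion step, and the file name and trace numbers are placeholders.

# Seek directly to trace i in a fixed-trace-length SEG-Y file (sketch).
import struct

def read_trace(f, i, ns):
    trace_bytes = 240 + ns * 4                 # 240-byte trace header + ns 4-byte samples
    f.seek(3600 + i * trace_bytes + 240)       # skip file header, i traces, this trace's header
    return struct.unpack(">%df" % ns, f.read(ns * 4))   # assumes IEEE floats (format code 5)

with open("volume.sgy", "rb") as f:
    samples = read_trace(f, 1000, ns=1500)     # trace 1000 with 1500 samples (placeholders)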
The answer depends upon the type of data you need to extract from the SEG-Y file.
If you need to extract only the headers (Textual header, Binary header, Extended Textual File headers and Trace headers), they can be easily extracted by opening the file as binary and reading the relevant information from the byte locations given in the SEG-Y data exchange format documentation (rev 2). The extraction might depend upon the type of data (post-stack or pre-stack), and some headers might require conversion from one format to another (e.g. the Textual header is usually encoded in EBCDIC). The complete details about byte locations and encoding formats can be found in the documentation mentioned above.
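As a small illustration of that header extraction (the byte offsets below follow the standard SEG-Y layout, and error handling is omitted): the 3200-byte textual header is decoded from EBCDIC, and a few fields are unpacked from the 400-byte binary header that follows it.

# Read the textual and binary file headers of a SEG-Y file (sketch).
import struct

with open("volume.sgy", "rb") as f:             # placeholder file name
    text_header = f.read(3200).decode("cp500")  # textual header, EBCDIC-encoded
    binary_header = f.read(400)

# Big-endian 2-byte integers at their standard offsets within the binary header:
sample_interval = struct.unpack(">h", binary_header[16:18])[0]    # in microseconds
samples_per_trace = struct.unpack(">h", binary_header[20:22])[0]
format_code = struct.unpack(">h", binary_header[24:26])[0]        # 1 = IBM float, 5 = IEEE float
print(samples_per_trace, sample_interval, format_code)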
The extraction of trace data is a bit trickier and depends upon various factors such as the encoding and whether the number of trace samples is given in the trace headers. A careful reading of the documentation, and knowing what type of SEG-Y data you are working with, will make this task a lot easier.
Since you are working with the extracted data, I would recommend using already existing libraries (segpy is one of the best Python libraries I have come across). There are also numerous freely available SEG-Y readers; a very nice list has already been mentioned by Daniel Waechter. You can choose any one of them that suits your requirements and the file format it supports.
I recently tried to do something similar using C++ (although it has only been tested on post-stack data). The project can be found here.

How to parse freebase quad dump using Amazon mapreduce

I'm trying to extract movie information from Freebase; I just need the name of the movie and the name and ID of the director and of the actors.
I found it hard to do with Freebase's topic dumps, because there is no reference to the director's ID, just the director's name.
What is the right approach for this task? Do I need to somehow parse the whole quad dump using Amazon's cloud? Or is there an easier way?
You do need to use the quad dump, but it is under 4 GB and shouldn't require Hadoop, MapReduce, or any cloud processing to do. A decent laptop should be fine. On a couple year old laptop, this simple-minded command:
time bzgrep '/film/' freebase-datadump-quadruples.tsv.bz2 | wc -l
10394545
real 18m56.968s
user 19m30.101s
sys 0m56.804s
extracts and counts everything referencing the film domain in under 20 minutes. Even if you have to make multiple passes through the file (which is likely), you'll be able to complete your whole task in less than an hour, which should mean there's no need for beefy computing resources.
You'll need to traverse an intermediary node (a CVT in Freebase-speak) to get the actors, but the rest of your information should be connected directly to the subject film node.
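To show how simple a single pass can be, here is a sketch that streams the compressed dump and collects film-to-director edges plus English names. The quads are tab-separated (source, property, destination, value); the property names used here come from the Freebase /film and /type schemas, but check a few lines of your dump first to confirm which columns the language tag and the literal value land in.

# One streaming pass over the quad dump: film -> director edges plus names.
import bz2
import csv

directed_by = {}   # film mid -> list of director mids
names = {}         # mid -> English name

with bz2.open("freebase-datadump-quadruples.tsv.bz2", "rt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 4:
            continue
        source, prop, dest, value = parts[:4]
        if prop == "/film/film/directed_by":
            directed_by.setdefault(source, []).append(dest)
        elif prop == "/type/object/name" and dest == "/lang/en":
            names[source] = value

with open("films.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for film, directors in directed_by.items():
        for director in directors:
            writer.writerow([names.get(film, film), director, names.get(director, "")])

Note that keeping every English name in memory is wasteful; a second pass that only keeps names for the mids you actually collected would scale better.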
Tom
First of all, I completely share Tom's point of view and his suggestion. I often use UNIX command line tools to take 'interesting' slices of data out of the Freebase data dump.
However, an alternative would be to load Freebase data into a 'graph' storage system locally and use APIs and/or the query language available from that system to interact with the data for further processing.
I use RDF, since the data model is quite similar and it is very easy to convert the Freebase data dump into RDF (see: https://github.com/castagna/freebase2rdf). I then load it into Apache Jena's TDB store (http://incubator.apache.org/jena/documentation/tdb/) and use the Jena APIs or SPARQL for further processing.
Another reasonable and scalable approach would be to implement what you need in MapReduce, but this makes sense only if the amount of processing you do touches a large fraction of the Freebase data and is not as trivial as counting lines. This is more expensive than using your own machine: you need a Hadoop cluster, or you need to use Amazon EMR. (I should probably write a MapReduce version of freebase2rdf ;-))
My 2 cents.