I'm wondering if it is possible to write a Java program that does a bulk load on HBase. I'm on a Hadoop cluster, but for certain reasons I can't write a MapReduce job for this.
Thanks
Bulk load works with HFiles, so if you already have HFiles you can use LoadIncrementalHFiles directly to handle the bulk load.
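For example, a minimal sketch of that route; the table name and HFile directory are placeholders, and this uses the older org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles / HTable API (newer HBase releases expose this differently):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

    public class BulkLoadDriver {
        public static void main(String[] args) throws Exception {
            // Plain HBase client configuration; picks up hbase-site.xml from the classpath
            Configuration conf = HBaseConfiguration.create();

            // HDFS directory that already contains HFiles laid out per column family,
            // e.g. /user/me/hfiles/<columnFamily>/<hfile>  (path is an example)
            Path hfileDir = new Path("/user/me/hfiles");

            // "my_table" is a placeholder; the table and its column families must already exist
            HTable table = new HTable(conf, "my_table");

            // Moves the HFiles into the table's regions -- no MapReduce job involved
            LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
            loader.doBulkLoad(hfileDir, table);

            table.close();
        }
    }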
Generally a MapReduce job is used to convert the data into that format (HFiles) and then perform the bulk load.
If you have a CSV file, you can use the ImportTsv utility to process your data into HFiles; see the ImportTsv documentation for more information.
It depends on which format your data is in currently.
One point to note: bulk loads do not use the write-ahead log (WAL). They skip that step and add data at a faster rate, so if you have any other framework that depends on the WAL, consider other options for adding data to HBase. Happy coding.
I am trying to collect SNMP data from printers for later analysis with a prediction algorithm, so that I can foretell impending printer faults before they actually occur. I am looking for advice on how best to collect the data and prepare it in a dataset format like .csv so that I can feed it into my classifier.
Would really appreciate any help rendered
Cheers!
My approach might not be the most efficient one, but it is something you can start with and improve later.
What I would do in your case would be the following:
1) Create a Python script that polls every printer you need to poll, using PySNMP.
2) I'm not sure exactly where you want to collect your data from, but either way you can import the csv module in your poller script and write out a CSV file if that is what you want. Or, if you want the data inserted into an SQL database such as MySQL, you can push it there from your script as well (a rough sketch of the polling and CSV step follows below).
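If you would rather stay on the JVM than use Python, the same polling-plus-CSV idea looks roughly like this with SNMP4J; the printer address, community string and OID below are placeholders, so treat it as a sketch rather than a ready-made poller:

    import org.snmp4j.CommunityTarget;
    import org.snmp4j.PDU;
    import org.snmp4j.Snmp;
    import org.snmp4j.event.ResponseEvent;
    import org.snmp4j.mp.SnmpConstants;
    import org.snmp4j.smi.GenericAddress;
    import org.snmp4j.smi.OID;
    import org.snmp4j.smi.OctetString;
    import org.snmp4j.smi.VariableBinding;
    import org.snmp4j.transport.DefaultUdpTransportMapping;

    import java.io.FileWriter;
    import java.io.PrintWriter;

    public class PrinterPoller {
        public static void main(String[] args) throws Exception {
            // Placeholder printer address, community string and OID -- substitute your own
            // (e.g. page-counter or error-state OIDs from the Printer MIB).
            String printer = "udp:192.168.1.50/161";
            OID pageCountOid = new OID("1.3.6.1.2.1.43.10.2.1.4.1.1"); // example OID

            Snmp snmp = new Snmp(new DefaultUdpTransportMapping());
            snmp.listen();

            CommunityTarget target = new CommunityTarget();
            target.setCommunity(new OctetString("public"));
            target.setAddress(GenericAddress.parse(printer));
            target.setVersion(SnmpConstants.version2c);
            target.setRetries(2);
            target.setTimeout(1500);

            PDU pdu = new PDU();
            pdu.setType(PDU.GET);
            pdu.add(new VariableBinding(pageCountOid));

            ResponseEvent event = snmp.send(pdu, target);
            if (event != null && event.getResponse() != null) {
                String value = event.getResponse().get(0).getVariable().toString();
                // Append one row per poll: timestamp, printer, value -- the CSV your classifier consumes
                try (PrintWriter csv = new PrintWriter(new FileWriter("printer_data.csv", true))) {
                    csv.println(System.currentTimeMillis() + "," + printer + "," + value);
                }
            }
            snmp.close();
        }
    }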
Hope this helps:)
I want to crunch 10 PB of data. The input data is in a proprietary format (stored in S3), and the first preprocessing step is to convert this proprietary data to CSV and move it back to S3. Due to some constraints, I can't couple the preprocessing step with the Map task. What would be the correct way to do that?
I'm planning to use AWS EMR for this. One way would be to run a separate EMR job with no reduce task and upload the data to S3 in the Map phase. Is there a better way to do it, since running a MapReduce job without a reduce task just for preprocessing feels like a hacky solution?
It would seem you have at least two options:
Convert the data into a format you find easier to work with. You might want to look at formats such as Parquet or Avro. Using a map-only job for this is an appropriate method; you would only use a reducer in this case if you wanted to control the number of files produced, i.e. combine lots of small files into larger ones (see the driver sketch after these options).
Create a custom InputFormat and just read the data directly. There are lots of resources on the net about how to do this. Depending on what this proprietary format looks like, you might need to do this anyway to achieve #1.
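For option 1, the driver is just a normal Hadoop job with the number of reducers set to zero. Everything below (the bucket paths, the mapper that would turn one proprietary record into a CSV line) is a placeholder sketch rather than working conversion code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class ConvertToCsvDriver {

        public static class ProprietaryToCsvMapper
                extends Mapper<LongWritable, Text, NullWritable, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws java.io.IOException, InterruptedException {
                // Stand-in: a real implementation would decode the proprietary record here
                // (possibly via the custom InputFormat from option 2) and build a CSV row.
                context.write(NullWritable.get(), value);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "proprietary-to-csv");
            job.setJarByClass(ConvertToCsvDriver.class);
            job.setMapperClass(ProprietaryToCsvMapper.class);

            // Zero reducers = map-only job; mapper output goes straight to the output files
            job.setNumReduceTasks(0);

            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            // The S3 paths are examples only
            FileInputFormat.addInputPath(job, new Path("s3://my-bucket/raw/"));
            FileOutputFormat.setOutputPath(job, new Path("s3://my-bucket/csv/"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }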
A few things for you to think about are:
Is the proprietary format space efficient compared with other formats?
How easy is the format to work with? Would making it into a CSV make your processing jobs simpler?
Is the original data ever updated or added to? Would you continually need to convert it to another format or update already converted data?
I am new to MapReduce programming. I want to know if I can run a MapReduce program as a normal Java program without using Hadoop. What libraries should I include? Is it possible?
It is possible, but in that case you need to write each and every code block yourself, starting from map --> shuffle/sort --> reduce. To put it very simply, Hadoop is a framework that provides a lot of APIs to run MapReduce jobs: it takes care of reading the input from files, shuffling and sorting, and then calling the reduce function. You just need to understand the various Hadoop APIs and the flow of data; without the framework, you have to implement that flow yourself (see the sketch below).
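To make that concrete, here is a rough word-count sketch in plain Java that performs the three stages by hand: map, then group/sort, then reduce. The hard-coded input is just there to keep the example self-contained:

    import java.util.AbstractMap.SimpleEntry;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    public class PlainJavaWordCount {
        public static void main(String[] args) {
            List<String> input = Arrays.asList("hadoop is a framework", "a framework is code");

            // Map phase: emit (word, 1) pairs
            List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
            for (String line : input) {
                for (String word : line.split("\\s+")) {
                    mapped.add(new SimpleEntry<>(word, 1));
                }
            }

            // Shuffle/sort phase: group values by key (a sorted map stands in for Hadoop's sort)
            TreeMap<String, List<Integer>> grouped = new TreeMap<>();
            for (Map.Entry<String, Integer> pair : mapped) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }

            // Reduce phase: sum the values for each key
            for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
                int sum = entry.getValue().stream().mapToInt(Integer::intValue).sum();
                System.out.println(entry.getKey() + "\t" + sum);
            }
        }
    }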
I'm trying to extract movie information from Freebase; I just need the name of the movie and the name and ID of the director and of the actors.
I found it hard to do this using Freebase's topic dumps, because there is no reference to the director's ID, just the director's name.
What is the right approach for this task? Do I need to somehow parse the whole quad dump using Amazon's cloud, or is there some easier way?
You do need to use the quad dump, but it is under 4 GB and shouldn't require Hadoop, MapReduce, or any cloud processing to do. A decent laptop should be fine. On a couple year old laptop, this simple-minded command:
time bzgrep '/film/' freebase-datadump-quadruples.tsv.bz2 | wc -l
10394545
real 18m56.968s
user 19m30.101s
sys 0m56.804s
extracts and counts everything referencing the film domain in under 20 minutes. Even if you have to make multiple passes through the file (which is likely), you'll be able to complete your whole task in less than an hour, which should mean there's no need for beefy computing resources.
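If you prefer Java to the shell for that filtering pass, Apache Commons Compress can stream the .bz2 dump directly; the file name and the '/film/' filter mirror the command above, while the commented-out field handling is only a sketch of where you would pull out the names and IDs you need:

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

    public class FilmQuadFilter {
        public static void main(String[] args) throws Exception {
            long count = 0;
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                    new BZip2CompressorInputStream(
                            new FileInputStream("freebase-datadump-quadruples.tsv.bz2")),
                    StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    if (line.contains("/film/")) {
                        count++;
                        // Each quad is tab-separated: subject, predicate, object, value.
                        // String[] quad = line.split("\t", 4);
                        // ... keep the fields you care about (film name, director, actors) here
                    }
                }
            }
            System.out.println(count);
        }
    }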
You'll need to traverse an intermediary node (CVT in Freebase-speak) to get the actors, but the rest of your information should be connected directly to the subject film node.
Tom
First of all, I completely share Tom's point of view and his suggestion. I often use UNIX command line tools to take 'interesting' slices of data out of the Freebase data dump.
However, an alternative would be to load Freebase data into a 'graph' storage system locally and use APIs and/or the query language available from that system to interact with the data for further processing.
I use RDF, since the data model is quite similar and it is very easy to convert the Freebase data dump into RDF (see: https://github.com/castagna/freebase2rdf). I then load it into Apache Jena's TDB store (http://incubator.apache.org/jena/documentation/tdb/) and use the Jena APIs or SPARQL for further processing.
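To give a feel for that route (treat it as a sketch: the Jena package names have changed between the incubator-era com.hp.hpl.jena and the later org.apache.jena releases, and the predicate URI is a placeholder that depends on how freebase2rdf maps Freebase properties), querying the TDB store from Java looks roughly like this:

    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.QuerySolution;
    import org.apache.jena.query.ResultSet;
    import org.apache.jena.tdb.TDBFactory;

    public class FreebaseTdbQuery {
        public static void main(String[] args) {
            // Directory where the RDF produced by freebase2rdf was loaded (e.g. with tdbloader)
            Dataset dataset = TDBFactory.createDataset("/data/freebase-tdb");

            // Placeholder predicate URI -- the real one depends on the freebase2rdf conversion
            String query =
                "SELECT ?film ?director WHERE { " +
                "  ?film <http://rdf.example.org/film.film.directed_by> ?director " +
                "} LIMIT 10";

            QueryExecution qexec = QueryExecutionFactory.create(query, dataset);
            try {
                ResultSet results = qexec.execSelect();
                while (results.hasNext()) {
                    QuerySolution row = results.next();
                    System.out.println(row.get("film") + "\t" + row.get("director"));
                }
            } finally {
                qexec.close();
                dataset.close();
            }
        }
    }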
Another reasonable and scalable approach would be to implement what you need to do in MapReduce, but this makes sense only if the amount of processing you do touches a large fraction of the Freebase data and is not as trivial as counting lines. This is more expensive than using your own machine: you need a Hadoop cluster, or you need to use Amazon EMR. (I should probably write a MapReduce version of freebase2rdf ;-))
My 2 cents.
I have a binary file I'm creating in C++. I'm tasked with creating a metadata format that describes the data so that it can be read in Java using the metadata.
One record in the data file has a time, then 64 bytes of data, then a CRC, then a newline delimiter. What should the metadata look like to describe what is in the 64 bytes? I've never created a metadata file before.
You probably want to generate a file which describes how many entries there are in the data file, and maybe the time range. Depending on what kind of data you have, the metadata might contain either a per-record entry (RawData, ImageData, etc.) or one global entry (data stored as floats).
It totally depends on what the Java code is supposed to do and what use cases you have. If you want to know whether to open the file at all depending on the date, that should be part of the metadata, and so on.
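As a made-up example of the "one global entry" variant: the metadata could be a small fixed header carrying the record count, the time range, and a code saying how the 64 payload bytes are encoded. The field names and sizes below are assumptions, not a standard, and the Java side would read it like this:

    import java.io.DataInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;

    // Hypothetical global metadata header: all fields big-endian, written once by the C++ side.
    public class GlobalMetadata {
        long recordCount;    // number of records in the data file
        long firstTimestamp; // time of the first record (e.g. epoch millis)
        long lastTimestamp;  // time of the last record
        int payloadType;     // e.g. 1 = raw bytes, 2 = 16 little-endian floats, ...

        static GlobalMetadata read(String path) throws IOException {
            try (DataInputStream in = new DataInputStream(new FileInputStream(path))) {
                GlobalMetadata m = new GlobalMetadata();
                m.recordCount = in.readLong();
                m.firstTimestamp = in.readLong();
                m.lastTimestamp = in.readLong();
                m.payloadType = in.readInt();
                return m;
            }
        }
    }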
I think that maybe you have the design backwards.
First, think about the end.
What result do you want to see? Will a Java program create some kind of .csv file?
What kind(s) of file(s)?
What information will be needed to do this?
Then design the metadata to provide the information that is needed to perform the necessary tasks (and any extra tasks you anticipate).
Try to make the metadata extensible so that adding extra metadata in the future will not break the programs that you are writing now. For example, if the Java program finds metadata it doesn't understand, it should just skip it.
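One common way to get that kind of extensibility is a tag/length/value layout: every metadata entry starts with a tag and a byte length, so a reader can parse the entries it knows and ignore the rest. A sketch of the Java side, with made-up tag numbers and a big-endian layout as assumptions:

    import java.io.DataInputStream;
    import java.io.EOFException;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;

    public class MetadataReader {
        // Example tag numbers -- use whatever the C++ writer and the spec agree on
        static final int TAG_RECORD_COUNT = 1;
        static final int TAG_PAYLOAD_TYPE = 2;

        public static void main(String[] args) throws IOException {
            try (DataInputStream in = new DataInputStream(new FileInputStream("data.meta"))) {
                while (true) {
                    int tag;
                    try {
                        tag = in.readUnsignedShort();   // 2-byte tag
                    } catch (EOFException eof) {
                        break;                          // clean end of the metadata file
                    }
                    int length = in.readInt();          // 4-byte length of the value
                    byte[] value = new byte[length];
                    in.readFully(value);

                    switch (tag) {
                        case TAG_RECORD_COUNT:
                            System.out.println("records: " + ByteBuffer.wrap(value).getLong());
                            break;
                        case TAG_PAYLOAD_TYPE:
                            System.out.println("payload type: " + ByteBuffer.wrap(value).getInt());
                            break;
                        default:
                            // Tag written by a newer producer that this reader doesn't know:
                            // the value bytes were read above and are simply ignored.
                            break;
                    }
                }
            }
        }
    }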