I want to crunch 10 PB of data. The input data is in a proprietary format (stored in S3), and the first preprocessing step is to convert it to CSV and move it back to S3. Due to some constraints, I can't couple the preprocessing step with the Map task. What would be the correct way to do that?
I'm planning to use AWS EMR for this. One way would be to run a separate EMR job with no reduce task and upload the data to S3 in the Map phase. Is there a better way, since running a MapReduce job with no reduce task just for preprocessing feels like a hack?
It would seem you have at least two options:
Convert the data into a format you find easier to work with. You might want to look at formats such as Parquet or Avro. Using a map-only job for this is an appropriate method; you would only use a reducer here if you wanted to control the number of files produced, i.e. combine lots of small files into larger ones. A minimal driver sketch follows this list.
Create a custom InputFormat and read the data directly. There are lots of resources on the net about how to do this. Depending on what the proprietary format looks like, you might need to do this anyway to achieve #1. A skeleton appears at the end of this answer.
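For option #1, here is a minimal sketch of such a map-only conversion job using the org.apache.hadoop.mapreduce API. The class names, the pass-through parsing logic, and the example paths are placeholders for illustration, not your actual format; on EMR the input and output paths can simply be s3:// URIs.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ConvertToCsvJob {

    public static class ConvertMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // TODO: parse one record of the proprietary format and emit it as a CSV line.
            String csvLine = value.toString(); // placeholder pass-through
            context.write(NullWritable.get(), new Text(csvLine));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "proprietary-to-csv");
        job.setJarByClass(ConvertToCsvJob.class);

        // If you end up writing the custom InputFormat from option #2,
        // this is where it plugs in:
        // job.setInputFormatClass(ProprietaryInputFormat.class);

        job.setMapperClass(ConvertMapper.class);
        job.setNumReduceTasks(0);                  // map-only: mapper output goes straight to the output path
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. s3://bucket/raw/
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. s3://bucket/csv/
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Setting the number of reduce tasks to zero is a supported way to run a job, not a hack; the framework simply skips the shuffle and each mapper writes its output file directly.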
A few things for you to think about are:
Is the proprietary format space-efficient compared with other formats?
How easy is the format to work with? Would converting it to CSV make your processing jobs simpler?
Is the original data ever updated or appended to? Would you continually need to convert it to another format, or update data that has already been converted?
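For option #2, here is a skeleton of what a custom InputFormat can look like. The class names are invented for illustration, and the actual record parsing is left as a TODO since it depends entirely on the proprietary format. It treats each file as unsplittable, which is the safe default when you don't know where record boundaries fall inside a split:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ProprietaryInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // process each file whole; change this only if record boundaries are well defined
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new ProprietaryRecordReader();
    }

    public static class ProprietaryRecordReader extends RecordReader<LongWritable, Text> {
        private FSDataInputStream in;
        private long start, end, pos;
        private final LongWritable key = new LongWritable();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
            FileSplit fileSplit = (FileSplit) split;
            Configuration conf = context.getConfiguration();
            Path path = fileSplit.getPath();
            FileSystem fs = path.getFileSystem(conf);
            in = fs.open(path);
            start = fileSplit.getStart();
            end = start + fileSplit.getLength();
            in.seek(start);
            pos = start;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (pos >= end) {
                return false;
            }
            // TODO: read one record of the proprietary format from 'in', advance 'pos',
            // set 'key' (e.g. to the record offset) and fill 'value'.
            return false; // placeholder until the parsing above is written
        }

        @Override public LongWritable getCurrentKey() { return key; }
        @Override public Text getCurrentValue() { return value; }
        @Override public float getProgress() { return end == start ? 1.0f : (pos - start) / (float) (end - start); }
        @Override public void close() throws IOException { if (in != null) { in.close(); } }
    }
}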
I'm wondering if it is possible to write a Java program that does a BulkLoad on HBase. I'm on a Hadoop cluster, but for certain reasons I don't want to write a MapReduce job for this.
Thanks
BulkLoad works with HFiles. So if you already have HFiles, you can use LoadIncrementalHFiles directly to handle the bulk load.
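Here is a sketch of what that Java program could look like, assuming the HBase 1.x client API; the table name and HFile directory are placeholders, and the exact package and doBulkLoad signature vary a little between HBase releases, so check the docs for your version:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class HFileBulkLoad {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        TableName tableName = TableName.valueOf("my_table");       // placeholder
        Path hfileDir = new Path("hdfs:///tmp/hfiles_to_load");    // placeholder

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin();
             Table table = conn.getTable(tableName);
             RegionLocator locator = conn.getRegionLocator(tableName)) {

            // No MapReduce job involved: the existing HFiles are handed straight to the region servers.
            LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
            loader.doBulkLoad(hfileDir, admin, table, locator);
        }
    }
}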
Generally a MapReduce job is used to convert the data into that format and then perform the bulk load.
If you have CSV files, you can use the ImportTsv utility to process your data into HFiles; see the HBase ImportTsv documentation for more information.
It depends on which format your data is in currently.
A point to note: bulk loads do not use the write-ahead log (WAL); they skip that step, which is how they add data at a faster rate. If you have any other framework that depends on the WAL, consider other options for adding data to HBase. Happy coding.
Assume this is the data set:
Aspect                                  Evaluation   Quarter   Percentage
HOST/HOSTESS DIVERSIONS /687            Excellent    Q1        40%
ROCKIN' BAR D / WAVEBANDS/ EVOLUTION    Excellent    Q1        50%
KNOWLEDGE OF SERVER TEAM – ROTATION     Excellent    Q1        60%
I'm trying to generate an Excel sheet with the required colors and structure; assume the percentages above will be populated in the “% Within” column.
Is there any way to get the Excel output in this required format? I appreciate any help.
Thanks,
Sam
If you're going to do color and such, you have a few options. PROC EXPORT won't do it, of course. So instead you need to either use Excel Tagsets, use DDE, or create an unformatted sheet and use a macro from a template to copy the colors in.
Benefits/Drawbacks:
Excel Tagsets:
Benefits: Make the exact format entirely in SAS code. Have a great deal of control with a fairly simple interface. Uses the powerful PROC TEMPLATE to define styles, which allows highly portable and reusable code.
Drawbacks: Makes an .xml file that is readable by Excel, not actually a .xls/.xlsx file. Does have some limitations in what it can do. Can be buggy. Probably the slowest to code of the three options, unless you are very familiar with it.
DDE:
Benefits: Once you make the template in Excel, you can produce exactly what you want fully in SAS. Can do 100% of what Excel does.
Drawbacks: Uses somewhat outdated method, so fewer SAS programmers are familiar with it. Requires Excel to be installed on the machine, and open (you can open it as part of the DDE program). Somewhat slower to copy data in, and requires more careful checking to verify data went where it should go. Requires knowing DDE commands.
Template/copy:
Benefits: Likely the fastest method in terms of setup time. Can do everything exactly the way Excel does it. Easy for other programmers to understand, as long as they know Excel/VBA and SAS.
Drawbacks: Requires an outside-of-SAS step to run the copy macro (it could be called from SAS via DDE or a batch file, but more commonly would be done by hand). Does require some knowledge of VBA as well as SAS.
In general, I recommend trying Excel Tagsets first; if they don't work for your needs, try either of the other two options. Some good papers on Excel Tagsets for the beginner:
http://support.sas.com/resources/papers/proceedings11/170-2011.pdf
http://support.sas.com/resources/papers/proceedings12/207-2012.pdf
http://www2.sas.com/proceedings/forum2008/036-2008.pdf
I think you could create the above pretty easily using Excel Tagsets and PROC REPORT; follow the first paper in particular, as it seems the most similar to what you're doing. If you run into any issues, post them as separate questions and we should be able to help you out.
I have a huge text file (~5 GB) which is the database for my program. While the program runs, this database is read in full many times with string functions like string::find(), string::at(), string::substr()...
The problem is that this text file cannot be loaded into a single string, because string::max_size is definitely too small.
How would you implement this? My idea was: load a part into a string -> read -> close -> load another part into the same string -> read -> close -> ...
Is there a better/more efficient way?
How would you implement this?
With a real database, for instance SQLite. The performance improvement from having indexes will more than make up for the time spent learning another API.
Since this is a database, I'm assuming it has many records. To me that implies the best idea would be to implement a data class for each record and populate a list/vector/etc., depending on how you plan to use it. I'd also look into a persistent cache, since the file is big.
Within your container class of all records, you could implement search and other functions as you see fit. But as suggested, for data of this size you're probably best off using a real database.
I'm just learning C++, have just started to mess around with Qt, and I'm sitting here wondering how most applications save their data. Is there an industry standard? Do they store it in an XML file, a text file, SQLite? What about sensitive data that, say, accounting software would need to save? I'm just interested in learning what the best practices for this are.
Thanks
This question is way too broad. The only answer is that it depends on the nature of the particular application and its data; whether or not it is written in C++ has very little to do with it.
For example, user-configurable application settings are often stored in text files, but on Windows they are typically stored in the Registry. Accounting applications typically keep their data in a database of some sort.
There are many good ways to store application data (call it serialization).
Personally, I think for larger datasets, using an open format is much, much easier for debugging. If you go with XML, for example, you store your data in an open form, so if you have file corruption issues (i.e. a client can't open your file for some reason), the problem is easier to track down. If you have sensitive data in there, you can always encrypt it with a key before writing it to the file. Microsoft, for instance, has gone from a proprietary format to Open XML in its Office documents. They use the .*x extensions (.docx, .xlsx, etc.); such a file is really just a compressed folder of XML files.
Using binary serialization is, of course, the industry standard at the moment for most standalone applications. Most likely that is because of the application framework they are using (such as MFC, which is old). If you take a look at most of the serialization techniques in modern application frameworks, XML serialization is very well supported.
First you need to clarify what kind of data you would like to save.
If you just want to save some application settings, use QSettings to save your settings to an INI file or the registry.
If it is much more than just some application settings, go for XML files or SQL.
There is no standard practice, however if you want to use complex structured data, consider using an embedded database engine such as SQLite or Metakit, or Berkeley DB files. XML files would also do the job and be human readable/writable. Preferences can use INI files or the Windows registry, and so on. In short, it really depends on your usage pattern.
This is a general question. Like many things, the right answer depends on your application and its needs.
Most desktop applications save end-user data to a file (think Word and Excel). The format is up to you, XML, binary, etc. And if you can serialize/deserialize objects to file it will probably make your life easier.
Internal application data such as configuration files or temporary data might be saved to an XML file or a lightweight, local database such as SQLite.
Often, "enterprise" applications used internally by a business will save their data to a back-end database such as SQL Server or Oracle. This is so all of the enterprise's data is saved to a single central location. And then it is available for reporting, etc.
For accounting software, you would need to consider the business domain and end users. For example, if the software is to be sold to large businesses you would probably use some form of a database to save data. Otherwise a binary file would be fine, perhaps with some form of encryption if you are really paranoid.
When you say "the best way", you have to define what you mean by "good".
The problem is that various requirements conflict with each other, so you can't satisfy all of them simultaneously.
For example, if one requirement is "concurrent multi-user access to the data" then this suggests using a database engine, but that conflicts with "as small as possible" and "minimize dependencies on 3rd-party software".
If a requirement is "portable data format" then this suggests XML, but that conflicts with "compact" and "indexed".
Do they store it in an XML file, text file, SQLite?
Yes.
Also, binary files and relational databases.
Anything else?