mapreduce.input.fileinputformat.split.minsize not working - mapreduce

I have a mapreduce job.
My dfs.blockSize is 134217728 (128 MB). I have a very large Hive table with 189 blocks, and I don't want the job to create 189 mappers (that consumes too much memory).
I set mapreduce.input.fileinputformat.split.minsize=268435456 (256 MB), but the job still creates 189 mappers. I assumed this setting would reduce the number of mappers, but it didn't work.
Appreciate any help, thanks.

I ran into the same problem. You can try the other configuration properties, such as mapreduce.input.fileinputformat.split.minsize.per.node and mapreduce.input.fileinputformat.split.minsize.per.rack; it worked for me when I set these two.
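For what it's worth, those per-node/per-rack settings are read by CombineFileInputFormat-style input formats, which pack several blocks into one split. If you control the MapReduce job yourself (rather than going through Hive), a minimal sketch of wiring this up might look like the following; the 256 MB values mirror the question, and the job name is made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

    public class FewerMappers {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Upper bound on how many bytes a combined split may contain (256 MB).
            conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 268435456L);
            // The per-node / per-rack minimums mentioned above.
            conf.setLong("mapreduce.input.fileinputformat.split.minsize.per.node", 268435456L);
            conf.setLong("mapreduce.input.fileinputformat.split.minsize.per.rack", 268435456L);

            Job job = Job.getInstance(conf, "fewer-mappers");
            // CombineTextInputFormat merges neighbouring blocks into one split,
            // so far fewer than one mapper per HDFS block is launched.
            job.setInputFormatClass(CombineTextInputFormat.class);
            // ... set mapper/reducer classes, input/output paths, and submit as usual.
        }
    }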

Related

Is there a way to specify the number of mappers in Scalding?

I am new to the Scalding world. My Scalding job will have multiple stages, and I need to tune each stage individually.
I have found that we might be able to change the number of reducers by using withReducers. I am also able to set the split size for the input data through the job config. However, I didn't see any way to change the number of mappers for my sub-tasks on the fly.
Did I miss something? Does anyone know how to specify the number of mappers for my sub-tasks? Thanks.
I got some answers/ideas that might be helpful for someone else with the same question.
It is much easier to control reducers compared to mappers.
Mappers are controlled by Hadoop, and there is no similarly simple knob. You can set some config parameters to give Hadoop an idea of how many map tasks to launch.
This Stack Overflow question may be helpful:
Setting the number of map tasks and reduce tasks
One workaround I can think of is splitting your big task into smaller ones, so you can individually tweak the size (and hence the number of mappers) of each one's input data.
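To expand on the split-size angle: Scalding (through Cascading) delegates split computation to the underlying Hadoop input format, so the practical lever for the mapper count is the split-size settings passed to each step. Here is a back-of-the-envelope sketch of the standard FileInputFormat arithmetic; the block size, settings, and input size are illustrative, not taken from the question:

    // How FileInputFormat-style split sizing decides the number of map tasks.
    public class SplitMath {
        public static void main(String[] args) {
            long blockSize = 128L << 20;      // HDFS block size: 128 MB
            long minSize   = 256L << 20;      // mapreduce.input.fileinputformat.split.minsize
            long maxSize   = Long.MAX_VALUE;  // mapreduce.input.fileinputformat.split.maxsize

            // splitSize = max(minSize, min(maxSize, blockSize))
            long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));

            long inputBytes = 24L << 30;      // pretend this stage reads 24 GB
            long mapTasks = (inputBytes + splitSize - 1) / splitSize;  // ceiling division
            System.out.println(splitSize + " bytes per split -> ~" + mapTasks + " map tasks");
        }
    }

Raising the min split size above the block size (or lowering the max below it) changes the split size, and with it the number of map tasks launched for that stage.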

Google AutoML Importing text items very slow

I'm importing text items into Google's AutoML. Each row contains around 5000 characters, and I'm adding 70K of these rows. This is a multi-label data set. There is no progress bar or indication of how long this process will take; it's been running for a couple of hours. Is there any way to calculate the time remaining or the total estimated time? I'd like to add additional data sets, but I'm worried that this will be a very long process before the training even begins. Any sort of formula to produce even a semi-wild guess would be great.
-Thanks!
I don't think that's possible today, but I filed a feature request [1] that you can follow for updates. I asked for both training and importing data, since it could be useful for training too.
I tried training with 50K records (~300 bytes/record) and the load took more than 20 mins, after which I killed it. I retried with 1K, which ran for 20 mins and then emailed me an error message saying I had multiple labels per input (yes, so what? training data is going to have some of those) and that I had >100 labels. I simplified the classification buckets and re-ran. It took another 20 mins and was successful. Then I ran training, which took 3 hours and billed me $11. That maps to $550 for 50K records, assuming linear behavior. The prediction results were not bad for a first pass, but I got the feeling that it is throwing a super large neural net at the problem. It would help if they said what NN it was and its dimensions. They do say "beta" :)
Don't waste your time trying to use Google for text classification. I am a heavy GCP user, but Microsoft LUIS is far better, more precise, and so much faster that I can't believe both products are trying to solve the same problem.
LUIS has much better documentation, supports more languages, has a much better test interface, and is way faster. I don't know yet whether it is cheaper, because the pricing model is different, but we are willing to pay more.

How does Hadoop calculate physical memory and virtual memory during job execution?

I have a few questions about the counters Hadoop uses to report memory usage.
A MapReduce job executed on a cluster gives me the counter values below. The input file is only a few KB, yet these counters show 35 GB and 420 GB of usage:
PHYSICAL_MEMORY_BYTES=35110662144
VIRTUAL_MEMORY_BYTES=420121841664
Another job on the same input file shows 309 MB (physical) and 3 GB (virtual) usage:
PHYSICAL_MEMORY_BYTES=309526528
VIRTUAL_MEMORY_BYTES=3435827200
The first job is more CPU intensive and creates more objects than the second, but its reported usage still seems very high.
So I just wanted to know how this memory usage is calculated. I tried going through some posts; the link below appears to be the requirements task for describing these variables (https://issues.apache.org/jira/i#browse/MAPREDUCE-1218), but I couldn't find how they are calculated. It gives me an idea of how these values are passed to the JobTracker, but no information on how they are determined. If someone could give some insight into this, it would be really helpful.
You can find a few references here and here. The second link in particular covers map and reduce jobs and how slots are decided based on memory allocations. Happy learning!
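As far as I understand it, each running task samples its own process tree's memory from the OS (on Linux, via /proc) and reports the values as task counters, and the job-level numbers you see are aggregated over all task attempts, which is why even a tiny input can show tens of GB once many tasks (or memory-hungry tasks) have run. If you want to pull these counters out programmatically, here is a small sketch using the standard MapReduce counter API:

    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.TaskCounter;

    public class MemoryCounters {
        // Call this after job.waitForCompletion(...) on your own Job instance.
        public static void print(Job job) throws Exception {
            Counters counters = job.getCounters();
            long physical = counters.findCounter(TaskCounter.PHYSICAL_MEMORY_BYTES).getValue();
            long virtual  = counters.findCounter(TaskCounter.VIRTUAL_MEMORY_BYTES).getValue();
            System.out.printf("physical: %.1f GB, virtual: %.1f GB%n",
                    physical / 1e9, virtual / 1e9);
        }
    }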

Inner workings of Elasticsearch?

I want to learn how Elasticsearch works, because I have concerns about the scalability of my design. I have 50 million documents. Every document has around 50 string properties, 45 integer properties, and 5 datetime properties.
My concern is this: when I query ES with a predicate containing 8 fields and 3 sorts based on date and integer values, how does ES perform? What happens in the background, and how do I ensure the performance holds when the system reaches 500 million documents?
The link blackpop provided in the comment is a good start for understanding what's going on. But you don't need to understand everything to make things work. The good thing about Elasticsearch is that it's elastic: it scales very well, so if you need more performance you just add more RAM/CPU/servers and maybe configure a cluster (at which point you should learn something about shards and nodes).
By the way, your scenario does not seem to be a very hard task for Lucene (on which ES is based) if you need queries to complete in under a second or so. We use a similar setup with > 200 M docs on a single mid-range server (around 2500 euro). I would encourage you to run real-life tests on your desktop/laptop, indexing the 50 M docs. We did this, too.
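If it helps to see what such a query looks like in code: exact-match predicates normally go into the filter context (no scoring, cacheable), and sorting on date/integer fields is served from doc values, which is what keeps this kind of query fast. Below is a minimal sketch using the Elasticsearch high-level REST Java client; the index name "documents" and all field names are made up:

    import org.apache.http.HttpHost;
    import org.elasticsearch.action.search.SearchRequest;
    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestClient;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.index.query.BoolQueryBuilder;
    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.search.builder.SearchSourceBuilder;
    import org.elasticsearch.search.sort.SortOrder;

    public class FilteredSortedSearch {
        public static void main(String[] args) throws Exception {
            try (RestHighLevelClient client = new RestHighLevelClient(
                    RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

                // Filter context: exact-match predicates, no relevance scoring.
                BoolQueryBuilder query = QueryBuilders.boolQuery()
                        .filter(QueryBuilders.termQuery("status", "active"))
                        .filter(QueryBuilders.rangeQuery("amount").gte(100).lte(5000));
                        // ... add the remaining predicate fields the same way

                SearchSourceBuilder source = new SearchSourceBuilder()
                        .query(query)
                        .sort("created_at", SortOrder.DESC)   // date sort
                        .sort("priority", SortOrder.ASC)      // integer sorts
                        .sort("amount", SortOrder.DESC)
                        .size(20);

                SearchRequest request = new SearchRequest("documents").source(source);
                SearchResponse response = client.search(request, RequestOptions.DEFAULT);
                System.out.println("hits: " + response.getHits().getTotalHits());
            }
        }
    }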

What is the fastest way to retrieve all items in SQLite?

I am programming on Windows, and I store my info in SQLite.
However, I find that retrieving all items is a bit slow.
I am using the following query:
select * from XXX;
Retrieving all items from a 1.7 MB SQLite DB takes about 200-400 ms.
It is too slow. Can anyone help?
Many Thanks!
Thanks for your answers!
I have to do a complex operation on the data, so every time I open the app, I need to read all the information from the DB.
I would try the following:
Vacuum your database by running the "vacuum" command
SQLite starts with a default cache size of 2000 pages (run the command "pragma cache_size" to be sure). Each page is 512 bytes, so it looks like you have about 1 MByte of cache, which is not quite enough to contain your database. Increase your cache size by running "pragma default_cache_size=4000". That should get you 2 MBytes of cache, which is enough to hold your entire database. You can run these pragma commands from the sqlite3 command line, or through your program as if they were just another query (see the sketch after this list).
Add an index to your table on the field you are ordering by.
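In case it helps, here is a small sketch of running those commands from a program. It assumes the Xerial sqlite-jdbc driver, and the database file, table, and column names are placeholders (the question does not say which language or driver is being used):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class SqliteTuning {
        public static void main(String[] args) throws Exception {
            // "infos.db", the "items" table, and its columns are placeholders.
            try (Connection conn = DriverManager.getConnection("jdbc:sqlite:infos.db");
                 Statement st = conn.createStatement()) {

                st.execute("VACUUM");                    // defragment the database file
                st.execute("PRAGMA cache_size = 4000");  // cache size is counted in pages
                st.execute("CREATE INDEX IF NOT EXISTS idx_items_created ON items(created)");

                // Select only the columns you need instead of SELECT *.
                try (ResultSet rs = st.executeQuery(
                        "SELECT id, name, created FROM items ORDER BY created")) {
                    while (rs.next()) {
                        // process each row here
                    }
                }
            }
        }
    }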
You could possibly speed it up slightly by selecting only those columns you want, but otherwise nothing will beat an unordered select with no where clause for getting all the data.
Other than that, a faster disk/CPU is your only option.
What type of hardware is this on?