My current HDFS block size is 128 MB, which is the default, and I want to change it because my project is going to store lots of text files in HDFS that are very small (less than 10 MB each). So I want to change the block size. Please give me some useful suggestions and comments; I already tried some tricks, but they were not useful.
Write this property into the hdfs-site.xml file, between the configuration tags (67108864 bytes = 64 MB; write whatever size you require, in bytes):

<property>
  <name>dfs.block.size</name>
  <value>67108864</value>
</property>
I need to upload 100 text files into HDFS to do some data transformation with Apache Pig.
In your opinion, which is the better option:
a) Compress all the text files and upload only one file,
b) Load all the text files individually?
It depends on your file sizes, cluster parameters, and processing methods.
If your text files are comparable in size to the HDFS block size (i.e. block size = 256 MB, file size = 200 MB), it makes sense to load them as is.
If your text files are very small, you will hit the typical HDFS small-files problem: each file will occupy one HDFS block (not physically), so the NameNode (which handles the metadata) will suffer some overhead from managing a lot of blocks. To solve this you could merge your files into a single one, use Hadoop Archives (HAR), or use some container file format (SequenceFiles, for example).
If a custom format is used, you will have to do extra work during processing, since custom input formats will be required.
In my opinion, 100 files is not enough to significantly affect NameNode performance, so both options seem viable.
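If you go the HAR route, the command looks roughly like this (the paths and archive name here are made up for illustration):

hadoop archive -archiveName texts.har -p /user/me textfiles /user/me/archives

This packs the contents of /user/me/textfiles into a single archive, which HDFS tools can still read through the har:// scheme.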
I am running Pig in local mode on a large 54 GB file. I observe it spawning lots of map tasks sequentially. What I am expecting is that each map task reads roughly 64 MB worth of lines. If I want to optimize this so that each task reads the equivalent of, say, 1 GB of lines:
a.) Is it possible? (Maybe by increasing the split size)
b.) How?
c.) Is there any other optimal approach?
thanks
You can increase split size by setting:
SET mapred.max.split.size #bytes
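For example, to ask for splits of roughly 1 GB (1073741824 bytes), you could put something like this at the top of your Pig script. Note that in the classic FileInputFormat the split size works out to max(minSize, min(maxSize, blockSize)), so depending on your Hadoop version you may also need to raise the minimum; newer releases know these properties as mapreduce.input.fileinputformat.split.maxsize/minsize:

SET mapred.max.split.size 1073741824;
SET mapred.min.split.size 1073741824;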
By default the block size is 64 MB. Try this to increase the block size:
Open the hdfs-site.xml file (usually found in the conf/ folder of the Hadoop installation directory) and set the following property in hdfs-site.xml:
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
  <description>Block size</description>
</property>
I know this is a simple question, but I need some help from this community on this query. When I create a partitioned table in ORC format and try to load data from a non-partitioned table that points to a 2 GB file with 210 columns, I see that the number of mappers is 2 and the number of reducers is 2. Is there a way to increase the mappers and reducers? My assumption is that we can't set the number of mappers and reducers as in MR 1.0; it is based on settings like the YARN container size and the mapper minimum and maximum memory. Can anyone explain to me how Tez calculates mappers and reducers, and what the best memory-size settings are so that I don't run into Java heap space / Java out-of-memory problems? My file size may grow up to 100 GB. Please help me with this.
You can still set the number of mappers and reducers in Yarn. Have you tried that? If so, please get back here.
YARN changes the underlying execution mechanism, but #mappers and #reducers describe the job's requirements, not the way the job's resources are allocated (which is where YARN and MRv1 differ).
Traditional MapReduce has a hard-coded number of map and reduce "slots". As you say, YARN uses containers, which are per-application, so YARN is more flexible. But the #mappers and #reducers are inputs of the job in both cases, and in both cases the actual number of mappers and reducers may differ from the requested number. Typically the #reducers would be either
(a) precisely the number that was requested, or
(b) exactly ONE reducer, if the job requires it, such as in a total ordering.
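As a hedged illustration of "requesting" counts from a Hive session, the settings below exist in MR2-era Hadoop/Hive; whether Tez honors each of them depends on your version:

set mapreduce.job.reduces=8;
set hive.exec.reducers.bytes.per.reducer=268435456;

The first asks for a fixed reducer count; the second lets Hive derive the count from the input size instead.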
For the memory settings, if you are using Hive with Tez, the following two settings will be of use to you:
1) hive.tez.container.size - this is the size of the YARN container that will be used (value in MB).
2) hive.tez.java.opts - these are the Java opts that will be used for each task. If the container size is set to 1024 MB, set the Java opts to something like "-Xmx800m" and not "-Xmx1024m". YARN kills processes that use more memory than the specified container size, and given that a Java process's memory footprint can usually exceed the specified Xmx value, setting Xmx to the same value as the container size usually leads to problems.
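A hedged example of what that looks like in a Hive session; the 4096 MB container and roughly 80% heap below are placeholders chosen to show the ratio, not tuned recommendations:

set hive.tez.container.size=4096;
set hive.tez.java.opts=-Xmx3276m;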
I'm trying to find a solution for storing a binary file at its smallest size on disk. I'm reading vehicle VINs and plate numbers from a database; each record is 30 bytes, and when I put one in a txt file and save it, its size is 30 B but its size on disk is 4 KB, which means that if I save 100,000 files or more it will kill the storage space.
So my question is: how can I write these 30 bytes to an individual binary file at its smallest size on disk, and what is the smallest possible on-disk size for 30 B, including other info such as the file name and permissions?
Note: I do not want to save the text in a database; I want to make separate binary files.
The smallest on-disk size of a file is always the cluster size of your disk, which is typically 4 KB. For data like this, having many records in a single file is really the only reasonable solution.
Although another possibility would be to store those files in an archive, a zip file for example. Under Windows you can even access the zip contents much like ordinary files in Explorer.
Another creative possibility: store all the data in the file name only. A zero-byte file takes only 1024 bytes in the MFT (assuming NTFS).
Edit: reading up on resident files, I found that on the newer 4K-sector drives the MFT entry is actually 4 KB too, so it doesn't get smaller than that, whether the data size is 0 or not.
Another edit: huge directories, with tens or hundreds of thousands of entries, will become quite unwieldy. Don't try to open one in Explorer, or be prepared to go drink a coffee while it loads.
Most file systems allocate disk space to files in chunks. It is not possible to take less than one chunk, except for possibly a zero-length file.
Google 'Cluster size'
You should consider using an indexed file library like gdbm: it associates arbitrary data with an arbitrary key, and you won't spend a file on each association (only a single file for all of them).
You should also reconsider your opposition to "databases". SQLite is a library that gives you SQL and database abilities, and there are NoSQL databases like MongoDB.
Of course, all of this is horribly operating-system and file-system specific (but gdbm and SQLite should work on many systems).
AFAIU, you can configure and use both gdbm and SQLite to store millions of entries of a few dozen bytes each quite efficiently.
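A minimal sketch of the SQLite route in C++, assuming the sqlite3 C API is available; the database file name, table name, and sample values are all illustrative:

#include <sqlite3.h>

int main() {
    sqlite3 *db = nullptr;
    // One database file holds every 30-byte record instead of one file per record.
    if (sqlite3_open("vehicles.db", &db) != SQLITE_OK) return 1;

    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS vehicles (vin TEXT, plate TEXT);",
        nullptr, nullptr, nullptr);

    // Insert one record using a prepared statement.
    sqlite3_stmt *stmt = nullptr;
    sqlite3_prepare_v2(db, "INSERT INTO vehicles (vin, plate) VALUES (?, ?);",
                       -1, &stmt, nullptr);
    sqlite3_bind_text(stmt, 1, "1HGCM82633A004352", -1, SQLITE_TRANSIENT);  // example VIN
    sqlite3_bind_text(stmt, 2, "ABC-1234", -1, SQLITE_TRANSIENT);           // example plate
    sqlite3_step(stmt);
    sqlite3_finalize(stmt);

    sqlite3_close(db);
    return 0;
}

All 100,000+ records end up inside vehicles.db, so the 4 KB-per-file cluster overhead disappears.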
On filesystems you have the same problem: the smallest allocation is one data block plus an inode. For example, in IBM JFS2 the smallest block size is 4 KB, and you also have an inode to allocate. The second problem is that you will be writing many files in a short time, and that causes performance problems, because every write operation must be journaled and committed (unless you use an old, non-journaled filesystem).
One idea is to group many of your data records, put a separator between them, and write 200-1000 of them into one file.
For example:
0102030400506070809101112131415;;0102030400506070809101112131415;;...
You can index them via the file name, sequence numbers, or so on.
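A rough C++ sketch of that packing idea; the file name, the ";;" separator, and the sample records below are arbitrary choices for illustration:

#include <fstream>
#include <string>
#include <vector>

// Append a batch of fixed-size records to one file, separated by ";;",
// so hundreds of records share a single 4 KB cluster instead of one each.
void append_batch(const std::string &path, const std::vector<std::string> &records) {
    std::ofstream out(path, std::ios::app | std::ios::binary);
    for (const auto &rec : records) {
        out << rec << ";;";
    }
}

int main() {
    append_batch("records_0001.bin", {"0102030400506070809101112131415",
                                      "0102030400506070809101112131415"});
    return 0;
}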
For a database class, we are implementing our own database, and I am having trouble figuring out how to implement block storage in C++ (where each block is 1024 bytes).
We are to store each database table as a randomly accessible collection of blocks on the hard disk, where the first block is a file header, dedicated to meta data (block 0), and each subsequent block is dedicated to storing the rows of the table. The blocks are to be written to the hard disk as files. We are also to have one block as an "in-memory" buffer; we can read and edit the data in the buffer, and when we are ready, we write the in-memory buffer back to disk.
I think I am OK conceptualizing the in-memory buffer, but I am having trouble figuring out how to write the blocks of memory to files. I have two ideas, each with its own difficulties:
Idea 1
Create a class MemoryBlock that is exactly 1024 bytes. Each MemoryBlock can store arbitrary data (file header or rows of the table). Store each table as a single file by writing the array of MemoryBlocks to the file.
Difficulty:
Can I update a single block in the middle of the file? It is my understanding that files must be overwritten or appended to. If I have a file of 3 MemoryBlocks (blocks 0-2), and I want to update a row that is in block 1, can I just pull the block 1 into my buffer, edit it, and write it back to the middle of the file, or would I have to pull the entire file into memory, edit what I want to, and then overwrite the original file?
Idea 2
Store each block as a separate file on disk. This would allow me to randomly access any block and write it back to disk without having to worry about the rest of the table.
Difficulty: I'm not sure if this is really enforcing the 1024 byte block size. Is there any way to require that each file does not exceed 1024 bytes?
I am not married to either idea, but I am appreciative of any input that helps me better understand block storage in database management systems.
Edit: As @zaufi points out, 1024-byte block sizes are very atypical. I meant to type 4096-byte blocks when writing this.
Ohh man, you definitely need to read something about database internals...
Here are my 5 cents: both ideas are bad! Why did you decide to use 1024-byte blocks? The physical sector size of a modern HDD is 4096 bytes, and disk controllers have caches of 4M-6M-8M-16M-..., so writing 1K at a time is just a waste of resources...
And btw, updating something in the middle of a file is usually a bad idea... but if performance is not your concern, you can definitely do it...
Before reinventing the wheel, try to research the typical approaches used in various DBMSs...
One more good (simple) source to read: google leveldb and friends -- this will definitely give you ideas!
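To the question in Idea 1: yes, a single block can be rewritten in place with random-access I/O. Here is a minimal sketch under the question's assumptions (4096-byte blocks, one file per table named table.db here for illustration); it is not a full buffer-manager design:

#include <cstddef>
#include <fstream>
#include <vector>

constexpr std::size_t kBlockSize = 4096;  // the question's (corrected) block size

// Read block `index` from the table file into an in-memory buffer.
std::vector<char> read_block(std::fstream &file, std::size_t index) {
    std::vector<char> buf(kBlockSize);
    file.seekg(static_cast<std::streamoff>(index * kBlockSize), std::ios::beg);
    file.read(buf.data(), kBlockSize);
    return buf;
}

// Write the buffer back over block `index` without touching the rest of the file.
void write_block(std::fstream &file, std::size_t index, const std::vector<char> &buf) {
    file.seekp(static_cast<std::streamoff>(index * kBlockSize), std::ios::beg);
    file.write(buf.data(), kBlockSize);
    file.flush();
}

int main() {
    // Open an existing table file for both reading and writing in binary mode.
    std::fstream table("table.db", std::ios::in | std::ios::out | std::ios::binary);
    std::vector<char> block = read_block(table, 1);  // pull block 1 into the buffer
    block[0] = 'X';                                  // edit it in memory
    write_block(table, 1, block);                    // write it back in place
    return 0;
}

Because the stream is opened for both input and output, only the 4096 bytes of block 1 are overwritten; the rest of the file is untouched.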