Joiner data and index cache - Informatica

Assuming that the Master Pipeline has 5 fields, two of which are part of the join condition and that all fields are connected downstream to the next transformation, how many fields are in the index and data cache files?
I am confused between the 2 answers below. Please check.
5 fields are in the index cache and 3 fields in the data cache.
OR
2 fields are in the index cache and 3 fields in the data cache.

Answer 2 is correct. The index cache holds the columns used in the join condition, and the remaining connected output columns go to the data cache. Sequence-wise, it first populates the master data into the cache and then compares the detail data against that cached master data. I did some testing and this is the result -
With 2 join conditions and 2 output ports, the session log result is -
The index cache size that would hold [99] input rows from the master for [JNR_Level1], in memory, is [121856] bytes
The data cache size that would hold [99] input rows from the master for [JNR_Level1], in memory, is [62568] bytes
With 2 join conditions and 1 output port, the session log result is -
The index cache size that would hold [99] input rows from the master for [JNR_Level1], in memory, is [121856] bytes
The data cache size that would hold [99] input rows from the master for [JNR_Level1], in memory, is [1608] bytes
With 1 join condition and 4 output ports, the session log result is -
The index cache size that would hold [99] input rows from the master for [JNR_Level1], in memory, is [61952] bytes
The data cache size that would hold [99] input rows from the master for [JNR_Level1], in memory, is [183744] bytes
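To make the counting concrete, here is a tiny illustrative sketch in Python (not anything Informatica runs) of how the connected fields split between the two caches under the rule above; the field names are made up for the example.

    def split_joiner_fields(connected_fields, join_condition_fields):
        """Index cache keeps the join-condition columns; the remaining
        connected output columns go to the data cache."""
        index_cache = [f for f in connected_fields if f in join_condition_fields]
        data_cache = [f for f in connected_fields if f not in join_condition_fields]
        return index_cache, data_cache

    # 5 connected master fields, 2 of them in the join condition (hypothetical names)
    fields = ["cust_id", "order_id", "name", "city", "amount"]
    join_fields = ["cust_id", "order_id"]
    idx, data = split_joiner_fields(fields, join_fields)
    print(len(idx), len(data))   # -> 2 3, i.e. answer 2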
Koushik


Calculating RCUs for small objects in DynamoDB

Say we have a table with average item size of 1 KB. We perform a query which reads 3 such items. Now according to what I have read, the number of RCUs should be (strongly consistent reads) :
(Number of items read) * ceil(item_size/4) = 3 * ceil(1/4) = 3*1 = 3.
So I wanted to confirm: is this correct? Or do we use a single RCU, since the total size of the items read is 3 KB, which is less than 4 KB?
An RCU is good for 1 strongly consistent read of up to 4KB.
Thus you can query() four 1KB items for 1 RCU.
Since you have only 3 to read, 1 RCU will be consumed.
Using GetItem() to get those same 3 records would cost 3 RCU.
Let's say you had 100 items that matched the query (HK+SK), but you're also using a filter to further select the records to be returned, so you're only getting 4 records back. That query would still consume 25 RCUs, as the records have to be read even if they are not returned.
Reference can be found here:
Query—Reads multiple items that have the same partition key value. All items returned are treated as a single read operation, where DynamoDB computes the total size of all items and then rounds up to the next 4 KB boundary. For example, suppose your query returns 10 items whose combined size is 40.8 KB. DynamoDB rounds the item size for the operation to 44 KB. If a query returns 1500 items of 64 bytes each, the cumulative size is 96 KB.
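As a back-of-the-envelope check, here is a small Python sketch of the two billing rules described in this answer: GetItem rounds each item up to the 4 KB unit individually, while Query sums the sizes of all items it reads and rounds up once.

    import math

    RCU_UNIT_KB = 4  # one strongly consistent read unit covers up to 4 KB

    def rcu_get_item(item_sizes_kb):
        """Each GetItem is billed separately, rounded up to the 4 KB unit."""
        return sum(math.ceil(size / RCU_UNIT_KB) for size in item_sizes_kb)

    def rcu_query(item_sizes_kb):
        """A Query sums the sizes of all items it reads, then rounds up once."""
        return math.ceil(sum(item_sizes_kb) / RCU_UNIT_KB)

    three_items = [1, 1, 1]              # three 1 KB items
    print(rcu_query(three_items))        # -> 1
    print(rcu_get_item(three_items))     # -> 3
    print(rcu_query([1] * 100))          # -> 25, matching the filter example above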

Why is the AWS file size different between Redshift and S3?

I'm UNLOADing tables from Redshift to S3 for backup. So I am checking to make sure the files are complete if we need them again.
I just did UNLOAD on a table that has size = 1,056 according to:
select "table", size, tbl_rows
FROM svv_table_info;
According to the documentation, the size is "in 1 MB data blocks", so this table is using 1,056 MB. But after copying to S3, the file size is 154 MB (viewing in AWS console).
I copied back to Redshift and all the rows are there, so this has to do with "1 MB data blocks". This is related to how it's saved in the file system, yes?
Can someone please explain? Thank you.
So you're asking why the SVV_TABLE_INFO view claims that the table consumes 1 GB, but when you dump it to disk the result is only 154 MB?
There are two primary reasons. The first is that you're actively updating the table but not vacuuming it. When a row is updated or deleted, Redshift actually appends a new row (yes, stored as columns) and tombstones the old row. To reclaim this space, you have to regularly vacuum the table. While Redshift will do some vacuuming in the background, this may not be enough, or it may not have happened at the time you're looking.
The second reason is that there's overhead required to store table data. Each column in a table is stored as a list of 1 MB blocks, one block per slice (and multiple slices per node). Depending on the size of your cluster and the column data type, this may lead to a lot of wasted space.
For example, if you're storing 32-bit integers, one 1 MB block can store 262,144 such integers, requiring a total of 4 blocks to store 1,000,000 values (which is probably close to the number of rows in your table). But if you have a 4-node cluster with 2 slices per node (i.e., dc2.large), then you'll actually require at least 8 blocks, because the column is partitioned across all slices.
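Here is a rough sketch of that block arithmetic in Python, assuming an uncompressed 4-byte integer column distributed evenly across slices; real Redshift storage also depends on the compression encoding, so treat the result as a lower bound.

    import math

    BLOCK_BYTES = 1024 * 1024   # Redshift stores each column in 1 MB blocks

    def min_blocks_per_column(num_rows, bytes_per_value, num_slices):
        """Lower bound on blocks for one column: each slice holds its share of
        the rows and needs at least one block, rounded up per slice."""
        rows_per_slice = math.ceil(num_rows / num_slices)
        blocks_per_slice = math.ceil(rows_per_slice * bytes_per_value / BLOCK_BYTES)
        return blocks_per_slice * num_slices

    # 1,000,000 32-bit integers on a 4-node dc2.large cluster (2 slices/node = 8 slices)
    print(min_blocks_per_column(1_000_000, 4, 1))   # -> 4 blocks on a single slice
    print(min_blocks_per_column(1_000_000, 4, 8))   # -> 8 blocks across 8 slices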
You can see the number of blocks that each column uses in STV_BLOCKLIST.

How does a MapReduce program process data fragmented between two nodes?

I want to understand how a MapReduce job processes data blocks. As per my understanding, one mapper is invoked for each data block.
Let me illustrate my question with an example.
Suppose I have a long text file stored in HDFS in 4 blocks (64 MB each) on 4 nodes (let's forget about replication here).
In this case, 4 map tasks will be invoked, one on each machine (all 4 data nodes/machines).
Here is my question: this splitting may result in a partial record being stored across two blocks. For example, the last record may be stored partially at the end of block 1, with the remaining part in block 2.
In this case, how does the MapReduce program ensure that the complete record gets processed?
I hope I have been able to explain my question.
Please read this article at http://www.hadoopinrealworld.com/inputsplit-vs-block/
This is an excerpt from the same article I posted above.
A block is a hard division of data at the block size. So if the block size in the cluster is 128 MB, each block for the dataset will be 128 MB except for the last block which could be less than the block size if the file size is not entirely divisible by the block size. So a block is a hard cut at the block size and blocks can end even before a logical record ends.
Consider that the block size in your cluster is 128 MB and each logical record in your file is about 100 MB (yes, huge records).
So the first record will fit perfectly in the block, no problem, since the record size of 100 MB is well within the block size of 128 MB. However, the 2nd record cannot fit entirely in the block, so record number 2 will start in block 1 and end in block 2.
If you assign a mapper to block 1, in this case, the Mapper cannot process record 2 because block 1 does not have the complete record 2. That is exactly the problem InputSplit solves. In this case, InputSplit 1 will have both record 1 and record 2. InputSplit 2 does not start with record 2, since record 2 is already included in InputSplit 1, so InputSplit 2 will have only record 3. As you can see, record 3 is divided between blocks 2 and 3, but InputSplit 2 will still have the whole of record 3.
Blocks are physical chunks of data stored on disk, whereas an InputSplit is not a physical chunk of data. An InputSplit is a Java class with pointers to start and end locations in blocks. So when a Mapper tries to read the data, it knows exactly where to start reading and where to stop. The start location of an InputSplit can be in one block and its end in another.
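As a small illustration (plain Python, not Hadoop code), here is how the blocks touched by each record can be worked out for the numbers above: blocks are hard cuts at 128 MB, while a record belongs to the InputSplit that contains its start.

    BLOCK_MB  = 128
    RECORD_MB = 100                      # three 100 MB logical records, as above
    records = [(i * RECORD_MB, (i + 1) * RECORD_MB) for i in range(3)]

    def blocks_spanned(start, end, block_size=BLOCK_MB):
        """1-based block numbers a record touches under hard cuts at block_size."""
        return [b + 1 for b in range(start // block_size, (end - 1) // block_size + 1)]

    for n, (start, end) in enumerate(records, 1):
        print(f"record {n}: MB {start}-{end}, blocks {blocks_spanned(start, end)}")

    # record 1: blocks [1]     -> whole record in block 1
    # record 2: blocks [1, 2]  -> starts in block 1, ends in block 2 (InputSplit 1)
    # record 3: blocks [2, 3]  -> starts in block 2, ends in block 3 (InputSplit 2)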
Happy Hadooping

DynamoDB - How to calculate the read throughput for queries

Consider the following case:
I have a table with read and write capacity both set to 100. Assume the table has 10000 entries and each entry is 0.5KB.
With this, I can read 100 records of 4KB each and write 100 records of 1KB each per second.
From the AWS docs
You can use the Query and Scan operations to retrieve multiple
consecutive items from a table or an index, in a single request. With
these operations, DynamoDB uses the cumulative size of the processed
items to calculate provisioned throughput. For example, if a Query
operation retrieves 100 items that are 1 KB each, the read capacity
calculation is not (100 × 4 KB) = 100 read capacity units, as if those
items had been retrieved individually using GetItem or BatchGetItem.
Instead, the total would be only 25 read capacity units, as shown
following:
(100 * 1024 bytes = 100 KB) / 4 KB = 25 read capacity units
I want to issue a query (using the hash key, with the range key unspecified) and it'll retrieve, say, 1000 items.
So the cumulative size of the returned records is 1000 * 0.5KB = 500KB.
Question:
Should the read throughput be 500/4 = 125?
Or is 100 (or anything around 80) sufficient, because the Query is not going to complete in one second?
How can I determine the throughput for this (Query) case?
Thanks..
When you run a query or a scan, you consume reads based on the size of the data scanned or queried, not the number of records. If you query 500 KB using strongly consistent reads, it will consume 125 read capacity units.
There is an option ReturnConsumedCapacity that will return the consumed read capacity along with your data.
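For example, with boto3 you can ask DynamoDB to report the capacity a query actually consumed; the table name and key names below are placeholders.

    import boto3
    from boto3.dynamodb.conditions import Key

    # Hypothetical table and key names, just to show the ReturnConsumedCapacity option
    table = boto3.resource("dynamodb").Table("my-table")

    response = table.query(
        KeyConditionExpression=Key("hash_key").eq("some-value"),
        ConsistentRead=True,
        ReturnConsumedCapacity="TOTAL",
    )

    print(len(response["Items"]))
    print(response["ConsumedCapacity"]["CapacityUnits"])  # e.g. 125.0 for ~500 KB read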

Does the number of map tasks spawned depend on the number of job nodes?

The number of map() tasks spawned is equal to the number of 64 MB blocks of input data. Suppose we have 2 input files of 1 MB each; both files will be stored in a single block. But when I run my MR program with 1 namenode and 2 job nodes, I see 2 map() tasks spawned, one for each file. So is this because the system tried to split the job between the 2 nodes, i.e.,
Number of map() tasks spawned = number of 64 MB blocks of input data * number of job nodes?
Also, in the MapReduce tutorial, it's written that for a 10 TB file with a block size of 128 MB, 82,000 maps will be spawned. However, according to the logic that the number of maps depends only on the block size, 78,125 maps should be spawned (10 TB / 128 MB). I don't understand how the few extra maps come about. It would be great if anyone could share their thoughts on this. Thanks. :)
By default, one mapper per input file is spawned, and if the size of an input file is greater than the split size (which is normally kept the same as the block size), then the number of mappers for that file will be ceil(file size / split size); a small sketch of this calculation follows the example below.
Now say you have 5 input files and the split size is kept at 64 MB:
file1 - 10 MB
file2 - 30 MB
file3 - 50 MB
file4 - 100 MB
file5 - 1500 MB
number of mappers launched:
file1 - 1
file2 - 1
file3 - 1
file4 - 2
file5 - 24
total mappers - 29
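A quick sketch of that arithmetic in Python, assuming every file is splittable and the split size equals the 64 MB block size:

    import math

    SPLIT_MB = 64

    def mappers_for(file_sizes_mb, split_mb=SPLIT_MB):
        """One mapper per file, plus extra mappers when a file exceeds the split size."""
        return {name: max(1, math.ceil(size / split_mb))
                for name, size in file_sizes_mb.items()}

    files = {"file1": 10, "file2": 30, "file3": 50, "file4": 100, "file5": 1500}
    counts = mappers_for(files)
    print(counts)                 # {'file1': 1, 'file2': 1, 'file3': 1, 'file4': 2, 'file5': 24}
    print(sum(counts.values()))   # 29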
Additionally, the input split size and block size are not always honored. If an input file is gzipped, it is not splittable, so if one of the gzip files is 1500 MB, it will not be split. It is better to use block compression with Snappy or LZO along with the sequence file format.
Also, the input split size is not used if the input is an HBase table. In the case of an HBase table, the only splitting done is to maintain the correct region size for the table. If the table is not properly distributed, manually split it into multiple regions.
The number of mappers depends on just one thing: the number of InputSplits created by the InputFormat you are using (the default is TextInputFormat, which creates splits using \n as the delimiter). It does not depend on the number of nodes or on the file or block size (64 MB or whatever). It's very good if the split is equal to the block, but this is just an ideal situation and cannot be guaranteed always.
The MapReduce framework tries its best to optimise the process, and in this process things like creating just 1 mapper for the entire file happen (if the file size is less than the block size). Another optimization could be to create a smaller number of mappers than the number of splits. For example, if your file has 20 lines and you are using TextInputFormat, then you might think that you'll get 20 mappers (as no. of mappers = no. of splits, and TextInputFormat creates splits based on \n). But this does not happen; there would be unwanted overhead in creating 20 mappers for such a small file.
And if the size of a split is greater than the block size, the remaining data is pulled in from the other block, possibly on a different machine, in order to be processed.
About the MapReduce tutorial: if you have 10 TB of data, then
(10 * 1024 * 1024) / 128 = 81,920 mappers, which is almost 82,000.
The 78,125 figure comes from using decimal units (10 * 1,000,000 MB / 128 MB); the tutorial's 82,000 comes from binary units, as above.
Hope this clears some things up.