How can I partition an Arrow Table by value in one pass?

I would like to be able to partition an Arrow table by the values of one of its columns (assuming the set of n values occurring in that column is known). The straightforward way is a for-loop: for each of these values, scan the whole table and build a new table of matching rows. Are there ways to do this in one pass instead of n passes?
I initially thought that Arrow's support for group-by scans would be the solution -- but Arrow (in contrast to Pandas) does not support extracting groups after a group-by scan.
Am I just thinking about this wrong and there is another way to partition a table in one pass?

For the group by support, there is a "hash_list" function that returns all values in the group. Is that what you're looking for? You could then slice the resulting values after-the-fact to extract the individual groups.
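A minimal pyarrow sketch of that idea, assuming a recent pyarrow (>= 7, where Table.group_by exposes the hash_list aggregation); the table and column names are illustrative:

import pyarrow as pa

tbl = pa.table({
    "key": ["a", "b", "a", "c", "b"],
    "val": [1, 2, 3, 4, 5],
})

# One hash-aggregation pass: gather every non-key column into a
# per-group list via the "list" (hash_list) aggregation.
value_cols = [c for c in tbl.column_names if c != "key"]
grouped = tbl.group_by("key").aggregate([(c, "list") for c in value_cols])

# Unwrap each group's list values back into a small per-group table.
partitions = {}
for i in range(grouped.num_rows):
    key = grouped["key"][i].as_py()
    arrays = [grouped[c + "_list"][i].values for c in value_cols]
    partitions[key] = pa.table(arrays, names=value_cols)

print({k: v.num_rows for k, v in partitions.items()})

The group_by does the single hash pass; the unwrapping afterwards only slices the already-grouped values.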

Related

DynamoDB Query distinct attribute values

I'm trying to query DynamoDB and get a result similar to select distinct(address) from ... in SQL.
I know DynamoDB is a document-oriented DB and maybe I need to change the data structure.
I'm trying to avoid getting all the data first and filtering later.
My data looks like this:
Attribute | Datatype
----------|---------
ID        | String
Var1      | Map
VarN      | Map
Address   | String
So I want to get the distinct addresses in the entire table.
What is the best way to do it?
Unfortunately, no. You'll need to Scan the entire table (you can use the ProjectionExpression or AttributesToGet options to ask for just the "Address" attribute, but you will still pay for scanning the entire contents of the table).
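A minimal boto3 sketch of that scan, deduplicating client-side (the table name is illustrative; the attribute name is aliased in case it collides with a reserved word):

import boto3

table = boto3.resource("dynamodb").Table("MyTable")  # hypothetical name

addresses = set()
kwargs = {
    "ProjectionExpression": "#addr",
    "ExpressionAttributeNames": {"#addr": "Address"},
}
# Scan returns paginated results; follow LastEvaluatedKey to the end.
while True:
    page = table.scan(**kwargs)
    addresses.update(i["Address"] for i in page["Items"] if "Address" in i)
    if "LastEvaluatedKey" not in page:
        break
    kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

print(addresses)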
If you need to do this scan often, you can add a secondary index which projects only the keys and the "Address" attribute, to make the scan cheaper. But unfortunately, using a GSI whose partition key is the "Address" does not give you the ability to eliminate duplicates: each partition will still contain a list of duplicate items, and there is no way to list just the distinct partition keys in an index. Scanning the index will give you the same partition key multiple times, once for each item in that partition.

How to get row count for large dataset in Informatica?

I am trying to get the row count for a dataset with 280 fields without affecting performance. I am looking for the best possible way to do this.
The best option to avoid performance issues is to use a Sorter transformation to sort the data, then pass the pipeline to an Aggregator transformation. In the Aggregator transformation, check the Sorted Input option.
If your source is a database, index the columns used in your conditions and also partition the table if required.
For your solution, I have two options in mind:
1. Using an Aggregator (remember to use a predefined sort order to improve performance with the next transformation): SQ > Aggregator > Target. Inside the Aggregator, add new ports with the sum() and/or count() functions, and remember to select the columns to group by.
Check out this example:
https://www.guru99.com/aggregator-transformation-informatica.html
2. Using a Source Qualifier query override: use a traditional SELECT COUNT/SUM with GROUP BY against the database, then SQ > Target.
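For option 2, the override itself is just SQL pushed down to the source database. As a loose stand-alone illustration of what it computes, via a generic DB-API connection (pyodbc, the DSN, and the table name are all hypothetical, not part of the mapping):

import pyodbc

# Hypothetical DSN and table name; this mimics the count the
# Source Qualifier override would push to the database.
conn = pyodbc.connect("DSN=my_source_db")
cur = conn.cursor()
cur.execute("SELECT COUNT(*) AS row_count FROM my_source_table")
print(cur.fetchone()[0])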
By the way, Informatica handles performance very well; more than the number of columns, you need to review how many records you are processing. A best practice is always to push the heavy lifting to the data source/database rather than the Informatica app.
Regards,
Juan
If all you need is to count the rows, use the Aggregator. That's what it's for. However, this will create a cache; to limit its size, use a single port.
To avoid caching, you can use a variable in an Expression transformation and just increment it. However, this will give you an extra column with all rows numbered, not just a single value, so you'll still need to aggregate it. Here it would be possible to use an Aggregator with no function to return just the last value.

Querying DynamoDB with a partition key and list of specific sort keys

I have a DynamoDB table that for the sake of this question looks like this:
id (String partition key)
origin (String sort key)
I want to query the table for a subset of origins under a specific id.
From my understanding, the only operators DynamoDB allows on sort keys in a Query are 'between', 'begins_with', '=', '<=' and '>='.
The problem is that my query needs a form of 'CONTAINS' because the 'origins' list is not necessarily ordered (for a between operator).
If this was SQL it would be something like:
SELECT * from Table where id={id} AND origin IN {origin_list}
My exact question is: what do I need to do to achieve this functionality in the most efficient way? Should I change my table structure? Maybe add a GSI? I'm open to suggestions.
I am aware that this can be achieved with a Scan operation but I want to have an efficient query. Same goes for BatchGetItem, I would rather avoid that functionality unless absolutely necessary.
Thanks
This is a case for using Filter Expressions for Query, which support an IN operator:
From Comparison Operator:
a IN (b, c, d) — true if a is equal to any value in the list — for example, any of b, c or d. The list can contain up to 100 values, separated by commas.
However, you cannot use condition expressions on key attributes.
From Filter Expressions for Query:
A filter expression cannot contain partition key or sort key attributes. You need to specify those attributes in the key condition expression, not the filter expression.
So, what you could do is stop using origin as a sort key (or duplicate it into another, non-key attribute) and filter on it after the query. Of course, the filter first reads all the items that have that id and filters them afterwards, which consumes read capacity and is less efficient, but there is no other way to express this as a Query. Depending on your item sizes, query frequency, and estimated number of returned items, BatchGetItem could be a better choice.
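A minimal boto3 sketch of that duplicate-attribute variant (origin_copy is a hypothetical non-key duplicate of origin; the table name and values are illustrative):

import boto3
from boto3.dynamodb.conditions import Key, Attr

table = boto3.resource("dynamodb").Table("MyTable")  # hypothetical name
origin_list = ["web", "mobile", "batch"]

# Query by partition key, then filter on the non-key copy of origin.
# The filter trims the response, not the read capacity consumed.
resp = table.query(
    KeyConditionExpression=Key("id").eq("some-id"),
    FilterExpression=Attr("origin_copy").is_in(origin_list),
)
print(resp["Items"])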

Redshift: Does key-based distribution optimize equality filters?

This documentation describes key-distribution in redshift as follows:
The rows are distributed according to the values in one column. The leader node will attempt to place matching values on the same node slice. If you distribute a pair of tables on the joining keys, the leader node collocates the rows on the slices according to the values in the joining columns so that matching values from the common columns are physically stored together.
I was wondering if key-distribution additionally helps in optimizing equality filters. My intuition says it should but it isn't mentioned anywhere.
Also, I saw documentation regarding sort keys which says that to select a sort key:
Look for columns that are used in range filters and equality filters.
This confused me, since sort keys are explicitly mentioned as a way to optimize equality filters.
I am asking this because I already have a candidate sort-key on which I will be doing range queries. But I also want to have quick equality filters on another column which is a good distribution key in my case.
It is a very bad idea to be filtering on a distribution key, especially if your table / cluster is large.
The reason is that the filter may be running on just one slice, in effect running without the benefit of MPP.
For example, if you have a dist key of "added_date", you may find that all of the rows for each recent date sit together on one slice.
You will then have the majority of queries filtering for recent ranges of added_date, and these queries will be concentrated and will saturate that one slice.
The simple rule is:
Use DISTKEY for the column most commonly joined
Use SORTKEY for fields most commonly used in a WHERE statement
There actually are benefits to using the same field for SORTKEY and DISTKEY. From Choose the Best Sort Key:
If you frequently join a table, specify the join column as both the sort key and the distribution key.
This enables the query optimizer to choose a sort merge join instead of a slower hash join. Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of the sort merge join.
Feel free to do some performance tests -- create a few different versions of the table, and use INSERT or SELECT INTO to populate them. Then, try common queries to see how they perform.
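For instance, a rough sketch of such a test using psycopg2 and Redshift's CREATE TABLE ... AS syntax (the connection string, table, and column names are all illustrative):

import psycopg2

# Two hypothetical layouts of the same data to compare side by side.
variants = {
    "events_dist_customer": "DISTKEY (customer_id) SORTKEY (event_date)",
    "events_dist_date": "DISTKEY (event_date) SORTKEY (event_date)",
}

with psycopg2.connect("dbname=dev host=my-cluster") as conn:
    with conn.cursor() as cur:
        for name, keys in variants.items():
            cur.execute(
                f"CREATE TABLE {name} {keys} AS SELECT * FROM events_source"
            )
# Then time representative queries against each variant, e.g.
# SELECT count(*) FROM events_dist_customer WHERE customer_id = 123;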

How to Scan HBase Rows efficiently

I need to write a MapReduce job that gets all rows in a given date range (say, the last month). It would have been a cakewalk had my row key started with the date, but my frequent HBase queries are on the leading values of the key.
My row key is exactly A|B|C|20120121|D, where the combination of A/B/C along with the date (in YearMonthDay format) makes a unique row ID.
My HBase tables could have up to a few million rows. Should my mapper read the entire table and filter each row against the date range, or can a Scan/Filter help handle this situation?
Could someone suggest a way (or a snippet of code) to handle this situation effectively?
Thanks
-Panks
A RowFilter with a regex comparator would work, but would not be the optimal solution. Alternatively, you can try to use secondary indexes.
One more option is the FuzzyRowFilter. It uses a kind of fast-forwarding, skipping many rows in the overall scan process, and will thus be faster than a plain RowFilter scan. You can read more about it here.
Alternatively, Bloom filters might also help, depending on your schema. If your data is huge, you should do a comparative analysis of secondary indexes and Bloom filters.
You can use a RowFilter with a RegexStringComparator. You'd need to come up with a RegEx that filters your dates appropriately. This page has an example that includes setting a Filter for a MapReduce scanner.
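Since the awkward part is expressing a date range as a regex, here is a small sketch that builds an alternation pattern for row keys shaped like A|B|C|YYYYMMDD|D (the key layout comes from the question; the helper itself is hypothetical):

from datetime import date, timedelta

def date_range_regex(start, end):
    """Match the 4th pipe-delimited field of the row key against a range."""
    days = []
    d = start
    while d <= end:
        days.append(d.strftime("%Y%m%d"))
        d += timedelta(days=1)
    return r"^[^|]*\|[^|]*\|[^|]*\|(" + "|".join(days) + r")\|"

# January 2012 -> ^[^|]*\|[^|]*\|[^|]*\|(20120101|...|20120131)\|
print(date_range_regex(date(2012, 1, 1), date(2012, 1, 31)))

The resulting string can then be handed to a RegexStringComparator on the Java side.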
I am just getting started with HBase, but Bloom filters might help.
You can modify the Scan that you send into the Mapper to include a filter. If your date is also the record timestamp, it's easy:
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;

// Restrict the scan to cells whose timestamps fall within [minTime, maxTime).
Scan scan = new Scan();
scan.setTimeRange(minTime, maxTime);
TableMapReduceUtil.initTableMapperJob("mytable", scan, MyTableMapper.class,
    OutputKey.class, OutputValue.class, job);
If the date in your row key is different, you'll have to add a filter to your scan. This filter can operate on a column or a row key. I think it's going to be messy with just the row key. If you put the date in a column, you can make a FilterList where all conditions must be true and use a CompareOp.GREATER and a CompareOp.LESS. Then use scan.setFilter(filterList) to add your filters to the scan.
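As a loose illustration only: if you reach HBase through the Thrift gateway, an equivalent column-based range filter can be expressed in the HBase filter language via happybase. This assumes the date is also stored as YYYYMMDD text in a hypothetical cf:date column (the host and table names are illustrative too):

import happybase

# Hypothetical Thrift gateway host and table name.
conn = happybase.Connection("thrift-gateway-host")
table = conn.table("mytable")

# Both conditions must pass, mirroring a MUST_PASS_ALL FilterList;
# lexicographic comparison works because dates are zero-padded YYYYMMDD.
flt = ("SingleColumnValueFilter('cf', 'date', >=, 'binary:20120101') AND "
       "SingleColumnValueFilter('cf', 'date', <=, 'binary:20120131')")
for key, data in table.scan(filter=flt):
    print(key, data)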