Appropriate data for FP-Growth and association rules

I want to choose an appropriate dataset for running FP-Growth and extracting association rules. I know that relational and transactional datasets are appropriate for this task, but in general, what kind of dataset is appropriate?

You need item sets. No duplicates allowed, no order.
E.g. butter, milk, bread - it does not matter how much milk.
It is also advisable to work with product categories rather than individual items, i.e. any kind of milk is treated as the same item.
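For illustration, here is a minimal sketch in Python using the mlxtend library (the library choice and the toy transactions are my assumptions; any FP-Growth implementation that takes one-hot-encoded itemsets works the same way):

# A minimal sketch, assuming mlxtend and pandas are installed
# (pip install mlxtend pandas). Transactions are toy data.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

# Each transaction is an itemset: no duplicates, no order, no quantities.
transactions = [
    ["butter", "milk", "bread"],
    ["milk", "bread"],
    ["butter", "bread"],
    ["milk", "bread", "eggs"],
]

# One-hot encode into the boolean item-per-column table FP-Growth expects.
te = TransactionEncoder()
onehot = te.fit(transactions).transform(transactions)
df = pd.DataFrame(onehot, columns=te.columns_)

# Mine frequent itemsets, then derive association rules from them.
frequent = fpgrowth(df, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])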

AWS Personalize items attributes

I'm trying to implement personalization and I'm having problems with the Items schema.
Imagine I'm Amazon: I have products, their brands, and their categories. What kind of Items schema should I include this information in?
Should I include the brand name as a string categorical field? Should I rather include the brand ID as a string or numeric field? Or should I include both?
What about categories? I have the same questions.
Metadata Fields
Metadata includes string or non-string fields that aren't required or don't use a reserved keyword. Metadata schemas have the following restrictions:
Users and Items schemas require at least one metadata field.
Users and Interactions datasets can contain up to five metadata fields. An Items dataset can contain up to 50 metadata fields.
If you add your own metadata field of type string, it must include the categorical attribute. Otherwise, Amazon Personalize won't use the field when training a model.
https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html
There are simply two ways to include your metadata in Items/Users datasets:
If it can be represented as a numeric value, then provide the actual value where it makes sense.
If it can be represented as a string, then provide the string value and make sure that categorical is set to true.
But let's take a look at "Why do they need me to categorize my string metadata?". The answer is pretty simple.
Let's start with an example.
If your Items were Amazon.com products and you wanted to provide a ratings metadata field, then:
You could take all of the ratings, including the full review text sent by customers, and simply put them in the metadata field.
You could take just the star ratings, calculate the average, and put that in the metadata field.
The second one probably makes more sense in general. Having long, free-form product reviews as metadata changes pretty much nothing. Personalize doesn't understand whether a review is good or bad, or whether the author also recommends another product, so it doesn't really add anything to the recommendations.
However, if you reduce the data and calculate the average rating, as in the second option, it makes a lot more sense. Maybe some of your customers like shoddy products? Maybe they want to buy them because they are famous YouTubers who make videos about them? Based on their previous interactions and much more, Personalize will be able to perform slightly better, because now it knows that this product has a rating of 5/5 or 3/5.
I wanted to show you that in some cases providing Items metadata as a free-form string makes no sense. That's why your string metadata must be categorical: it should be a finite set of values, so that it adds some knowledge for Personalize about a given Item and why some people might want to interact with it.
Going back to your question:
Should I include the brand name as a string categorical field? Should I rather include the brand ID as a string or numeric field? Or should I include both?
I would simply go with the brand ID as a string. You could also go with the brand name, but a brand can be renamed while still being the same brand, so the ID is more stable. Also, two different brands could share the same name because they operate in different markets, so using the ID solves that as well.
The "categorical": true switch in your schema just tells Personalize:
Hey, do you see that string field? It's a categorical, finite set of values. If you train a model for me, please include it during training, it's important!
And as the documentation says, if you provide a string metadata field that is not marked as categorical, Personalize will "think":
Hmm, this field is a string, it has pretty random values, and it's not marked as categorical. It's probably just a leftover from an Items export job. Let's ignore it.
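Putting this together, a hypothetical Items schema for this question, registered with boto3, might look like the sketch below. ITEM_ID is the required field; BRAND_ID, CATEGORY_ID, AVG_RATING, and the schema name are illustrative assumptions:

# A sketch of an Items schema with the brand ID as a categorical string.
# Field names other than ITEM_ID and the schema name are hypothetical.
import json
import boto3

items_schema = {
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "ITEM_ID", "type": "string"},
        # Brand ID as a string, marked categorical so Personalize uses it.
        {"name": "BRAND_ID", "type": "string", "categorical": True},
        {"name": "CATEGORY_ID", "type": "string", "categorical": True},
        # Average star rating as a plain numeric field.
        {"name": "AVG_RATING", "type": "float"},
    ],
    "version": "1.0",
}

personalize = boto3.client("personalize")
response = personalize.create_schema(
    name="items-schema",
    schema=json.dumps(items_schema),
)
print(response["schemaArn"])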

How to get row count for large dataset in Informatica?

I am trying to get the row count for a dataset with 280 fields without affecting performance. Looking for the best possible way to do this.
The best option to avoid performance issues is to use a Sorter transformation to sort the columns and pass the pipeline to an Aggregator transformation. In the Aggregator transformation, check the Sorted Input option.
If your source is a database, index the columns used in your conditions and partition the table if required.
For your solution, I have in mind 2 options:
Using an Aggregator (remember to use a predefined ORDER BY to improve performance in the next transformation): SQ > Aggregator > Target. Inside the Aggregator, add new ports with the SUM() and/or COUNT() functions. Remember to select the columns to group by.
Check out this example:
https://www.guru99.com/aggregator-transformation-informatica.html
Using a Source Qualifier query override. Use a traditional SELECT COUNT/SUM with GROUP BY against the database: SQ > Target. (See the sketch after this list.)
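For reference, the query-override option amounts to pushing the count to the database so that no wide 280-column rows travel through the pipeline. A minimal sketch of that idea outside Informatica (sqlite3 is just a stand-in for your actual database driver, and my_table is a hypothetical table name):

# Push the count to the database: one aggregate query, one row back.
# sqlite3 stands in for your real driver; my_table is hypothetical.
import sqlite3

conn = sqlite3.connect("example.db")
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM my_table")
row_count = cur.fetchone()[0]
print(f"row count: {row_count}")
conn.close()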
By the way, Informatica handles performance very well; more than the number of columns, you need to review how many records you are processing. A best practice is always to push the load onto the data source/database rather than the Infa app.
Regards,
Juan
If all you need is to count the rows, use the Aggregator. That's what it's for. However, this will create a cache; to limit its size, pass it a single port only.
To avoid caching, you can use a variable port in an Expression transformation and just increment it. However, this will give you an extra column with all rows numbered, not a single value. You'll still need to aggregate it; here you could use an Aggregator with no function, so it returns just the last row's value.

Merge cells with similar but different data, different spelling

I am trying Tableau with data extracted from Salesforce. The input includes a "Country" field where rows have different spellings for the same thing.
Example: Cananda, CANADA, CAnada etc.
Is there a way to fix this in Tableau?
The easiest solution is to create a group field based on your Country field.
Select Country in the data pane on the left sidebar, right-click, and choose Create Group. Select the elements you want to group together and put them into a single group, say Canada, that contains all the spelling variations.
The new group field is initially named Country (group). You may want to rename it Country_Corrected. (Or, even better, rename the original field Country_Original, call the group field simply Country, and then hide Country_Original.)
Groups are implemented using SQL case statements. They have many uses, but one application is to easily tolerate some inconsistent spellings in your data source without having to change your data. In general, you can specify several transformations like this that take effect at query and visualization time. For very large data sets, or for very complicated transformations, you may eventually want to push some of them upstream in your data pipeline to get better performance. But make those optimizations later when you've proven the necessity.
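If you do eventually push the cleanup upstream, a minimal sketch of such a normalization step, mirroring what a Tableau group compiles to as a SQL CASE statement (the mapping and names are hypothetical):

# Hypothetical upstream cleanup: map spelling variants to one canonical
# name before the data reaches Tableau.
CANONICAL = {
    "canada": "Canada",
    "cananda": "Canada",  # known misspelling
}

def normalize_country(raw: str) -> str:
    """Return the canonical country name for a raw spelling variant."""
    return CANONICAL.get(raw.strip().lower(), raw.strip())

assert normalize_country("CANADA") == "Canada"
assert normalize_country("Cananda") == "Canada"
assert normalize_country("CAnada") == "Canada"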
If the differences are just in case (upper vs lower), you can right-click the Country dimension, and create a calculated field called something like "New Country", and use the following formula to make the case consistent:
upper([Country])
Use this new "New Country" calc dimension instead of your "Country" dimension, and it will group them all without case sensitivity, and display as uppercase. Or you can use "lower" instead of "upper" if preferred.

Redshift: Should the sortkey contain the distkey?

We have customer data that is sharded by a company ID. That is, no company's data would ever mix with another company's data, so this was chosen as the distkey.
Should the company ID be the first column in the sortkey given that a node may contain several thousand companies? Or does the distkey already limit the data to a given company before it starts scanning?
The distkey does not affect the order in which rows are stored in each node/slice/block. The sortkey (or the natural load order, in the absence of one) defines that order.
If you expect frequent queries filtering on company_id and you want maximum performance, make company_id the leading sort key column (COMPOUND, which is the default, not INTERLEAVED).
I'd also advise familiarising yourself with the SVL_QUERY_REPORT view. It can tell you whether full-scan was used (or range-restricted when using optimal sort keys), against which slices, and how many rows were actually scanned. Try different table layouts for the same data, and not only look at query times, but also confirm from this report that Redshift does what you expect it to do.
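As a concrete sketch of one layout to test (table and column names are hypothetical), held here as plain SQL strings you would run against the cluster:

# Hypothetical layout: company_id as the distkey and the leading column
# of a COMPOUND sortkey, so scans for one company are range-restricted.
layout_ddl = """
CREATE TABLE purchases (
    company_id BIGINT,
    created_at TIMESTAMP,
    amount     DECIMAL(12, 2)
)
DISTSTYLE KEY
DISTKEY (company_id)
COMPOUND SORTKEY (company_id, created_at);
"""

# After running a test query, inspect SVL_QUERY_REPORT for its query ID
# (from STL_QUERY). is_rrscan = 't' means the scan was range-restricted
# by the sortkey rather than a full scan; rows shows rows scanned per slice.
check_scan_sql = """
SELECT slice, segment, step, rows, is_rrscan
FROM svl_query_report
WHERE query = <your_query_id>  -- replace with the actual query ID
ORDER BY slice, segment, step;
"""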

Cassandra NOT EQUAL Operator

Question to all Cassandra experts out there.
I have a column family with about a million records.
I would like to query these records in a way that lets me perform a Not-Equal-To kind of operation.
I Googled this and it seems I have to use some sort of MapReduce.
Can somebody tell me what options are available in this regard?
I can suggest a few approaches.
1) If you have a limited number of values that you would like to test for not-equality, consider modeling those as boolean columns (e.g., a column isEqualToUnitedStates holding true or false).
2) Otherwise, consider emulating the unsupported query != X by combining the results of two separate queries, < X and > X, on the client side.
3) If your schema cannot support either type of query above, you may have to resort to writing custom routines that will do client-side filtering and construct the not-equal set dynamically. This will work if you can first narrow down your search space to manageable proportions, such that it's relatively cheap to run the query without the not-equal.
So let's say you're interested in all purchases by a particular customer of every product type except Widget. An ideal query would look something like SELECT * FROM purchases WHERE customer = 'Bob' AND item != 'Widget'; Of course, you cannot run this, but in this case you should be able to run SELECT * FROM purchases WHERE customer = 'Bob' without wasting too many resources, and apply the item != 'Widget' filter in the client application (see the sketch after this list).
4) Finally, if there is no way to restrict the data in a meaningful way before doing the scan (querying without the equality check would return too many rows to handle comfortably), you may have to resort to MapReduce. This means running a distributed job that scans all rows in the table across the cluster. Such jobs are obviously a lot slower than native queries and quite complex to set up. If you want to go this way, please look into the Cassandra Hadoop integration.
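Here is a minimal sketch of the client-side filtering from option 3, using the DataStax Python driver; the contact point, keyspace, and table schema are hypothetical:

# Option 3 sketch: run the cheap equality query server-side, then apply
# the unsupported != predicate client-side. Contact point, keyspace, and
# schema are hypothetical.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("shop")  # hypothetical keyspace

# Server-side: narrow the search space with the supported equality filter.
rows = session.execute(
    "SELECT customer, item, price FROM purchases WHERE customer = %s",
    ("Bob",),
)

# Client-side: keep everything except Widget.
non_widgets = [row for row in rows if row.item != "Widget"]
cluster.shutdown()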
If you want to use a not-equals operator on a specific partition key and get all other data from the table, you can use a combination of range queries and the TOKEN function from CQL to achieve this.
For example, if you want to fetch all rows except the ones whose partition key is 'abc', you execute the two queries below:
select <column1>,<column2> from <keyspace1>.<table1> where TOKEN(<partition_key_column_name>) < TOKEN('abc');
select <column1>,<column2> from <keyspace1>.<table1> where TOKEN(<partition_key_column_name>) > TOKEN('abc');
But beware that the result is going to be huge (depending on the size of the table and the fields you need), so you might want to use this in conjunction with a utility like dsbulk. Also note that there is no guarantee of ordering in the result. This is just a kind of data dump, which will most probably be useful for one-time data-migration-like scenarios.