I am new to Informatica. I have created a mapping that uses Expression and Sorter transformations to load multiple files into one single file with 2 columns:
1 data
2 seq number
All 10 files have sequence numbers in random order, for example:
file1
erfef 3
abcdn 1
file2
wewewr 4
wderfv 5
and so on up to 10 files.
The Expression transformation code is:
INTEGER(LTRIM(RTRIM(seq_num)),TRUE)
What I want is to load all the files into one big file, sorted by the sequence number. I get data in the output file, but in the wrong sequence order. How do I get the data into the final target with the correct sequence number?
I am doing exactly what is mentioned in the solution below but am still getting the wrong output, like:
erfef 3
abcdn 1
wewewr 4
wderfv 5
whereas it should be like:
abcdn 1
erfef 3
wewewr 4
wderfv 5
Thanks in advance!
Use an indirect file load with a list of files to load all the files together. Then use a Sorter on col2 to order the data. Finally, use a file target to store the data.
Whole mapping should be like this -
SQ --> EXP--> SRT(key = col2) --> Target
A few things to note:
In the session, set the source filetype to indirect and give the name of a list file, e.g. filelist1.txt.
Use ls -1 file* >filelist1.txt in a pre-session command task to create a file list with all the required files.
Expression transformation: convert col2 to INTEGER if it comes out of the SQ as a string.
Sorter transformation: use col2 as the key column.
Using an indirect file source is one way.
Another way is to use a command as the source and specify a command that will emit the data from all the files, like cat file*.csv.
Just change the Input Type to Command and provide the command; all of this can be set by editing the session -> Mapping tab -> Source -> properties.
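If you want to sanity-check what the flow should produce outside the tool, here is a rough Python sketch of the same pipeline; the file names and the space delimiter are assumptions based on the sample data above:

    # Rough equivalent of SQ (indirect load) --> EXP --> SRT(key = col2) --> target
    import glob

    rows = []
    for path in sorted(glob.glob("file[0-9]*")):   # the files listed in filelist1.txt
        with open(path) as f:
            for line in f:
                data, seq = line.split()           # col1 = data, col2 = seq number
                rows.append((data, int(seq)))      # like INTEGER(LTRIM(RTRIM(seq_num)),TRUE)

    rows.sort(key=lambda row: row[1])              # Sorter with key = col2
    with open("target.out", "w") as out:
        for data, seq in rows:
            out.write(f"{data} {seq}\n")

If the order is still wrong in Informatica after this check, the usual culprit is col2 reaching the Sorter as a string instead of an integer, which is exactly what the Expression transformation step above guards against.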
I am new to AWS services. I am trying to create an Athena table from a .csv file in an S3 bucket, and I have also created a crawler for it. My CSV file originally contains the input data below.
   name             designation  zip code  build no  address
1  siddarth,james   professor    522135    3         mla colony
2  roy,deshmukh     software     412230    1         sez apartments
3  viliam,mckesson  accountant   628139    10        oakland road
After creating the table in Athena, I am getting the output below.
   name      designation  zip_code    build_no  address
1  siddarth  james        professor
2  roy       deshmukh     software
3  viliam    mckesson     contractor
The data is not populating in the proper format of my CSV file. My required output should be like:
name             designation  zip code  build no  address
siddarth,james   professor    522135    3         mla colony
roy,deshmukh     software     412230    1         sez apartments
viliam,mckesson  accountant   628139    10        oakland road
Could anyone help me create the table in Athena from the .csv data in the S3 bucket?
Could you provide a plain-text example line of your CSV? What's shown above is certainly not valid comma-separated values, and it's hard to give advice based on that.
Anyway, it looks like your first column contains commas. So it's all about how your values are quoted and how special characters are escaped so that they are processed correctly.
Have a look at OpenCSVSerDe and the respective documentation here. This should help...
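For illustration, here is a hedged sketch of a table definition using OpenCSVSerDe, submitted through boto3. The table name, database, bucket paths, and all-string column types are placeholder assumptions, not values from your setup; it presumes the name field is quoted in the underlying CSV (e.g. "siddarth,james"):

    # Sketch only: OpenCSVSerDe honours quoting, so a quoted field like
    # "siddarth,james" stays in one column instead of being split at the comma.
    import boto3

    ddl = """
    CREATE EXTERNAL TABLE IF NOT EXISTS people (
      name        string,
      designation string,
      zip_code    string,
      build_no    string,
      address     string
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES ('separatorChar' = ',', 'quoteChar' = '"')
    LOCATION 's3://your-bucket/path/'
    TBLPROPERTIES ('skip.header.line.count' = '1')
    """

    athena = boto3.client("athena")
    athena.start_query_execution(
        QueryString=ddl,
        QueryExecutionContext={"Database": "your_database"},
        ResultConfiguration={"OutputLocation": "s3://your-bucket/athena-results/"},
    )

You could equally paste just the CREATE EXTERNAL TABLE statement into the Athena query editor; the important part is the SerDe and its quoting properties.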
I am new to Weka. I have a dataset in CSV with 5000 samples; here are 20 samples of it. When I load this dataset into Weka it looks OK, but when I run the kNN algorithm it gives a result it is not supposed to give. Here is the sample data.
a,b,c,d
74,85,123,1
73,84,122,1
72,83,121,1
70,81,119,1
70,81,119,1
69,80,118,1
70,81,119,1
70,81,119,1
76,87,125,1
76,87,125,1
82,92,146,2
74,86,140,2
68,80,134,2
64,76,130,2
64,75,132,2
83,96,152,2
72,85,141,2
71,83,141,2
69,81,139,2
65,79,137,2
Here is the result:
=== Cross-validation ===
=== Summary ===

Correlation coefficient                  0.6148
Mean absolute error                      0.2442
Root mean squared error                  0.4004
Relative absolute error                 50.2313 %
Root relative squared error             81.2078 %
Total Number of Instances             5000
It is supposed to give this kind of result instead:

Correctly Classified Instances          69               92 %
Incorrectly Classified Instances         6                8 %
What could be the problem? What am I missing? I tried all the other algorithms, and they all give the same kind of output. The sample Weka datasets all work as expected.
The IBk algorithm can be used for regression (predicting the value of a numeric response for each instance) as well as for classification (predicting which class each instance belongs to).
It looks like all the values of the class attribute in your dataset (column d in your CSV) are numbers. When you load this data into Weka, Weka therefore guesses that this attribute should be treated as a numeric one, not a nominal one. You can tell this has happened because the histogram for that attribute in the Preprocess tab shows a single numeric distribution instead of bars coloured by class.
The result you're seeing when you run IBk is the result of a regression fit (predicting a numeric value of column d for each instance) instead of a classification (selecting the most likely nominal value of column d for each instance).
To get the result you want, you need to tell Weka to treat this attribute as nominal. When you load the csv file in the Preprocess tab, check Invoke options dialog in the file dialog window. Then when you click Open, you'll get the CSVLoader options window.
The field nominalAttributes is where you can give Weka a list of which attributes are nominal ones even if they look numeric. Entering 4 here will specify that the fourth attribute (column) in the input is a nominal attribute. Now IBk should behave as you expect.
You could also do this by applying the NumericToNominal unsupervised attribute filter to the already loaded data, again specifying attribute 4; otherwise the filter will apply to all the attributes.
The ARFF format used for the Weka sample datasets includes a specification of which attributes are which type. After you've imported (or filtered) your dataset as above, you can save it as ARFF and you'll then be able to reload it without having to go through the same process.
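If you prefer to fix the file once and for all outside the GUI, a small script can rewrite the CSV as ARFF with the class declared nominal. A minimal sketch, assuming your file is named data.csv and has the a,b,c,d header shown above:

    # Rewrite data.csv as data.arff, declaring the last column (d) nominal
    # so that IBk runs as a classifier instead of doing regression.
    import csv

    with open("data.csv") as f:
        rows = list(csv.reader(f))

    header, data = rows[0], rows[1:]
    labels = sorted({row[-1] for row in data})        # class values, e.g. 1 and 2

    with open("data.arff", "w") as out:
        out.write("@relation mydata\n")
        for name in header[:-1]:
            out.write(f"@attribute {name} numeric\n")
        out.write(f"@attribute {header[-1]} {{{','.join(labels)}}}\n")
        out.write("@data\n")
        for row in data:
            out.write(",".join(row) + "\n")

Opening the resulting data.arff in Weka then shows d as a nominal class, with no options dialog needed.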
My data is stored in a file with 8 attributes.
I need association rule mining done using the Apriori algorithm in WEKA, with rules such as: if item 1 and item 2 are bought --> item 4 is also bought, or something similarly sensible.
What I tried:
Converting the file into .arff format and loading it into Weka.
Turning all attributes into Nominal and running the Apriori algorithm.
But the rules generated are very strange. The result carries no useful information: there are no rules of the kind I want, nothing that actually defines which item a user will buy together with which other item. How should I preprocess this data to format it well? If I am making any other mistake, pointing it out would be really appreciated.
I would like to use Apriori to carry out affinity analysis on transaction data. I have a table with a list of orders and their information. I mainly need the OrderID and ProductID attributes, which are in the following format:
OrderID ProductID
1 A
1 B
1 C
2 A
2 C
3 A
Weka requires you to create a nominal attribute for every product ID and to specify whether the item is present in the order using a true or false value, like this:
1, TRUE, TRUE, TRUE
2, TRUE, FALSE, TRUE
3, TRUE, FALSE, FALSE
My dataset contains about 10k records and about 3k different products. Can anyone suggest a way to create the dataset in this format (besides a manual, time-consuming way)?
How about writing a script to convert it? It should be less than 10 lines in a good scripting language such as Python; see the sketch after this answer.
Or you may look into options for pivoting the relation as desired.
Either way, it is a straightforward programming task, so I don't see your question here.
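For instance, a minimal sketch of the conversion, assuming a headerless two-column CSV named orders.csv (both the file name and layout are assumptions):

    # Pivot (OrderID, ProductID) pairs into one TRUE/FALSE basket row per order.
    import csv
    from collections import defaultdict

    baskets = defaultdict(set)        # OrderID -> set of ProductIDs in that order
    products = set()

    with open("orders.csv") as f:
        for order_id, product_id in csv.reader(f):
            baskets[order_id].add(product_id)
            products.add(product_id)

    columns = sorted(products)        # one nominal attribute per distinct product
    with open("baskets.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["OrderID"] + columns)
        for order_id, items in sorted(baskets.items()):
            writer.writerow([order_id] + ["TRUE" if p in items else "FALSE" for p in columns])

With about 3k distinct products this yields a wide file, matching the TRUE/FALSE layout shown in the question.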
You obviously need to convert your data. The easiest way is to write a program, in whatever programming language you are most familiar with, that reads the file and writes it back out in the appropriate format. Since these are text files, it should not be too complicated.
By the way, if you want more algorithms for pattern mining and association mining than just Apriori in Weka, you could check my software SPMF ( http://www.philippe-fournier-viger.com/spmf/ ), which is also in Java, can read ARFF files too, and offers about 50 algorithms specialized in pattern mining (Apriori, FPGrowth, and many others).
Your data is formatted correctly as-is for analysis in R using the arules package (and its apriori function). You might consider checking it out, especially if you're not able to get into script coding.