How to set up a CSV or TXT file for uploading to Weka?

How should a TXT or CSV file be set up for uploading to Weka in order to use Apriori? I have tried setting it up as binary, but the associations don't seem to come out correctly. Assuming my database transactions are simple, like below, what would be the correct way to create a CSV or TXT file for uploading to Weka? The first column is the transaction id and the second is the items for that transaction.
1 --- {M,O,N,K,E,Y}
2 --- {D,O,N,K,E,Y}
3 --- {M,A,K,E}
4 --- {C,O,O,K,I,E}
5 --- {D,O,O,D,L,E}

Weka comes with an example dataset, supermarket, which is in the right format for Apriori-based market basket analysis (this article uses it).
Since Weka does not handle a variable number of attributes per row, each item that can be bought gets a separate column. If the item was bought, a t (= true) is stored; otherwise a ? (= missing value).
In your case, you would have to do something similar: e.g., create a CSV spreadsheet with a separate column for each item, filling it with t if the transaction contains that item and leaving it empty otherwise. For example:
id,A,C,D,E,I,K,L,M,N,O,Y
1,,,,t,,t,,t,t,t,t
2,,,t,t,,t,,,t,t,t
3,t,,,t,,t,,t,,,
4,,t,,t,t,t,,,,t,
5,,,t,t,,,t,,,t,
You can then load the dataset in the Weka Explorer and save it as ARFF (which will use ? for the missing values).
However, Apriori only handles nominal attributes and your ID attribute is numeric. You can either delete that attribute before running Apriori or turn it into a nominal attribute using the NumericToNominal filter in the Preprocess panel.
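For reference, the resulting ARFF would look roughly like this (a hand-written sketch - the relation name is made up and Weka's exact header output may differ slightly):
@relation transactions

@attribute id numeric
@attribute A {t}
@attribute C {t}
@attribute D {t}
@attribute E {t}
@attribute I {t}
@attribute K {t}
@attribute L {t}
@attribute M {t}
@attribute N {t}
@attribute O {t}
@attribute Y {t}

@data
1,?,?,?,t,?,t,?,t,t,t,t
2,?,?,t,t,?,t,?,?,t,t,t
3,t,?,?,t,?,t,?,t,?,?,?
4,?,t,?,t,t,t,?,?,?,t,?
5,?,?,t,t,?,?,t,?,?,t,?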

Related

GroupBy an existing attribute present in a JSON string line in Apache Beam (Java)

I am reading JSON files from GCS and I have to load the data into different BigQuery tables. These files may have multiple records for the same customer with different timestamps. I have to pick the latest among them for each customer. I am planning to achieve this as below:
Read files
Group by customer id
Apply a DoFn to compare the timestamps of the records in each group and keep only the latest one
Flatten it, convert to table rows and insert into BQ.
But I am unable to get past the grouping step. I see GroupByKey.create() but I am unable to make it use customer id as the key.
I am implementing this in Java. Any suggestions would be of great help. Thank you.
Before you GroupByKey, you need to have your dataset in key-value pairs. It would have been good if you had shown some of your code, but without knowing more, you'd do the following:
PCollection<JsonObject> objects = p.apply(FileIO.read(....)).apply(FormatData...);
// Once we have the data in JsonObjects, we key by customer ID and group.
// With a Java lambda, MapElements needs an explicit output type:
PCollection<KV<String, Iterable<JsonObject>>> groupedData =
    objects
        .apply(MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(),
                                      TypeDescriptor.of(JsonObject.class)))
            .via(elm -> KV.of(elm.getString("customerId"), elm)))
        .apply(GroupByKey.create());
Once that's done, you can check timestamps and discard all but the most recent, as you were thinking.
Note that you will need to set coders, etc. - if you get stuck with that, we can iterate.
As a hint/tip, you can consider this example of a Json Coder.
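To illustrate the "keep only the latest" step, here is a rough sketch of such a DoFn (assuming javax.json-style accessors and a numeric "timestamp" field - the field name and the comparison logic are assumptions about your data):
// Hypothetical sketch: emit the record with the largest "timestamp"
// value from each customer's group.
PCollection<JsonObject> latestPerCustomer = groupedData.apply(
    ParDo.of(new DoFn<KV<String, Iterable<JsonObject>>, JsonObject>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        JsonObject latest = null;
        for (JsonObject record : c.element().getValue()) {
          if (latest == null
              || record.getJsonNumber("timestamp").longValue()
                  > latest.getJsonNumber("timestamp").longValue()) {
            latest = record;
          }
        }
        if (latest != null) {
          c.output(latest);
        }
      }
    }));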

Building an app to upload a CSV to an Oracle 12c database via Apex

I've been asked to create an app in Oracle Apex that will allow me to drop a CSV file. The file contains a list of all active physicians and associated info in my area. I do not know where to begin! Requirements:
- after dropping the CSV file into Apex, remove unnecessary columns
- edit the data in each field, i.e. if a phone # is > 7 characters and begins with 1, remove the 1, or remove all special characters from a column
- the CSV contains physicians of every specialty; I only want to upload specific specialties to the database table
I have a small amount of SQL experience from Uni, and I know some HTML and CSS, but beyond that I am lost. Please help!
I began the Oracle Apex tutorial and created an upload wizard in a dev environment. The intended flow:
User drops a CSV file into Apex
Apex edits columns to remove unnecessary characters
Only uploads specific columns from the CSV file
Only adds data when the "Specialties" column matches specific specialties
Does not add redundant data (if the physician is already in the table, do nothing)
Produces a report showing all new physicians added to the table
Huh, you're in deep trouble, as you have to do a job using a tool you don't know at all, with limited knowledge of SQL. Yes, it is said that Apex is simple to use, but nonetheless ... you have to know at least something. Otherwise, as you said, you're lost.
See if the following helps.
there's the CSV file
create a table in your database whose description matches the CSV file; include all the columns it contains, and pay attention to datatypes, column lengths and such (see the example after this list)
this table will be "temporary" - you'll use it every day to load data from CSV files: first you'll delete everything it contains, then load the new rows
using the Apex "Create page" wizard, create the "Data loading" process. Follow the instructions (and/or read the documentation about it). Once you're done, you'll have 4 new pages in your Apex application
when you run it, you should be able to load the CSV file into that temporary table
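For example, the "temporary" table might look something like this (column names and lengths are invented; yours must match your CSV):
CREATE TABLE temp_physicians (
  first_name  VARCHAR2(50),
  last_name   VARCHAR2(50),
  phone       VARCHAR2(30),
  specialty   VARCHAR2(100)
);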
That's the first stage - successfully load data into the database. Now, the second stage: fix what's wrong.
create another table in the database; it will be the "target" table and is supposed to contain only the data you need (i.e. a subset of the temporary table). If such a table already exists, you don't have to create a new one.
create a stored procedure. It will read data from the temporary table and fix everything you've mentioned (remove special characters, remove the leading "1", ...) - see the sketch after this list
as you have to skip physicians that already exist in the target table, use NOT IN or NOT EXISTS
then insert the "clean" data into the target table
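A minimal sketch of such a procedure, reusing the invented table and column names from above (adjust the cleanup rules and the list of specialties to your actual data):
CREATE OR REPLACE PROCEDURE move_physicians AS
BEGIN
  INSERT INTO physicians (first_name, last_name, phone, specialty)
  SELECT t.first_name,
         t.last_name,
         -- strip special characters, then drop the leading "1" from long numbers
         CASE
           WHEN LENGTH(REGEXP_REPLACE(t.phone, '[^0-9]', '')) > 7
                AND REGEXP_REPLACE(t.phone, '[^0-9]', '') LIKE '1%'
           THEN SUBSTR(REGEXP_REPLACE(t.phone, '[^0-9]', ''), 2)
           ELSE REGEXP_REPLACE(t.phone, '[^0-9]', '')
         END,
         t.specialty
    FROM temp_physicians t
   WHERE t.specialty IN ('CARDIOLOGY', 'ONCOLOGY')  -- only the specialties you need
     AND NOT EXISTS (SELECT NULL
                       FROM physicians p
                      WHERE p.first_name = t.first_name
                        AND p.last_name  = t.last_name);
END;
/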
That stored procedure will be executed after the Apex loading process is done; a simple way to do that is to create a button on the last page which will - when pressed - call the procedure.
The final stage is the report:
as you have to show new physicians, consider adding a column (to the target table) that holds a timestamp (perhaps DATE is enough, if you'll be doing this once a day) or a process_id (all rows inserted in the same run share the same value) so that you can distinguish newly added rows from the old ones
the report itself would be an Interactive report. Why? Because it is easy to create and lets you (or end users) adjust it according to their needs (filter data, sort rows differently, ...)
Good luck! You'll need it.

What is the best approach to load data into Hive using NiFi?

I have started working with NiFi. I am working on a use case to load data into Hive. I get a CSV file and then use SplitText to split the incoming flow-file into multiple flow-files (splitting record by record). Then I use ConvertToAvro to convert the split CSV files into Avro files. After that, I put the Avro files into a directory in HDFS and trigger the "LOAD DATA" command using the ReplaceText + PutHiveQL processors.
I'm splitting the file record by record in order to get the partition value (since LOAD DATA doesn't support dynamic partitioning). The flow looks like this:
GetFile (CSV) --- SplitText (split line count: 1 and header line count: 1) --- ExtractText (use RegEx to get the partition fields' values and assign them to attributes) --- ConvertToAvro (specifying the schema) --- PutHDFS (writing to an HDFS location) --- ReplaceText (LOAD DATA cmd with partition info) --- PutHiveQL
The thing is, since I'm splitting the CSV file one record at a time, it generates too many Avro files. For example, if the CSV file has 100 records, it creates 100 Avro files. Since I want to get the partition values, I have to split it one record at a time. I want to know whether there is any way to achieve this without splitting record by record - batching it, say. I'm quite new to this, so I haven't been able to crack it yet. Help me with this.
PS: Do suggest an alternate approach if there is one for this use case.
Are you looking to group the Avro records based on the partitions' values, one Avro file per unique value? Or do you only need the partitions' values for some number of LOAD DATA commands (and use a single Avro file with all the records)?
If the former, then you'd likely need a custom processor or ExecuteScript, since you'd need to parse, group/aggregate, and convert all in one step (i.e. for one CSV document). If the latter, then you can rearrange your flow into:
GetFile -> ConvertCSVToAvro -> PutHDFS -> ConvertAvroToJSON -> SplitJson -> EvaluateJsonPath -> ReplaceText -> PutHiveQL
This flow puts the entire CSV file (as a single Avro file) into HDFS, and afterwards does the split (after converting to JSON, since we don't have an EvaluateAvroPath processor), gets the partition value(s), and generates the Hive statements (LOAD DATA).
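For each record, the ReplaceText processor would then produce a statement along these lines (the path, table and partition names are invented for illustration):
LOAD DATA INPATH '/user/nifi/staging/data.avro'
INTO TABLE orders PARTITION (order_date = '2017-01-01');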
If you've placed the file (using the PutHDFS processor) at the location the Hive table reads its data from, then you don't need to call the PutHiveQL processor at all. I am also new to this, but I think you should leverage the schema-on-read capability of Hive.

Upload a CSV file to an attribute-value model

Recently I have been dealing with the following problem:
I am trying to build an "archive" for storing and retrieving data from various sources, so the data will always have a different number of columns and rows. I think that allowing the user to create new tables just to store those CSV files (each in a separate table) would be a serious violation of web development guidelines and also difficult to achieve in Django. That's why I came up with the idea of an attribute-value format for storing the data, but I don't know how to implement it in Django.
I want to build a form in the Django Admin to allow the user to upload a CSV file with N columns into a table that contains only two columns: 1) the name of the column from the CSV file and 2) the value for that column (more precisely, three value columns: one for integers, one for floats and one for strings). To do that I must of course "melt" the data from the CSV file into a "long" format, so that the file:
col1 | col2 | col3
23 | 45.0 | 32
becomes:
key  | val
col1 | 23
col2 | 45.0
col3 | 32
And that I know how to do. However, I do not know whether it is possible to process a file uploaded by the user into such a format and, later, how to retrieve the data in a simple, Django way.
Do you know of any such extensions/widgets, or how to approach the problem - or even how to google it? I have done my research; however, I only found general approaches for dynamic models, and I don't think my case requires them:
http://en.wikipedia.org/wiki/Entity-attribute-value_model
and here's the dynamic model approach:
https://pypi.python.org/pypi/django-dynamo - however, I am not sure it's the right answer.
So my guess is that I do not understand django really that well, but I'd be grateful for some directions.
No, you don't need a dynamic model, and you should avoid EAV (Entity-Attribute-Value) schemas; it's bad design.
Read here for how to process an uploaded file.
See here for how to override the save() instance method. This is probably what you'll need to do.
Also, keep in mind that what you call melting is called serializing. It is helpful to know the right terms and definitions when searching for these topics.
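To make that concrete, melting an uploaded CSV into such a key/value model could look roughly like this (a sketch only - the Observation model, the field names and the naive type-sniffing are all invented for illustration):
import csv
import io

from django.db import models


class Observation(models.Model):
    # One row per (column, value) pair melted out of the uploaded CSV.
    key = models.CharField(max_length=100)
    int_value = models.IntegerField(null=True, blank=True)
    float_value = models.FloatField(null=True, blank=True)
    str_value = models.CharField(max_length=255, null=True, blank=True)


def melt_csv(uploaded_file):
    # uploaded_file is a Django UploadedFile (binary mode), so wrap it for text.
    text = io.TextIOWrapper(uploaded_file, encoding="utf-8")
    for row in csv.DictReader(text):
        for key, raw in row.items():
            obs = Observation(key=key)
            try:
                obs.int_value = int(raw)           # try integer first
            except ValueError:
                try:
                    obs.float_value = float(raw)   # then float
                except ValueError:
                    obs.str_value = raw            # fall back to string
            obs.save()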

Informatica target file

I have a workflow which writes data from a table into a flat file. It works just fine, but I want to insert a blank line between records. How can this be achieved? Any pointers?
You can create 2 target instances: one with the proper data, and in the other instance pass a blank line. Set the Merge Type to "Concurrent Merge" in the session properties.
Multiple possibilities -
You can prepare the appropriate dataset in a relational table and afterwards dump the data from it into a flat file. When preparing that dataset, you can insert blank rows into the relational target.
Send a blank line to a separate target file (based on some business condition, using a Router or something similar); after that, you can use the merge files option (in the session config) to get the data into a single file.