Upload a CSV file to an attribute-value model - Django

Recently I have been dealing with the following problem:
I am trying to build an "archive" for storing and retrieving data from various sources, so the data will always have a different number of columns and rows. I think that allowing the user to create new tables just to store those CSV files (one table per file) would be a serious violation of web development guidelines and also difficult to achieve in Django. That's why I came up with the idea of an attribute-value format for storing the data, but I don't know how to implement it in Django.
I want to build a form in the Django admin that allows the user to upload a CSV file with N columns into a table that contains only two columns: 1) the name of the column from the CSV file and 2) the value for that column (more precisely: three value columns: one for integers, one for floats and one for strings). To do that I must of course "melt" the data from the CSV file into a "long" format, so the file:
col1 | col2 | col3
23 | 45.0 | 32
becomes:
key  | val
col1 | 23
col2 | 45.0
col3 | 32
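For reference, a rough sketch of the melting step itself (plain csv module; picking the right value column by type is left out here):

import csv

def melt(csv_path):
    # Yield (column_name, value) pairs for every data row of the file.
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)  # the first row is treated as the header
        for row in reader:
            for key, value in row.items():
                yield key, value

# melt("data.csv") -> ("col1", "23"), ("col2", "45.0"), ("col3", "32"), ...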
Melting itself I know how to do, roughly as sketched above. However, I do not know whether it is possible to process a file uploaded by the user into such a format and, later, how to retrieve the data in a simple, Django-friendly way.
Do you know of any such extensions/widgets, or how to approach the problem, or even how to google it? I have done my research, but I found only general approaches for dynamic models, and I don't think my case requires them:
http://en.wikipedia.org/wiki/Entity-attribute-value_model
and here's the dynamic-model approach:
https://pypi.python.org/pypi/django-dynamo - however, I am not sure it's the right answer.
So my guess is that I do not really understand Django that well, but I'd be grateful for some directions.

No, you don't need a dynamic model, and you should avoid EAV (Entity-Attribute-Value) schemas. It's bad design.
Read here for how to process an uploaded file.
See here for how to override the save() instance method. This is probably what you'll need to do.
Also, keep in mind that what you call melting is called serializing. It is helpful to know the right terms and definitions when searching for these topics.
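A minimal sketch of how that could fit together (model, field, and function names here are made up, not an established pattern):

# models.py
import csv
import io

from django.db import models

class Upload(models.Model):
    source_file = models.FileField(upload_to="csv/")
    uploaded_at = models.DateTimeField(auto_now_add=True)

class DataPoint(models.Model):
    upload = models.ForeignKey(Upload, on_delete=models.CASCADE)
    key = models.CharField(max_length=255)               # column name from the CSV
    int_value = models.IntegerField(null=True, blank=True)
    float_value = models.FloatField(null=True, blank=True)
    str_value = models.TextField(null=True, blank=True)

def melt_upload(upload):
    # Parse the uploaded CSV and store it in long (key/value) format.
    # Call this from Upload.save() or from an admin action after the file is saved.
    text = upload.source_file.read().decode("utf-8")
    points = []
    for row in csv.DictReader(io.StringIO(text)):
        for key, raw in row.items():
            point = DataPoint(upload=upload, key=key)
            try:
                point.int_value = int(raw)
            except ValueError:
                try:
                    point.float_value = float(raw)
                except ValueError:
                    point.str_value = raw
            points.append(point)
    DataPoint.objects.bulk_create(points)

Getting the data back out is then plain ORM, e.g. DataPoint.objects.filter(upload=some_upload, key="col1").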

Related

GroupBy existing attribute present in JSON string line in Apache Beam Java

I am reading JSON files from GCS and I have to load the data into different BigQuery tables. These files may have multiple records for the same customer with different timestamps, and I have to pick the latest one for each customer. I am planning to achieve this as below:
Read files
Group by customer id
Apply DoFn to compare timestamp of records in each group and have only latest one from them
Flatten it, convert to table rows and insert into BQ.
But I am unable to proceed with step 1. I see GroupByKey.create() but am unable to make it use the customer id as the key.
I am implementing this in Java. Any suggestions would be of great help. Thank you.
Before you can apply GroupByKey you need to have your dataset in key-value pairs. It would be good if you had shown some of your code, but without knowing much more, you'd do something like the following:
PCollection<JsonObject> objects = p.apply(FileIO.read(....)).apply(FormatData...);
// Once we have the data in JsonObjects, we key each element by customer ID and group:
PCollection<KV<String, Iterable<JsonObject>>> groupedData =
    objects
        .apply(MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(),
                                      TypeDescriptor.of(JsonObject.class)))
            .via(elm -> KV.of(elm.getString("customerId"), elm)))
        .apply(GroupByKey.create());
Once that's done, you can check timestamps and discard all but the most recent, as you were thinking.
Note that you will need to set coders, etc. - if you get stuck with that we can iterate.
As a hint/tip, you can consider this example of a JSON Coder.
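If it helps to see the overall shape, here is roughly the same pipeline sketched with the Beam Python SDK (the customerId and timestamp field names and the GCS path are assumptions):

import json
import apache_beam as beam

with beam.Pipeline() as p:
    latest = (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.json")
        | "Parse" >> beam.Map(json.loads)
        | "KeyByCustomer" >> beam.Map(lambda rec: (rec["customerId"], rec))
        | "Group" >> beam.GroupByKey()
        | "PickLatest" >> beam.Map(
            lambda kv: max(kv[1], key=lambda rec: rec["timestamp"]))
    )
    # `latest` can then be formatted into table rows and written with
    # beam.io.WriteToBigQuery.

The Java version follows the same steps: key the elements, GroupByKey, then a DoFn or MapElements that keeps the newest record per key.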

Django - Determine model fields and create model at runtime based on CSV file header

I need to determine the best approach for defining the structure of my Django app's models at runtime, based on the structure of an uploaded CSV file; the models will then be held constant once they are created in Django.
I have come across several questions relating to dynamically creating/altering Django models at runtime. The consensus was that this is bad practice and that one should know beforehand what the fields are.
I am creating a site where a user can upload a time-series based CSV file with many columns representing sensor channels. The user must then be able to select a field to plot the corresponding data for that field. The data will be approximately 1 billion rows.
Essentially, I am seeking to code the following steps, but information is scarce and I have never done a job like this before:
User selects a CSV (or DAT) file.
The app then loads only the header row (these files are > 4 GB).
The header row is split by ",".
I use the results from step 3 to create a table for each channel (column), with the name of the field the same as the individual header entry for that specific channel.
I then load the corresponding data into the respective tables, and I have my models for my app, which will then not be changed again.
Another option I am considering is creating a model with 10 fields, as I know there will never be more than 10 channels, then reading my CSV into the table when a user loads a file and just leaving the unused fields empty.
Has anyone had experience with similar applications?
That is a lot of records; I have never worked with so many. For performance, the fixed-fields idea sounds best. If you use PostgreSQL you could also look at the JSON field, but I don't know its impact with so many rows.
For flexible models you could use the EAV pattern, but in my experience that only works for small data sets.
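A rough sketch of the fixed-fields idea, plus the JSON-field variant (model and field names are made up; unused channels simply stay NULL):

from django.db import models

class Sample(models.Model):
    source_file = models.CharField(max_length=255)
    timestamp = models.DateTimeField()
    channel_1 = models.FloatField(null=True, blank=True)
    channel_2 = models.FloatField(null=True, blank=True)
    # ... up to channel_10; files with fewer channels leave the rest NULL

class ChannelMap(models.Model):
    # remembers which CSV header name ended up in which channel_N column
    source_file = models.CharField(max_length=255)
    column_name = models.CharField(max_length=255)
    channel_number = models.PositiveSmallIntegerField()

# JSON-field alternative (models.JSONField needs Django 3.1+,
# or use the older contrib.postgres JSONField):
class SampleJSON(models.Model):
    source_file = models.CharField(max_length=255)
    timestamp = models.DateTimeField()
    values = models.JSONField()  # {"header name": value, ...}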

Alternatives to dynamically creating model fields

I'm trying to build a web application where users can upload a file (specifically, the MDF file format) and view the data in the form of various charts. The files can contain any number of time-based signals (of various numeric data types), and users may name the signals wildly.
My thoughts on saving the data involve two steps:
Maintain a master table as an index, to save meta information such as file names, who uploaded them, when, etc. A record (row) is added each time a new file is uploaded.
Create a new table (I'll refer to these as data tables) for each file uploaded; within the table, each column will be one signal (the first column being timestamps).
This brings the problem that I can't pre-define the model for the data tables, because the number, names, and data types of the fields will differ among virtually all uploaded files.
I'm aware of some libs that help build runtime dynamic models, but they're all dated and questions about them on SO basically get zero answers. So despite the effort to make them work, I'm not even sure my approach is the optimal way to do what I want.
I also came across this Postgres-specific model field which can take nested arrays (which I believe fits the 2-D time-based signal lists). In theory I could parse the raw uploaded file, construct such an array, and basically save all the data in one field. Not knowing the limit on the size of the data, this could also be a nightmare for queries later on, since creating the charts usually takes only a few columns of signals at a time, compared to a total of up to hundreds of signals.
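For reference, the nested-array idea would look roughly like this with Django's PostgreSQL ArrayField (names are made up):

from django.db import models
from django.contrib.postgres.fields import ArrayField

class SignalFile(models.Model):
    name = models.CharField(max_length=255)
    uploaded_at = models.DateTimeField(auto_now_add=True)
    signal_names = ArrayField(models.CharField(max_length=255))
    # 2-D array: one inner list per signal, the first holding the timestamps
    data = ArrayField(ArrayField(models.FloatField()))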
So my question is:
Is there a better way to organize the storage of data? And how?
Any insight is greatly appreciated!
If the name, number, and data types of the fields will differ for each user, then you do not need an ORM. What you need is a query builder or SQL string composition like Psycopg's. You will be programmatically creating a table for each combination of user and uploaded file (if they are different) and programmatically inserting the records.
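A rough sketch with psycopg2's sql module (table and signal names come from the uploaded file, so they are composed as identifiers rather than formatted into the query string; all names below are made up):

from psycopg2 import sql

def create_data_table(conn, table_name, signal_names):
    # One table per uploaded file: a timestamp column plus one column per signal.
    columns = [sql.SQL("ts double precision")]
    columns += [
        sql.SQL("{} double precision").format(sql.Identifier(name))
        for name in signal_names
    ]
    query = sql.SQL("CREATE TABLE {} ({})").format(
        sql.Identifier(table_name), sql.SQL(", ").join(columns)
    )
    with conn.cursor() as cur:
        cur.execute(query)
    conn.commit()

def insert_rows(conn, table_name, rows):
    # rows is a list of tuples, each (ts, signal_1, signal_2, ...)
    placeholders = sql.SQL(", ").join(sql.Placeholder() * len(rows[0]))
    query = sql.SQL("INSERT INTO {} VALUES ({})").format(
        sql.Identifier(table_name), placeholders
    )
    with conn.cursor() as cur:
        cur.executemany(query, rows)
    conn.commit()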
Using PostgreSQL might be a good choice; you might also create a GIN index on the arrays to speed up queries.
However, if you are primarily working with time-series data, then using a time-series database like InfluxDB or Prometheus makes more sense.

How do I join huge CSV files (1000s of columns x 1000s of rows) efficiently using C/C++?

I have several (1-5) very wide (~50,000 columns) .csv files. The files are 0.5-1 GB in size (average size around 500 MB). I need to perform a join on the files on a pre-specified column. Efficiency is, of course, the key. Any solution that can be scaled out to efficiently allow multiple join columns is a bonus, though not currently required. Here are my inputs:
-Primary File
-Secondary File(s)
-Join column of Primary File (name or col. position)
-Join column of Secondary File (name or col. position)
-Left Join or Inner Join?
Output = 1 File with results of the multi-file join
I am looking to solve the problem using a C-based language, but of course an algorithmic solution would also be very helpful.
Assuming that you have a good reason not to use a database (for all I know, the 50,000 columns may constitute such a reason), you probably have no choice but to clench your teeth and build yourself an index for the right file. Read through it sequentially to populate a hash table where each entry contains just the key column and an offset in the file where the entire row begins. The index itself then ought to fit comfortably in memory, and if you have enough address space (i.e. unless you're stuck with 32-bit addressing) you should memory-map the actual file data so you can access and output the appropriate right rows easily as you walk sequentially through the left file.
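To make the indexing idea concrete, here is the same scheme sketched in Python (naive comma splitting, no quoted-field handling; in C/C++ the dict becomes a hash table and the seek/readline becomes a pointer into the memory-mapped secondary file):

import csv

def build_offset_index(path, key_col):
    # Map each join-key value to the byte offset where its row starts.
    index = {}
    with open(path, "rb") as f:
        header = f.readline().decode().rstrip("\r\n").split(",")
        key_idx = header.index(key_col)
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            key = line.decode().rstrip("\r\n").split(",")[key_idx]
            index[key] = offset
    return index, header

def left_join(primary_path, secondary_path, primary_key, secondary_key, out_path):
    index, sec_header = build_offset_index(secondary_path, secondary_key)
    with open(primary_path, newline="") as left, \
         open(secondary_path, "rb") as right, \
         open(out_path, "w", newline="") as out:
        reader = csv.reader(left)
        writer = csv.writer(out)
        left_header = next(reader)
        key_idx = left_header.index(primary_key)
        writer.writerow(left_header + sec_header)
        for row in reader:
            offset = index.get(row[key_idx])
            if offset is None:
                writer.writerow(row + [""] * len(sec_header))  # left join: no match
                continue
            right.seek(offset)
            sec_row = right.readline().decode().rstrip("\r\n").split(",")
            writer.writerow(row + sec_row)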
Your best bet by far is something like SQLite; there are C++ bindings for it and it's tailor-made for lightning-fast inserts and queries.
For the actual reading of the data, you can just go row by row and insert the fields into SQLite; no need for cache-destroying objects of objects :) As an optimization, you should group multiple inserts into one statement (insert into table(...) select ... union all select ... union all select ...).
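A sketch of that approach with Python's sqlite3 module, storing the join key plus the raw row text in each table (a variation that sidesteps SQLite's column-count limits for these very wide files; file and column names are made up):

import csv
import sqlite3

def load_csv(conn, path, table, key_col):
    # Store each row as (key, full_row_text) so the 50,000 columns never become schema.
    conn.execute(f"CREATE TABLE {table} (key TEXT, row TEXT)")
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        key_idx = header.index(key_col)
        conn.executemany(
            f"INSERT INTO {table} VALUES (?, ?)",
            ((r[key_idx], ",".join(r)) for r in reader),
        )
    conn.execute(f"CREATE INDEX idx_{table} ON {table} (key)")
    conn.commit()

conn = sqlite3.connect("join.db")
load_csv(conn, "primary.csv", "primary_file", "id")
load_csv(conn, "secondary.csv", "secondary_file", "id")
rows = conn.execute(
    "SELECT p.row, s.row FROM primary_file p LEFT JOIN secondary_file s ON p.key = s.key"
)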
If you need to use C or C++, open the file and load the data directly into a database such as MySQL. The C and C++ languages do not have adequate data-table structures or functionality for manipulating the data. A spreadsheet application may be useful, but may not be able to handle these capacities.
That said, I recommend objects for each field (column). Define a record (file-specific) as a collection of fields. Read a text line from a file into a string. Let the record load the field data from the string. Store the records in a vector.
Create a new record for the destination file. For each record from the input file(s), load the new record using those fields. Finally, for each record, print the contents of each field with separator characters.
An alternative is to whip up a two-dimensional matrix of strings.
Your performance bottleneck will be I/O. You may want to read huge blocks of data in. The thorn in the efficiency is the variable record length of a CSV file.
I still recommend using a database. There are plenty of free ones out there, such as MySQL.
It depends on what you mean by "join". Are the columns in file 1 the same as in file 2? If so, you just need a merge sort, and most likely a solution based on merge sort is "best". But I agree with @Blindy above that you should use an existing tool like SQLite. Such a solution is probably more future-proof against changes to the column lists.

SQLite3 for Serialization Purposes

I've been tinkering with SQLite3 for the past couple of days, and it seems like a decent database, but I'm wondering about its uses for serialization.
I need to serialize a set of key/value pairs which are linked to another table, and this is the way I've been doing it so far.
First there is the item table:
CREATE TABLE items (id INTEGER PRIMARY KEY, details);
| id | details |
-+----+---------+
| 1 | 'test' |
-+----+---------+
| 2 | 'hello' |
-+----+---------+
| 3 | 'abc' |
-+----+---------+
Then there will be a table for each item:
CREATE TABLE itemkv## (key TEXT, value); -- where ## is an 'id' field in TABLE items
| key | value |
-+-----+-------+
|'abc'|'hello'|
-+-----+-------+
|'def'|'world'|
-+-----+-------+
|'ghi'| 90001 |
-+-----+-------+
This was working okay until I noticed that there is about a kilobyte of overhead for each table. If I were only dealing with a handful of items this would be acceptable, but I need a system that can scale.
Admittedly, this is the first time I've ever used anything related to SQL, so perhaps I don't know what a table is supposed to be used for, but I couldn't find any concept of a "sub-table" or "struct" data type. Theoretically, I could convert the key/value pairs into a string like "abc|hello\ndef|world\nghi|90001" and store that in a column, but it makes me wonder whether that defeats the purpose of using a database in the first place, if I'm going to the trouble of converting my structures to something that could just as easily be stored in a flat file.
I welcome any suggestions anybody has, including suggestions of a different library better suited to serialization purposes of this type.
You might try PRAGMA page_size = 512; prior to creating the db, prior to creating the first table, or prior to executing a VACUUM statement. (The manual is a bit contradictory, and it also depends on the sqlite3 version.)
I also think it's kind of rare to create tables dynamically at a high rate. It's good that you are normalizing your schema, but it's OK for columns to depend on a primary key, and while repeating groups are a sign of a lower normalization level, it's normal for foreign keys to repeat in a reasonable schema. That is, I think there is a good possibility that you need only one table of key/value pairs, with a column that identifies the client instance.
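A sketch of that single-table layout with Python's sqlite3 (the page_size pragma has to run before anything is written to the new file):

import sqlite3

conn = sqlite3.connect("archive.db")
conn.execute("PRAGMA page_size = 512")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, details TEXT)")
# One key/value table for all items, with a column identifying the owning item:
conn.execute(
    "CREATE TABLE item_kv ("
    " item_id INTEGER REFERENCES items(id),"
    " key TEXT,"
    " value)"
)
item_id = conn.execute("INSERT INTO items (details) VALUES ('test')").lastrowid
conn.executemany(
    "INSERT INTO item_kv VALUES (?, ?, ?)",
    [(item_id, "abc", "hello"), (item_id, "def", "world"), (item_id, "ghi", 90001)],
)
conn.commit()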
Keep in mind that flat files have allocation unit overhead as well. Watch what happens when I create a one byte file:
$ cat > /tmp/one
$ ls -l /tmp/one
-rw-r--r-- 1 ross ross 1 2009-10-11 13:18 /tmp/one
$ du -h /tmp/one
4.0K /tmp/one
$
According to ls(1) it's one byte, according to du(1) it's 4K.
Don't make a table per item; that's just wrong - it's similar to writing a class per item in your program. Make one table for all items, or perhaps store the common parts of all items in one table, with other tables referencing it for auxiliary information. Do yourself a favor and read up on database normalization rules.
In general, the tables in your database should be fixed, in the same way that the classes in your C++ program are fixed.
Why not just store a foreign key to the items table?
CREATE TABLE ItemsVK (ID INTEGER PRIMARY KEY, ItemID INTEGER, Key TEXT, Value TEXT);
If it's just serialization, i.e. a one-shot save to disk and then a one-shot restore from disk, you could use JSON (see the list of recommended C++ libraries).
Just serialize a data structure:
[
{'id':1,'details':'test','items':{'abc':'hello','def':'world','ghi':'90001'}},
...
]
If you want to save some bytes, you can omit the id, details, and items keys and save a list instead (in case that's a bottleneck):
[
[1,'test', {'abc':'hello','def':'world','ghi':'90001'}],
...
]
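For illustration, the one-shot save/restore shown in Python (the C++ JSON libraries expose equivalent dump/parse calls):

import json

items = [
    {"id": 1, "details": "test",
     "items": {"abc": "hello", "def": "world", "ghi": "90001"}},
]

with open("items.json", "w") as f:  # one-shot save
    json.dump(items, f)

with open("items.json") as f:       # one-shot restore
    items = json.load(f)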