I am not sure whether this relates to "bitwise" operations. I store music files that are provided in different formats, e.g. MP3, WAV, MIDI, and I need to record in the DB which formats are available. One solution is to create an individual DB field/column for each format, e.g. withMP3, withWav, withMidi. But whenever I add one more format, I need to create an extra column.
Is there a standard solution for storing the formats in one field? For example, the first digit indicates MP3, the second digit indicates WAV, and so on. When I add one more file format, I would just append one more bit to the data, with no need to add a new column. I am not sure which topic this question falls under. I hope someone can help me.
Many thanks!!
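To make the "one bit per format" idea concrete, here is a minimal sketch using Python's enum.IntFlag. The class name AudioFormat and the bit assignments are assumptions for illustration; the resulting integer is what would go into a single integer column.

```python
from enum import IntFlag


class AudioFormat(IntFlag):
    """One bit per format; a new format only needs a new bit value."""
    MP3 = 1
    WAV = 2
    MIDI = 4


# Combine the available formats into one integer for a single DB column.
stored = int(AudioFormat.MP3 | AudioFormat.MIDI)

# Decode the integer after reading it back from the database.
available = AudioFormat(stored)
has_wav = AudioFormat.WAV in available
has_midi = AudioFormat.MIDI in available
```

Adding a format later means adding one more member (e.g. a hypothetical FLAC = 8); existing stored values stay valid.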
Turn that data into its own table (id, format, blob); then you can associate its rows with the rows in the other table via a third, linking table. That way the schema is independent of the number of formats.
I'm not sure why you are trying to store this information as separate fields. I would just store the MIME type; that is normally enough information for a normal database.
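A minimal sketch of that layout, using Python's built-in sqlite3 module as a stand-in for whatever database is actually in use. The table and column names (track, media_file, track_media) are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE track (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    -- One row per stored file: (id, format, blob).
    CREATE TABLE media_file (
        id     INTEGER PRIMARY KEY,
        format TEXT NOT NULL,
        data   BLOB NOT NULL
    );
    -- Linking table that associates tracks with their files.
    CREATE TABLE track_media (
        track_id INTEGER NOT NULL REFERENCES track(id),
        file_id  INTEGER NOT NULL REFERENCES media_file(id),
        PRIMARY KEY (track_id, file_id)
    );
""")

conn.execute("INSERT INTO track (id, title) VALUES (1, 'Demo song')")
# A new format such as 'midi' is just another row, not a schema change.
for file_id, fmt in enumerate(["mp3", "wav", "midi"], start=1):
    conn.execute(
        "INSERT INTO media_file (id, format, data) VALUES (?, ?, ?)",
        (file_id, fmt, b"\x00"),  # placeholder bytes instead of real audio
    )
    conn.execute(
        "INSERT INTO track_media (track_id, file_id) VALUES (?, ?)",
        (1, file_id),
    )

# Which formats exist for track 1?
formats = [
    row[0]
    for row in conn.execute(
        "SELECT f.format FROM media_file AS f "
        "JOIN track_media AS tm ON tm.file_id = f.id "
        "WHERE tm.track_id = ? ORDER BY f.format",
        (1,),
    )
]
```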
I was thinking about the difference between those two approaches.
Imagine you must handle information about pattern calls, which should later be displayed to the user. A pattern call is a tuple consisting of a unique integer identifier ("id"), a user-defined name ("name"), a project-relative path to the so-called pattern file ("patternFile") and a convenience flag, which states whether the pattern should be called or not. The number of tuples is not known in advance, and they won't be modified after initialization.
I thought that in this case a column-based approach (with BigQuery, for example) would be better in terms of I/O and performance as well as the evolution of the schema. But I actually can't explain why. I would appreciate any help.
Amazon S3 is like a large key-value store. The Key is the filename (with full path) and the Value is the contents of the file. It's just a blob of data.
A columnar data store organizes data in such a way that specific data can be "jumped to", and only desired values need to be read from disk.
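As a rough in-memory illustration of why only the desired values need to be read, the sketch below lays out the same pattern-call tuples row by row and column by column. The sample values are invented, and plain Python lists stand in for what would be on-disk storage.

```python
# Row-oriented layout: every record keeps all of its fields together.
rows = [
    (1, "login", "patterns/login.pat", True),
    (2, "logout", "patterns/logout.pat", False),
    (3, "search", "patterns/search.pat", True),
]

# Column-oriented layout: every field keeps all of its values together.
columns = {
    "id": [r[0] for r in rows],
    "name": [r[1] for r in rows],
    "patternFile": [r[2] for r in rows],
    "enabled": [r[3] for r in rows],
}

# Question: how many patterns are flagged to be called?
# Row layout: each complete tuple has to be visited.
enabled_from_rows = sum(1 for r in rows if r[3])

# Column layout: only the "enabled" values are touched.
enabled_from_columns = sum(columns["enabled"])
```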
If you want to search the data, then some form of logic has to be applied to it. This could be done by storing the data in a database (typically in a proprietary format) or by using a columnar storage format such as Parquet or ORC plus a query engine that understands this format (e.g. Amazon Athena).
The difference between S3 and columnar data stores is like the difference between a disk drive and an Oracle database.
I've been experimenting with Apache Arrow. I have used column-oriented memory-mapped files for many years. In the past, I've used a separate file for each column. Arrow seems to like to store everything in one file. Is there a way to add a new column without rewriting the entire file?
The short answer is probably no.
Arrow's in-memory format & libraries support this. You can add a chunked array to a table by just creating a new table (this should be zero-copy).
However, it appears you are talking about storing tables in files. None of the common file formats in use (parquet, csv, feather) support partitioning a table in this way.
Keep in mind, if you are reading a parquet file, you can specify which column(s) you want to read and it will only read the necessary data. So if your goal is only to support individual column retrieval/query then you can just build one large table with all your columns.
From a template users can upload a csv file which gets parsed in
def parseCSV(request):
    # magic happens here (conforming date formats and all such fun things)
    # return column names to template
    ...
This view returns a list of columns and the user is asked to pick x columns to save.
The user's choice is posted to
def saveCSV(request):
    # logic for saving
    ...
Now my question is: how do I most correctly handle the CSV data object between view 1 and view 2? Do I save it as a temporary file, or do I send it back and forth (view1 -> template -> view2) as a data object? Or maybe some third option?
There is no "correct" way as it all depends on the concrete situation. In this case, it depends on the size of the data from the CSV file. Given that the data is rather large, the best approach is most likely to store the parsed data on the server, and then in the next request only send the user's selection of the full data set.
I would suggest parsing the data and storing it as a JSON blob in the database, so that you can easily retrieve it for the next request. This way the client only has to send the user's selection of rows and columns (or "coordinates"), which you then save as real data. The benefit of storing it right away is that the user can return to the process even after leaving the flow. The downside is that you keep unused data if the user never completes the process, and you might need to clear it later. If you store it in a table containing only temporary data, cleaning up should be easier.
I would parse the CSV file at the frontend and give the user an option to choose columns. After the columns are chosen, I would send those columns with their values to the backend.
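The store-then-select flow could look roughly like the sketch below. It is framework-free on purpose: sqlite3 stands in for the real database, and the names pending_upload, parse_and_stash and save_selection are hypothetical, not Django APIs.

```python
import csv
import io
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Table that only ever holds temporary, not-yet-confirmed uploads.
conn.execute(
    "CREATE TABLE pending_upload (id INTEGER PRIMARY KEY, payload TEXT NOT NULL)"
)


def parse_and_stash(csv_text):
    """First request: parse the CSV, stash it as JSON, return (id, column names)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    cur = conn.execute(
        "INSERT INTO pending_upload (payload) VALUES (?)", (json.dumps(rows),)
    )
    return cur.lastrowid, list(reader.fieldnames)


def save_selection(upload_id, chosen_columns):
    """Second request: keep only the chosen columns and drop the temporary row."""
    (payload,) = conn.execute(
        "SELECT payload FROM pending_upload WHERE id = ?", (upload_id,)
    ).fetchone()
    kept = [{name: row[name] for name in chosen_columns} for row in json.loads(payload)]
    conn.execute("DELETE FROM pending_upload WHERE id = ?", (upload_id,))
    return kept  # a real view would write these rows to the permanent tables


upload_id, names = parse_and_stash("date,amount,note\n2020-01-01,5,a\n2020-01-02,7,b\n")
saved = save_selection(upload_id, ["date", "amount"])
remaining = conn.execute("SELECT COUNT(*) FROM pending_upload").fetchone()[0]
```

Only the upload id and the list of chosen column names travel between the two requests; the parsed data itself stays on the server.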
I'm trying to build a web application where users can upload a file (specifically the MDF file format) and view the data in forms of various charts. The files can contain any number of time based signals (various numeric data types) and users may name the signals wildly.
My thought on saving the data involves 2 steps:
Maintain a master table as an index, to save such meta information as file names, who uploaded it, when, etc. Records (rows) are added each time a new file is uploaded.
Create a new table (I'll refer to this as data tables) for each file uploaded, within the table each column will be one signal (first column being timestamps).
This brings the problem that I can't pre-define the Model for the data tables because the number, name, and datatype of the fields will differ among virtually all uploaded files.
I'm aware of some libs that help to build runtime dynamic models but they're all dated and questions about them on SO basically get zero answers. So despite the effort to make it work, I'm not even sure my approach is the optimal way to do what I want to do.
I also came across this Postgres-specific model field, which can take nested arrays (which I believe fits the 2-D lists of time-based signals). In theory I could parse the raw uploaded file, construct such an array, and save all the data in one field. Since I don't know the upper limit on data size, this could also be a nightmare for later queries: creating a chart usually takes only a few signal columns at a time, out of a total of up to hundreds of signals.
So my question is:
Is there a better way to organize the storage of data? And how?
Any insight is greatly appreciated!
If the name, number and datatypes of the fields differ for each user, then an ORM is not what you need. What you need is a query builder or SQL string composition, such as the one Psycopg provides. You will be programmatically creating a table for each combination of user and uploaded file (if they are different) and programmatically inserting the records.
Using PostgreSQL might be a good choice; you might also create a GIN index on the arrays to speed up queries.
However, if you are primarily working with time-series data, then a time-series database such as InfluxDB or Prometheus makes more sense.
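As a sketch of what creating and filling tables programmatically can look like, the example below uses sqlite3 as a stand-in so that it runs without a server; with Psycopg the identifier quoting would normally be left to its SQL composition helpers instead of the hand-written quote_ident shown here. All function names, the table name and the signal names are assumptions, and every signal is stored as REAL for simplicity.

```python
import sqlite3


def quote_ident(name):
    """Quote a user-supplied name so it can be used as an SQL identifier."""
    return '"' + name.replace('"', '""') + '"'


def create_data_table(conn, table_name, signal_names):
    """Create one table per uploaded file: a timestamp plus one column per signal."""
    cols = ", ".join(f"{quote_ident(n)} REAL" for n in signal_names)
    conn.execute(
        f'CREATE TABLE {quote_ident(table_name)} ("timestamp" REAL NOT NULL, {cols})'
    )


def insert_samples(conn, table_name, signal_names, samples):
    """Insert rows shaped as (timestamp, value_1, ..., value_n)."""
    col_list = ", ".join(['"timestamp"'] + [quote_ident(n) for n in signal_names])
    # Values are always passed as parameters, never formatted into the SQL text.
    placeholders = ", ".join("?" for _ in range(len(signal_names) + 1))
    conn.executemany(
        f"INSERT INTO {quote_ident(table_name)} ({col_list}) VALUES ({placeholders})",
        samples,
    )


conn = sqlite3.connect(":memory:")
signals = ["Engine Speed [rpm]", 'odd "name"']
create_data_table(conn, "file_42", signals)
insert_samples(conn, "file_42", signals, [(0.0, 900.0, 1.0), (0.1, 950.0, 2.0)])

# A chart usually needs only a few signals, so select just those columns.
rows = conn.execute(
    f'SELECT "timestamp", {quote_ident(signals[0])} FROM "file_42" ORDER BY "timestamp"'
).fetchall()
```

A signal that happens to be named "timestamp" would collide with the fixed first column, so a real implementation would have to rename or prefix signal columns.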
I am writing a C++ project that uses a PostgreSQL database.
I created a table in my database with a column of type character varying(40).
Now I need to SELECT this data FROM the table in my C++ project. I know that I should use the libpq library, which is PostgreSQL's interface for C/C++.
I have succeeded in selecting data from the table. Now I am wondering whether it's possible to get the data type of a column in this table. For example, here I want to get character varying(40).
You need to use PQftype.
As described here: http://www.idiap.ch/~formaz/doc/postgreSQL/libpq-chapter17861.htm
And just take a look here about decoding return values: http://www.postgresql.org/message-id/da7021e0608040738l3b0880a1q5a76b838937f8c78#mail.gmail.com
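You can also use PQfsize to get the field's storage size. Note that for variable-length types such as character varying it returns a negative value; the declared length (40 here) is carried in the type modifier returned by PQfmod.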
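For decoding the return values: PQftype gives a type OID and PQfmod gives a type modifier, and the two together are enough to rebuild a string like character varying(40). The sketch below shows only that arithmetic, in Python for brevity. The lookup table is a small hand-picked subset of pg_type and describe_column is a hypothetical helper; in practice the server-side format_type(oid, typmod) function does the same for every type.

```python
# Subset of type OIDs from PostgreSQL's pg_type catalog.
TYPE_NAMES = {
    16: "boolean",
    20: "bigint",
    23: "integer",
    25: "text",
    701: "double precision",
    1042: "character",
    1043: "character varying",
}

# For character types the modifier is the declared length plus a 4-byte header.
VARHDRSZ = 4


def describe_column(type_oid, type_modifier):
    """Turn (PQftype result, PQfmod result) into a readable type name."""
    name = TYPE_NAMES.get(type_oid, f"unknown (oid {type_oid})")
    if type_oid in (1042, 1043) and type_modifier >= VARHDRSZ:
        return f"{name}({type_modifier - VARHDRSZ})"
    return name


varchar_40 = describe_column(1043, 44)   # a character varying(40) column
plain_int = describe_column(23, -1)      # -1 means "no modifier"
```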