Crate.io: Order of columns when creating a table - database-migration

I have CrateDB version 3.2.7 running under Windows Server 2012. I create a table like this:
create table test3 (
firstcolumn bigint primary key,
secondcolumn int,
thirdcolumn timestamp,
fourthcolumn double,
fifthcolumn double,
sixtcolumn smallint,
seventhcolumn double,
heightcolumn int,
ninthcolumn smallint,
tenthcolumn smallint
) clustered into 12 shards with (number_of_replicas = 0, refresh_interval = 0);
So I'm expecting the firstcolumn to be the first, and so on. But after the creation, when I do a SELECT * FROM test3, I get the following result:
It seems that the first column returned is the "fifth". It looks like the columns are returned in alphabetical order.
Does that mean that CrateDB created the columns in that order? Does it keep the order somewhere? If columns are in alphabetical order, does that mean that if I want to COPY data from another DBMS to CrateDB, I have to export the data in alphabetical order?

For INSERT, not necessarily: only if the column list is omitted do the values have to be in alphabetical order (see here). The order doesn't seem to be "kept" anywhere per se.
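For example, naming the columns explicitly keeps you in control of the order on both insert and select; a minimal sketch against the test3 table from the question:
-- With an explicit column list, the value order follows the list, not alphabetical order.
INSERT INTO test3 (firstcolumn, secondcolumn, thirdcolumn)
VALUES (1, 2, 1554826800000);
-- Likewise, naming the columns in the SELECT returns them in exactly that order.
SELECT firstcolumn, secondcolumn, thirdcolumn FROM test3;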
COPY FROM is a different kind of import tactic and not quite what the good old INSERT would do. I would suggest writing a command line app to import data into CrateDB. COPY FROM doesn't do any type checking, nor does it cast types; it will always import the data as it was in the source file (see here). From your other question I see you may have GPS-related data (?); as one example, you will need to manually map those to a GEO_POINT type.
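As a rough, hypothetical sketch of that GEO_POINT mapping (the table and column names here are invented), the importer would write longitude/latitude pairs into a geo_point column:
-- Hypothetical target table for GPS data; "position" is a geo_point column.
CREATE TABLE tracks (
id BIGINT PRIMARY KEY,
recorded_at TIMESTAMP,
position GEO_POINT
);
-- A geo_point can be supplied as a [longitude, latitude] array (or a WKT 'POINT (lon lat)' string).
INSERT INTO tracks (id, recorded_at, position)
VALUES (1, 1554826800000, [13.405, 52.52]);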
Crate offers good performance (whatever that means to you or me) with the bulk endpoint.

Related

Google Cloud DataPrep DATEDIF function inconsistent

I have four DateTime columns, all in long format, e.g. 2016-08-01T21:13:02Z. They are called EnqDateTime, QuoteCreatedDateTime, BookingCreatedDateTime and RejAt.
I want to add columns for the duration (in days) between EnqDateTime and the other three columns, i.e.
DATEDIF(EnqDateTime, QuoteCreatedDateTime, day)
This works for RejAt, but throws an error for all the other columns:
Parameter "rhs" accepts only ["Datetime"]
As per the image below, all four columns are DateTime.
Can anyone see any other reason this may not be working for two of the three columns?
As you can see in the image below, I reproduced a scenario like the one you presented here, and I had no issue with it. I created the three X2Y columns using the same formulas that you shared:
DATEDIF(EnqDateTime, QuoteCreatedDateTime, day)
DATEDIF(EnqDateTime, BookingCreatedDateTime, day)
DATEDIF(EnqDateTime, RejAt, day)
My guess is that, for some reason, the columns do not have an appropriate Datetime format. Maybe you can try applying some transformations to the data in order to make sure that the data contained in the columns has the appropriate format. I recommend that you try the following:
Clean all missing values by clicking on the column and then Clean > Missing > Fill with NULL. Missing values can prevent Dataprep from recognizing a data type properly.
Change the data type again to Datetime, just to double-check that there is no field that does not have the Datetime type. You can do so by clicking on the column and then Change type > Date/Time.
If these methods do not solve your issue, maybe you can try working with a minimal example, having only a few rows, so that you can narrow down the variables with which to work. Then you can update your question with more information.
It would also be nice to know where you are getting the error Parameter "rhs" accepts only ["Datetime"]. It is not clear to me what the rhs (Right Hand Side) parameter is in this case, so maybe you can also provide more details about that.

Fastest way to select several inserted rows

I have a table in a database which stores items. Each item has a unique ID, which the DB generates upon insertion (auto-increment).
A user may perform a specific task that will add X items to the database, however my program (C++ server application using MySQL connector) should return the IDs that the database generated right away. For example, if I add 6 items, the server must return 6 new unique IDs to the client.
What is the fastest/cleanest way to do this? So far I have been doing an INSERT followed by a SELECT for each new item, or an INSERT followed by last_insert_id; however, if there are 50 items to add it can take a few seconds at least, which is not good at all for the user experience.
sql_task.query("INSERT INTO `ItemDB` (`ItemName`, `Type`, `Time`) VALUES ('%s', '%d', '%d')", strName.c_str(), uiType, uiTime);
Getting the ID:
uint64_t item_id { sql_task.last_id() }; //This calls mysql_insert_id
I believe you need to rethink your design slightly. Let's use the analogy of a sales order. With a sales order (or invoice #) the user gets an invoice number (auto_incr) as well as multiple line item numbers (also auto_inc).
The sales order and all of the line items are selected for insert (from the GUI) and the inserts are performed. First, the sales order row is inserted and its id is saved in a variable for subsequent calls to insert the line items. The line items are then just inserted, without immediately returning their auto_increment id values; the application is merely returned the sales order number in the end. How your app uses that sales order number in subsequent calls is up to you, but it does not need to retrieve all X (or 50) line-item rows immediately, as it has the sales order number saved somewhere. Let's call that sales order number XYZ.
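A minimal MySQL sketch of that flow (the salesOrders and lineItems tables and their columns are invented for the analogy):
-- Insert the sales order and capture its auto_increment id once.
INSERT INTO salesOrders (customerId, createdAt) VALUES (42, NOW());
SET @orderId = LAST_INSERT_ID();
-- Insert the line items against that order; their own ids are not fetched here.
INSERT INTO lineItems (salesOrderNumber, itemName, qty) VALUES
(@orderId, 'widget', 3),
(@orderId, 'gadget', 1);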
When you actually need the information, an example call could look like
select lineItemId
from lineItems
where salesOrderNumber=XYZ
order by lineItemId
You need to remember that in a multi-user system there is no guarantee of receiving a contiguous block of numbers. Nor should it matter to you, as they are all attached appropriately to the correct sales order number.
Again, the above is just an analogy, used for illustration purposes.
That's a common but hard-to-solve problem. I'm unsure about MySQL, but PostgreSQL uses sequences to generate automatic ids. Inserting frameworks (object-relational mappers) use that when they expect to insert many values: they query the sequence directly for a bunch of IDs and then insert the new rows using those already-known IDs. That way, there is no need for an additional query after each insert to get the ID.
The downside is that the relationship between ID and insertion time can be non-monotonic when different writers intermix their inserts. It is not a problem for the database, but some (poorly written?) program could expect it to be.
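A minimal PostgreSQL sketch of that approach, assuming the items table's id column is backed by a sequence named items_id_seq:
-- Reserve a batch of ids in a single round trip (here, 6 of them).
SELECT nextval('items_id_seq') FROM generate_series(1, 6);
-- Insert the new rows with those already-known ids (101 and 102 stand in for values returned above);
-- no per-row follow-up query is needed.
INSERT INTO items (id, item_name) VALUES
(101, 'first item'),
(102, 'second item');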
As your ID is auto-incremented, you can do just two SELECT queries, one before and one after the INSERT queries:
SELECT AUTO_INCREMENT FROM information_schema.tables WHERE table_name = 'dbTable' AND table_schema = DATABASE();
--
-- INSERT INTO dbTable... (one or many, does not matter);
--
SELECT LAST_INSERT_ID() AS lastID;
This will give you the sequence between the first and last inserted IDs. Then you can easily calculate how many there are.

How to make an x-column table in MySQL?

So I want to make a table in MySQL (using C++) with about 128 columns, each one representing an INT.
I don't know the syntax to make a 129-column table (1 for the id, 128 for the ints).
Kinda like an array: int myArray[128];
CREATE TABLE SIFTFEATUES(ID INT not null, myArray[128] INT) would be ideal, or something close, where I don't have to write out each column name.
To have a table with 128 columns defined, you need to "write out" each column name. Each column in the table gets a name, a datatype, and other optional attributes.
In order to retrieve data from the column, it has to be referenced by name; the same is true for inserting and updating the contents of the column.
(I put "write out" in double quotes, because it's very simple task to generate a text file of 128 lines of a table definition that vary only by name. (It's not necessary to type each line.)
, col001 int
, col002 int
, ...
, col128 int
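Wrapped into a full statement, the generated definition would look something like this sketch (abbreviated; col003 through col127 follow the same pattern):
CREATE TABLE SIFTFEATUES (
id INT NOT NULL AUTO_INCREMENT
, col001 INT
, col002 INT
-- ... col003 through col127 are generated the same way ...
, col128 INT
, PRIMARY KEY (id)
);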
Absent any other information about your use case and what you are trying to accomplish, it's nearly impossible to make any sort of recommendation.
A table should be used for data persistence, thus not to be used like an array (with obvious exceptions). With that said, I see two scenarios:
1) You want to use it like a temporary data structure, which I strongly do not recommend, unless for medical reasons.
2) You want to keep that table, for which I'd use a text editor or even MS Excel with a macro to generate the CREATE TABLE statement with your 128 columns.

MySQL Performance issues with large amounts of data

I have a software project that I am working on at work that has been driving me crazy. Here's our problem: we have a series of data contacts that need to be logged every second. They need to include time, bearing (array of 360-1080 bytes), range, and a few other fields. Our system also needs the capability to store this data for up to 30 days. In practice, there can be up to 100 different contacts, so at a maximum, there can be anywhere from around 150,000,000 points to about 1,000,000,000 different points in 30 days.
I'm trying to think of the best method for storing all of this data and retrieving it later on. My first thought was to use some RDBMS like MySQL. Being an embedded C/C++ programmer, I have very little experience working with MySQL on such large data sets. I've dabbled with it on small datasets, but nothing nearly as large. I generated the below schema for the tables that will store some of the data:
CREATE TABLE IF NOT EXISTS `HEADER_TABLE` (
`header_id` tinyint(3) unsigned NOT NULL auto_increment,
`sensor` varchar(10) NOT NULL,
`bytes` smallint(5) unsigned NOT NULL,
PRIMARY KEY (`header_id`),
UNIQUE KEY `header_id_UNIQUE` (`header_id`),
UNIQUE KEY `sensor_UNIQUE` (`sensor`)
) ENGINE=MyISAM AUTO_INCREMENT=0 DEFAULT CHARSET=latin1;
CREATE TABLE IF NOT EXISTS `RAW_DATA_TABLE` (
`internal_id` bigint(20) NOT NULL auto_increment,
`time_sec` bigint(20) unsigned NOT NULL,
`time_nsec` bigint(20) unsigned NOT NULL,
`transverse` bit(1) NOT NULL default b'0',
`data` varbinary(1080) NOT NULL,
PRIMARY KEY (`internal_id`,`time_sec`,`time_nsec`),
UNIQUE KEY `internal_id_UNIQUE` (`internal_id`),
KEY `time` (`time_sec`),
KEY `internal_id` (`internal_id`)
) ENGINE=MyISAM AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;
CREATE TABLE IF NOT EXISTS `rel_RASTER_TABLE` (
`internal_id` bigint(20) NOT NULL auto_increment,
`raster_id` int(10) unsigned NOT NULL,
`time_sec` bigint(20) unsigned NOT NULL,
`time_nsec` bigint(20) unsigned NOT NULL,
`header_id` tinyint(3) unsigned NOT NULL,
`data_id` bigint(20) unsigned NOT NULL,
PRIMARY KEY (`internal_id`, `raster_id`,`time_sec`,`time_nsec`),
KEY `raster_id` (`raster_id`),
KEY `time` (`time_sec`),
KEY `data` (`data_id`)
) ENGINE=MyISAM AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;
The header table only contains 10 rows and is static. It just tells what sensor the raw data came from and the number of bytes output by that type of sensor. The RAW_DATA_TABLE essentially stores the raw bearing data (an array of 360-1080 bytes, representing up to three samples per degree). The rel_RASTER_TABLE holds metadata for the RAW_DATA_TABLE; there can be multiple contacts that refer to the same raw data row. The data_id found in rel_RASTER_TABLE points to the internal_id of some row in the RAW_DATA_TABLE; I did this to decrease the number of writes needed.
Obviously, as you can probably tell, I'm having performance issues when reading from and deleting from this database. An operator of our software can see real-time data as it comes across and also go into reconstruction mode and overlay a data range from the past, the past week for example. Our backend logging server grabs the history rows and sends them to a display via a CORBA interface. While all of this is happening, I have a worker thread that deletes 1000 rows at a time for data older than 30 days. This is there in case a session runs longer than 30 days, which can happen.
The system we currently have implemented works well for smaller sets of data, but not for large sets. Our select and delete statements can take upwards of 2 minutes to return results. This completely kills the performance of our real-time consumer thread. I suspect we're not designing our schemas correctly, picking the wrong keys, not optimizing our SQL queries correctly, or some subset of each. Our writes don't seem to be affected unless the other operations take too long to run.
Here is an example SQL Query we use to get history data:
SELECT
rel_RASTER_TABLE.time_sec,
rel_RASTER_TABLE.time_nsec,
RAW_DATA_TABLE.transverse,
HEADER_TABLE.bytes,
RAW_DATA_TABLE.data
FROM
RASTER_DB.HEADER_TABLE,
RASTER_DB.RAW_DATA_TABLE,
RASTER_DB.rel_RASTER_TABLE
WHERE
rel_RASTER_TABLE.raster_id = 2952704 AND
rel_RASTER_TABLE.time_sec >= 1315849228 AND
rel_RASTER_TABLE.time_sec <= 1315935628 AND
rel_RASTER_TABLE.data_id = RAW_DATA_TABLE.internal_id AND
rel_RASTER_TABLE.header_id = HEADER_TABLE.header_id;
I apologize in advance for this being such a long question, but I've tapped out other resources and this is my last resort. I figured I'd try to be as descriptive as possible. Do you guys see any way I can improve upon our design at first glance? Or any way we can optimize our select and delete statements for such large data sets? We're currently running RHEL as the OS and unfortunately can't change our hardware configuration on the server (4 GB RAM, Quad Core). We're using C/C++ and the MySQL API. ANY speed improvements would be EXTREMELY beneficial. If you need me to clarify anything, please let me know. Thanks!
EDIT: BTW, if you can't provide specific help, maybe you can link me to some excellent tutorials you've come across for optimizing SQL queries, schema design, or MySQL tuning?
The first thing you could try is de-normalizing the data. On a data set of that size, doing a join, even if you have indexes, is going to require very intense computation. Turn those three tables into one table. Sure, there will be duplicate data, but without joins it will be much easier to work with. Second, see if you can get a machine with enough memory to fit the whole table in memory. It doesn't cost much ($1,000 or less) for a machine with 24 GB of RAM. I'm not sure if that will hold your entire data set, but it will help tremendously. Get an SSD as well; for anything that isn't stored in memory, an SSD should help you access it at high speed. And third, look into other data storage technologies such as BigTable that are designed to deal with very large data sets.
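A rough sketch of what that single, denormalized table might look like (the table name CONTACT_DATA is made up; the columns are folded in from your three tables):
-- One row per contact sample; the sensor info and the raw bearing payload are
-- duplicated into each row so the history query needs no joins.
CREATE TABLE IF NOT EXISTS `CONTACT_DATA` (
`internal_id` bigint(20) NOT NULL auto_increment,
`raster_id` int(10) unsigned NOT NULL,
`time_sec` bigint(20) unsigned NOT NULL,
`time_nsec` bigint(20) unsigned NOT NULL,
`sensor` varchar(10) NOT NULL,
`bytes` smallint(5) unsigned NOT NULL,
`transverse` bit(1) NOT NULL default b'0',
`data` varbinary(1080) NOT NULL,
PRIMARY KEY (`internal_id`),
KEY `raster_time` (`raster_id`,`time_sec`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;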
I would say partitioning is an absolute must in a case like this:
large amount of data
new data coming in continuously
implicit: old data getting deleted continuously.
Check out this for MySQL.
Looking at your select statement (which filters on time), I'd say partition on the time column.
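A minimal sketch of what that could look like on rel_RASTER_TABLE (the daily boundary values below are invented epoch seconds; time_sec is already part of the primary key, which MySQL range partitioning requires):
-- Range-partition on time_sec so each day's rows live in their own partition.
ALTER TABLE rel_RASTER_TABLE
PARTITION BY RANGE (time_sec) (
PARTITION p20110912 VALUES LESS THAN (1315872000),
PARTITION p20110913 VALUES LESS THAN (1315958400),
PARTITION pMax VALUES LESS THAN MAXVALUE
);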
Of course you might want to add a few indexes based on the frequent queries you plan to use.
--edit--
I see that many have suggested indexes. My experience has been that having an index on a table with a really large number of rows either kills the performance (eventually) or requires a lot of resources (CPU, memory, ...) to keep the indexes up to date.
So although I also suggest adding indexes, please note that it's absolutely useless unless you partition the table first.
Finally, follow symcbean's advice (optimize your indexes in number and keys) when you add indexes.
--edit end--
A quickie on partitioning if you're new to it.
Usually a single table translates to a single data file. A partitioned table translates to one file per partition.
Advantages
Insertions are faster, as physically the row is inserted into a smaller file (partition).
Deletion of a large number of rows would usually translate to dropping a partition, which is much, much cheaper than 'delete from xxx where time > 100 and time < 200' (see the sketch after this list).
Queries with a where clause on the key by which the table is partitioned are much faster.
Index building is faster.
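For instance, the question's rolling 30-day cleanup could become a near-instant metadata operation instead of a batched DELETE (the partition name here is hypothetical):
-- Dropping an expired day's partition removes its rows without scanning them.
ALTER TABLE rel_RASTER_TABLE DROP PARTITION p20110812;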
I don't have much experience with MySQL, but here are some a priori thoughts that jump to mind.
Is your select in a stored procedure?
The select's predicate is usually searched in the order it's asked in. If the data on the disk is reordered to match the primary key, then doing raster_id first is fine. You would be paying the cost of reordering on every insert, though. If the data is stored in time order on disk, you would probably want to search on time_sec before raster_id.
WHERE
rel_RASTER_TABLE.raster_id = 2952704 AND
rel_RASTER_TABLE.time_sec >= 1315849228 AND
rel_RASTER_TABLE.time_sec <= 1315935628 AND
rel_RASTER_TABLE.data_id = RAW_DATA_TABLE.internal_id AND
rel_RASTER_TABLE.header_id = HEADER_TABLE.header_id;
Your indexes don't follow the search predicates.
It will create indexes based on the keys, generally.
PRIMARY KEY (`internal_id`, `raster_id`,`time_sec`,`time_nsec`),
KEY `raster_id` (`raster_id`),
KEY `time` (`time_sec`),
KEY `data` (`data_id`)
It may not be using the primary index because you aren't using internal_id. You may want to set internal_id as the primary key and create a separate index based on your search parameters. At least on raster_id and time_sec.
Are the joins too loose?
This may be my inexperience with MySQL, but I expect to see conditions on the joins. Does using FROM here do a natural join? I don't see any foreign keys specified, so I don't know how it would join these tables rationally.
FROM
RASTER_DB.HEADER_TABLE,
RASTER_DB.RAW_DATA_TABLE,
RASTER_DB.rel_RASTER_TABLE
Usually when developing something like this I would work with a smaller set and remove predicates to make sure that each step meets what I expect. If you accidentally cast a wide net up front, then narrow down later, you may mask some inefficiencies.
Most query optimizers have a way to output how they optimized the query; make sure it meets your expectations. One of the comments mentions explain plans, and I assume that is what it is called.
Without knowing what all the queries are, it's difficult to give specific advice; however, looking at the single query you have provided, there are no indexes which are ideally suited to resolving it.
In fact the structure is a bit messy - if internal_id is an auto-increment value then it is unique - why add other stuff to the primary key? It looks as if a more sensible structure for rel_RASTER_TABLE would be:
PRIMARY KEY (`internal_id`),
KEY (`raster_id`,`time_sec`,`time_nsec`),
And as for RAW_DATA_TABLE, it should be blindingly obvious that its indexes are far from optimal. They should probably be:
PRIMARY KEY (`internal_id`,`time_sec`,`time_nsec`),
KEY `time` (`time_sec`, `time_nsec`)
Note that removing redundant indexes will speed up inserts/updates.
Capturing slow queries should help - and learn how to use 'explain' to see what indexes are redundant / needed.
You may also get a performance boost by tuning the mysql instance - particularly increasing the sort and join buffers - try running mysqltuner
First, I would try to create a view with only the necessary info that needs to be selected across the different tables.
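A hedged sketch of such a view (the name raster_history is made up), pre-joining just the columns the history query selects:
-- View exposing only the columns the history query needs.
CREATE VIEW raster_history AS
SELECT
r.raster_id,
r.time_sec,
r.time_nsec,
d.transverse,
h.bytes,
d.data
FROM rel_RASTER_TABLE r
JOIN RAW_DATA_TABLE d ON r.data_id = d.internal_id
JOIN HEADER_TABLE h ON r.header_id = h.header_id;
Queries would then filter the view on raster_id and a time_sec range, just as the original statement does.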
By the way, MySQL is not necessarily the most optimized database system for what you are trying to accomplish... Look into other solutions such as Oracle, Microsoft SQL Server, PostgreSQL, etc. Also, the performance will vary depending on the server being used.

How do I create a CAST in Informix to cast an LVARCHAR to TEXT?

What built-in routine can I make use of to cast data of type LVARCHAR to data of type TEXT?
The larger context: I have a table with a column that has been defined as LVARCHAR(4096). Now a developer wishes to change the data type of this column to TEXT. Ideally this would be done with:
ALTER TABLE foo MODIFY bar TEXT;
...but in such a case the following error is puked to the screen:
ALTER TABLE can not modify column (bar) type. Need a cast from the current type to the new type.
I have read up on the CREATE CAST construction, but I cannot begin to think what on earth the proper conversion function would look like. Without a function, Informix will not allow the CREATE CAST to work. That is, if I do, simply:
CREATE CAST (LVARCHAR AS TEXT)
...Informix tells me that a cast function is required (which makes sense).
Beware, Informix developers: if you inadvertently run into a problem like this, there is no way to get out of it using SQL or DDL alone. Let me repeat that.
If you have a VARCHAR or an LVARCHAR column that you need to migrate to be a TEXT column, and if you cannot afford to lose data in that column, there is no way to do this in SQL or DDL.
Instead, you must write a program that does the conversion for you inside the database driver, in memory. In my case, I used JDBC mutable result sets and copied the column to a new column, letting the JDBC driver perform the conversion, then dropped the old column and renamed the new column back to the old column's name. This general pattern is the only way to migrate existing character data into a TEXT column.
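The DDL bookends of that pattern look roughly like the sketch below (Informix syntax from memory, so treat it as an outline; the copy in step 2 has to happen in the driver, not in SQL):
-- 1. Add a new simple-large-object column alongside the existing LVARCHAR column.
ALTER TABLE foo ADD bar_text TEXT;
-- 2. (In the application / JDBC driver) read foo.bar row by row and write it into foo.bar_text.
-- 3. Drop the old column and rename the new one back to the original name.
ALTER TABLE foo DROP bar;
RENAME COLUMN foo.bar_text TO bar;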
@Storm: Which version of IDS/ODBC are you using? AFAIK, IDS 9 or 10 can't do that without using specific embedded C in the server (see the Boulder site); there is no way you can do that directly through SQL. Blob-related functions or so.
The other way is by using UNLOAD / LOAD.
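Roughly, from dbaccess, that route looks like the sketch below as I recall it (the file path is just an example; the table has to be rebuilt with bar defined as TEXT between the two steps):
-- Export the existing rows to a delimited flat file.
UNLOAD TO '/tmp/foo.unl' SELECT * FROM foo;
-- After recreating foo with bar as TEXT, pull the rows back in.
LOAD FROM '/tmp/foo.unl' INSERT INTO foo;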
In my scenario, we have lots of problems: we have no admin rights to the enterprise server, and as we are service providers, we can only use the database but cannot modify its structure. We cannot modify TEXT fields just by launching queries.