In Apache Arrow, how can I update/delete data in vectors - apache-arrow

In Apache Arrow, how can I update or delete data in a vector based on conditions? What is the best practice for doing this?

Related

DynamoDB Indexing Assistance and Getting My Data Out

I'll preface all of this by saying I’m still actively learning DynamoDB, and I think an answer to my question will help me understand a few things.
I have an analytics microservice that pushes custom (internal) analytics events into a DynamoDB table. Columns in our Dynamo rows/items include data like:
User ID
IP Address
Event Action
Timestamp
Split Test ID
Split Test Value
One of the main questions we want to answer from this database is:
"How many users saw split test x with values y?"
I’m struggling to understand how I should index my database to support this kind of request. I set up a “Keys Only” index targeting Split Test ID, and the query against it is fairly efficient, but it only returns User ID and Split Test ID. Ideally I want an efficient query that returns several other associated values as well…
How do I achieve this? Do I need to be doing something very different? Additionally, if any of my understanding of Dynamo, based on my explanations, sounds completely lacking in some regard, please point me in the right direction!
You're thinking of DynamoDB as a schema-less database, which it obviously is. However, that does not mean that a schema is not important. Schemas in NoSQL databases are usually more important than they are in SQL databases and they are usually less straightforward.
The most important thing to determine how you will store your data is how you will access it. You will have to take into account all the ways that you will want to access your data and ensure it is possible by creating the necessary data columns and necessary indexes. In this case, if you want to know how many times two values are combined in a certain way, you could easily add a column that has these combined values (e.g., splitId#splitValue ) and use that in your indexes.
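A minimal sketch of the combined-attribute idea, in plain Python with no DynamoDB calls. The event fields, the sample values, and the helper names `split_key` and `index_by_split_key` are assumptions made for illustration; the dictionary only stands in for an index keyed on the combined value.

```python
from collections import defaultdict


def split_key(split_test_id: str, split_test_value: str) -> str:
    """Build the combined attribute, e.g. 'checkout-button#green'."""
    return f"{split_test_id}#{split_test_value}"


# Hypothetical analytics events shaped like the items described in the question.
events = [
    {"user_id": "u1", "split_test_id": "checkout-button", "split_test_value": "green"},
    {"user_id": "u2", "split_test_id": "checkout-button", "split_test_value": "green"},
    {"user_id": "u1", "split_test_id": "checkout-button", "split_test_value": "green"},
    {"user_id": "u3", "split_test_id": "checkout-button", "split_test_value": "blue"},
    {"user_id": "u2", "split_test_id": "pricing-page", "split_test_value": "annual"},
]


def index_by_split_key(items):
    """In-memory stand-in for an index: combined key -> set of user IDs."""
    index = defaultdict(set)
    for item in items:
        key = split_key(item["split_test_id"], item["split_test_value"])
        index[key].add(item["user_id"])
    return index


index = index_by_split_key(events)

# "How many users saw split test x with value y?" becomes a single key lookup.
users_who_saw_green = len(index[split_key("checkout-button", "green")])
```

Because the combined value is written on every item, the count needs only one lookup on that key instead of a filter over two separate attributes.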
If you want to know more about advanced patterns and such, I advise you to watch this pretty famous re:Invent talk by Rick Houlihan or to read the DynamoDB book.
As a last note, I want to add that switching to a SQL server usually is not the solution. Picking NoSQL over SQL is usually based on non-functional requirements. There is a reason NoSQL databases are used in applications that require very low-latency retrieval of data in huge datasets, but as with everything, trade-offs are the name of the game.

ODBC Equivalent of DBMS_ALERT in Oracle

Is there anything (system procedure,function or other) in SQL Server that will provide the functionality of DBMS_ALERT package of ORACLE (and DBMS_PIPE respectively)?
I work in a plant and I'm using an extension product of SQL Server called InSQL Server by Wonderware, which specializes in gathering data from plant controllers and Human Machine Interface (SCADA) software.
This system can record events happening in the plant (like a high-temperature alarm, for example). It stores sensor values in extension tables of SQL Server, and other less dense information in normal SQL Server tables.
I want to be able to alert some applications running on operator PCs that an event has been recorded in the database.
An after insert trigger in the events table seems to be a good place to put something equivalent to DBMS_ALERT (if it exists), to wake up other applications that are waiting for the specific alert and have the operators type in some data.
In other words - I want to be able to notify other processes (that have connection to SQL Server) that something has happened in the database.
All Wonderware (InSQL but now called Aveva) Historian data is stored in the history blocks EXCEPT for the actual tag storage configuration and dedicated event data. The time series data for analog, discrete and strings is NOT in SQL tables at all - unless someone is doing custom configuration to create tables of their own.
Where do you want these notifications to come up? Even though the historical data is NOT stored in SQL tables, Wonderware has extensive documentation on how to use SQL queries to retrieve the data appropriately (and check for whatever condition you are looking for).
You can easily build a stored procedure and configure it for a maintenance plan.
But are you just trying to alarm (provide notification) on the scada itself?
Or are you truly utilizing historical data (looking for a data trend - average, etc.)?
Or trying to send the notification to non-scada interfaces?
Depending on your specific answer, the scada itself should probably be able to do it.
But there is software that already does this type of thing: Win-911, SeQent and Scadatec are a few in the OT space, and there are also things like Hip Link or even DeskAlert, which can connect to any SQL database via its own API.
So where does the info need to go (email, text, phone, desktop app...) and what is the real source of the data?

Amazon Redshift schema design

We are looking at Amazon Redshift to implement our Data Warehouse and I would like some suggestions on how to properly design Schemas in Redshift, please.
I am completely new to Redshift. In the past when I worked with "traditional" data warehouses, I was used to creating schemas such as "Source", "Stage", "Final", etc. to group all the database objects according to what stage the data was in.
By default, a database in Redshift has a single schema, which is named PUBLIC. So, my question to those who have worked with Redshift: does the approach that I have outlined above apply here? If not, I would love some suggestions.
Thanks.
With my experience in working with Redshift, I can assert the following points with confidence:
Multiple schemas: You should create multiple schemas and create tables in them accordingly. When you scale, it'll be easier for you to pinpoint exactly where a table is supposed to be. Let us say you have 3 schemas, named production, aggregates and rough. Now, you know that the production schema will contain the tables that are not supposed to be changed (mostly OLTP data), such as the user, order and transactions tables. The aggregates schema will have aggregated data built over the raw tables, such as the number of orders placed per user per day per category. Finally, rough will contain any table that doesn't hold business logic but is required for some temporary work. Let us say you need to check the genre of movies for a list of 100,000 users that is shared with you in an Excel file: simply create a table in the rough schema, perform your operations and drop the table. Now you know very clearly where you'll find the tables based on whether they are raw, aggregated or simply temporary tables.
Public schema: Forget it exists. Any table that is not prefixed with a schema name gets created there. It ends up with a lot of clutter, so there is no point in storing any important data there.
Cross schema joins: There's no stopping here. You may join as many tables from as many schema as required. In fact, it is desirable you create dimension tables and join on a PK later, rather than to keep all the information in a single table.
Spend some quality time in designing the schema and underlying table structure. When you expand, it'll be easier for you to classify things better in terms of access control. Do let me know if I've missed some obvious points.
You can have multiple databases in a Redshift cluster but I would stick with one. You are correct that schemas (essentially namespaces) are a good way to divide things up. You can query across schemas but not databases.
I would avoid using the public schema as managing certain permissions there can be difficult (easier to deny someone access to public than prevent them from being able to create a table for example).
For best results if you have the time, learn about the permissions system up front. You want to create groups that have access to schemas or tables and add/remove users from groups to control what they can do. Once you have that going it becomes pretty easy to manage.
In addition to the other responses, here are some suggestions for improving schema performance.
First: Automatic compression encodings using COPY command
Load data into your Redshift database with the COPY command; it can also improve performance. COPY automatically chooses appropriate compression encodings for the data it loads, so you don’t have to think about it. However, it does so only for the first load into an empty table.
So, make sure the first load uses a significant data set that Redshift can assess in order to set the column encodings well. With only a few lines of test data, Redshift does not have enough information to choose the compression that best suits the real workload.
Second: Use Best Distribution Style and Key
The distribution style decides how a table's rows are distributed across the nodes. Setting a distribution style at the table level tells Redshift how you want the table distributed and on which key. How you specify it matters for query performance: the style you choose can affect data storage requirements and cluster sizing, as well as the time the COPY command takes to run.
I recommend setting the distribution style to ALL for tables that hold smaller dimensions. For a large dimension, distribute both the dimension and its associated fact table on their join column. To optimize a second large dimension, take the storage hit and distribute it with ALL. You can even design the dimension columns into the fact table.
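The difference between distributing on a join column and distributing with ALL can be sketched in plain Python. This is only an illustration: `zlib.crc32` stands in for whatever hash Redshift actually uses, and the tables, column names and node count are made up.

```python
import zlib


def node_for_key(join_key, node_count):
    """Map a distribution-key value to a node (illustrative hash, not Redshift's)."""
    return zlib.crc32(str(join_key).encode("utf-8")) % node_count


def distribute_by_key(rows, key_name, node_count):
    """KEY-style placement: rows with the same key value land on the same node."""
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        nodes[node_for_key(row[key_name], node_count)].append(row)
    return nodes


def distribute_all(rows, node_count):
    """ALL-style placement: every node keeps a full copy of the table."""
    return [list(rows) for _ in range(node_count)]


NODE_COUNT = 3  # hypothetical cluster size

# Hypothetical large dimension and fact table, joined on customer_id.
customers = [{"customer_id": i, "name": f"customer-{i}"} for i in range(1, 11)]
orders = [{"order_id": n, "customer_id": (n % 10) + 1} for n in range(1, 41)]
# Hypothetical small dimension, cheap enough to copy everywhere.
regions = [{"region_id": 1, "name": "EU"}, {"region_id": 2, "name": "US"}]

customer_nodes = distribute_by_key(customers, "customer_id", NODE_COUNT)
order_nodes = distribute_by_key(orders, "customer_id", NODE_COUNT)
region_nodes = distribute_all(regions, NODE_COUNT)
```

With both tables placed by `customer_id`, every order sits on the same node as its customer row, so the join needs no data movement between nodes. The small table costs extra storage on each node but is always local.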
Third: Use the Best Sort Key
If a sort key is specified, Redshift stores the table's data in sort-key order. The data is sorted within each partition, so each cluster node keeps its partition in that predefined order. (While designing your Redshift schema, also consider the impact on your budget. Redshift is priced by amount of stored data and by the number of nodes.)
A sort key can improve Amazon Redshift performance significantly, in more than one way. First, data filtering: if a WHERE clause filters on a sort key column, Redshift can skip entire data blocks. This works because Redshift stores data in blocks, and each block header records the minimum and maximum sort key value. If the filter falls outside that range, the entire block can be skipped.
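The block-skipping idea can be sketched in a few lines of Python. This is a toy model, not Redshift's storage format: the block size, the dictionary layout and the function names are assumptions for illustration.

```python
def build_blocks(sorted_values, block_size):
    """Split sorted values into blocks that record their min and max value."""
    blocks = []
    for start in range(0, len(sorted_values), block_size):
        chunk = sorted_values[start:start + block_size]
        blocks.append({"min": chunk[0], "max": chunk[-1], "rows": chunk})
    return blocks


def range_scan(blocks, low, high):
    """Return values in [low, high] and the number of blocks actually read."""
    matches = []
    blocks_read = 0
    for block in blocks:
        # The header alone proves that this block cannot contain a match.
        if block["max"] < low or block["min"] > high:
            continue
        blocks_read += 1
        matches.extend(v for v in block["rows"] if low <= v <= high)
    return matches, blocks_read


# 100 sort-key values stored in 10 blocks of 10 values each.
blocks = build_blocks(list(range(100)), block_size=10)
matches, blocks_read = range_scan(blocks, 42, 47)
```

Here the filter touches one block out of ten. On unsorted data the min/max ranges of the blocks would overlap and far fewer blocks could be skipped.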
Alternatively, when two tables are sorted on their join keys, the data is read in matching order, so Redshift can merge-join them without separate sort steps. This helps when joining a large dimension to a large fact table, because neither will fit into a hash table.
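For readers unfamiliar with merge joins, here is a minimal Python sketch of the technique on two inputs that are already sorted by the join key. It illustrates the general algorithm only, not Redshift's implementation, and the sample rows are made up.

```python
def merge_join(left, right):
    """Inner-join two lists of (key, payload) tuples that are sorted by key."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == left_key:
                for k in range(j, j_end):
                    result.append((left_key, left[i][1], right[k][1]))
                i += 1
            j = j_end
    return result


# Hypothetical dimension and fact rows, both sorted on the join key.
dimension = [(1, "alice"), (2, "bob"), (4, "dave")]
fact = [(1, 9.99), (1, 5.00), (3, 1.25), (4, 20.00)]
joined = merge_join(dimension, fact)
```

Each input is walked once from start to end and no hash table is built, which is why pre-sorted inputs matter when both tables are large.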

Is there any webapp for hbase data visualization

I want to view my HBase data through a browser. Is there any tool for this purpose, so that I can view and manipulate data and get some results? The HBase shell interface is not easy to understand, as I am new to HBase.
If you can choose which distribution to use, I recommend Cloudera. There is a good HBase data browser inside Hue. It supports filtering by families/columns and searching in an HBase table.
It is also possible to add data to cells and to edit it.

Coldfusion: Move data from one datasource to another

I need to move a series of tables from one datasource to another. Our hosting company doesn't give shared passwords amongst the databases so I can't write a SQL script to handle it.
The best option is just writing a little ColdFusion script that takes care of it.
Ordinarily I would do something like:
SELECT * INTO database.table FROM database.table
The only problem with this is that cfquery doesn't allow you to use two datasources in the same query.
I don't think I could use a QoQ either, because you can't tell it to use the second datasource while also having a dbType of 'Query'.
Can anyone think of any intelligent ways of getting this done? Or is the only option to just loop over each line in the first query adding them individually to the second?
My problem with that is that it will take much longer. We have a lot of tables to move.
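To make the cost of the row-by-row approach concrete, here is a sketch of copying a table between two separate connections in batches instead of one insert per row. It is written in Python with two in-memory SQLite databases purely for illustration; it is not ColdFusion, and the table, columns and `copy_table` helper are hypothetical.

```python
import sqlite3


def copy_table(src, dst, table, columns, batch_size=500):
    """Copy rows from src to dst in batches; returns the number of rows copied.

    The table and column names are interpolated into the SQL, so they must
    come from trusted code, never from user input.
    """
    col_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    insert_sql = f"INSERT INTO {table} ({col_list}) VALUES ({placeholders})"
    cursor = src.execute(f"SELECT {col_list} FROM {table}")
    copied = 0
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        # One call per batch instead of one statement per row.
        dst.executemany(insert_sql, rows)
        copied += len(rows)
    dst.commit()
    return copied


# Two independent connections stand in for the two datasources.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for conn in (source, target):
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
source.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(i, f"customer-{i}") for i in range(1, 1201)],
)
source.commit()

rows_copied = copy_table(source, target, "customers", ["id", "name"])
```

The same shape (read a chunk from the first connection, write it to the second in one call, commit once) is the usual way to soften the per-row overhead when a single cross-database statement is not available.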
Ok, so you don't have a shared password between the databases, but you do seem to have the passwords for each individual database (since you have datasources set up). So, can you create a linked server definition from database 1 to database 2? User credentials can be saved against the linked server, so they don't have to be the same as the source DB. Once that's set up, you can definitely move data between the two DBs.
We use this all the time to sync data from our live database into our test environment. I can provide more specific SQL if this would work for you.
You CAN access two databases, but not two datasources in the same query.
I wrote something a few years ago called "DataSynch" for just this sort of thing.
http://www.bryantwebconsulting.com/blog/index.cfm/2006/9/20/database_synchronization
Everything you need for this to work is included in my free "com.sebtools" package:
http://sebtools.riaforge.org/
I haven't actually used this in a few years, but I can't think of any reason why it wouldn't still work.
Henry - why do any of this? Why not just use SQL manager to move over the selected tables using the "import data" function? Right-click on your DB and choose "import", then use the native client and permissions for the "other" database to specify the tables. Your SQL manager will need to have access to both DBs, but the DB servers themselves do not need access to each other. Your management studio will serve as a conduit.