Large Volume Excel Data Pulls - Avoiding ODBC - web-services

We have a requirement to give users ad-hoc access to large subsets of a system's data for analysis in Excel.
We do not want to grant direct ODBC access, because that would curb our ability to make DB layout changes without breaking our users' processes.
Web services seem ill suited to the volume of data at stake, in the region of hundreds of thousands of records.
What would you suggest as an alternative to direct ODBC access?

There is a database concept of a "view" which does exactly what you need - it lets you expose a large set of data while keeping the freedom to change the DB schema, as long as you take care to keep exposing the same data to the user.
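For illustration, here is a minimal sketch of that idea, assuming SQL Server reached with pyodbc on the DBA side just to run the DDL; the view, table, column and connection names are invented, not from the question:

```python
# Minimal sketch: publish a stable "contract" view for Excel users while the
# underlying tables stay free to change. Assumes SQL Server and pyodbc used
# only by the DBA; all object names here are placeholders for illustration.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=dbserver;DATABASE=sales;Trusted_Connection=yes"
)
cursor = conn.cursor()

# Users only ever query vw_OrderExport; you can split, rename or re-index the
# underlying tables later, as long as the view keeps returning these columns.
cursor.execute("""
    CREATE VIEW dbo.vw_OrderExport AS
    SELECT o.OrderId,
           o.OrderDate,
           c.CustomerName,
           o.TotalAmount
    FROM   dbo.Orders    AS o
    JOIN   dbo.Customers AS c ON c.CustomerId = o.CustomerId
""")
conn.commit()
```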
I agree with you regarding web services - it is not only the volume of data, but also the fact that getting web services to work with Excel (2007 and above) is far from trivial. You would also lock in your DB schema just as much as you would with a view.
For a really, really huge number of records you can consider data warehousing - a separate database to which you provide read-only access for reporting purposes, fed from your read/write database. The feed can be set up easily and quickly via SSIS.
HTH

Related

How are you coping with BigQuery, especially if you came from a traditional RDBMS background like Oracle/MySQL?

I am new to BQ. I have a table with around 200 columns, and when I wanted to get the DDL of this table there was no ready-made option available. CTAS is not always desirable: sometimes we don't have a reference table to create from, and sometimes we just want a simple DDL statement to recreate a table.
I also wanted to edit the schema of a BigQuery table to change a column's mode: the previous mode was NULLABLE and now it should be REQUIRED (the column has only ever been loaded with non-null values so far).
Looking at these scenarios, the lengthy solutions in the Google documentation, and the lack of any direct solution in terms of SQL statements (only API calls, UI steps, scripts, etc.), I am not impressed with BigQuery and its many limitations. The BigQuery web UI is also so small that you need to scroll many times to see a query as a whole, along with many other web UI issues, as you know.
I just wanted to know how you are all handling/coping with BQ.
I would like to elaborate a little more on #Pentium10's and #guillaume blaquiere's comments.
BigQuery is a serverless, highly scalable data warehouse that comes with a built-in query engine, which is capable of running SQL queries on terabytes of data in a matter of seconds, and petabytes in only minutes. You get this performance without having to manage any infrastructure.
BigQuery is based on Google's column-based data processing technology called Dremel and is able to run queries against up to 20 different data sources and 200 GB of data concurrently. The Prediction API allows users to create and train a model hosted within Google's system. The API recognizes historical patterns to make predictions about patterns in new data.
BigQuery is unlike anything that has been used as a big data tool before. Nothing seems to compare to the speed and the amount of data that can be fit into BigQuery. Data views are possible and are recommended with basic data visualization tools.
This product typically comes at the end of the big data pipeline. It is not a replacement for existing technologies, but it complements them. Real-time streams representing sensor data, web server logs or social media graphs can be ingested into BigQuery to be queried in real time. After running ETL jobs on a traditional RDBMS, the resulting data set can be stored in BigQuery. Data can also be ingested from data sets stored in Google Cloud Storage, through direct file import, or through streaming.
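As a small, hedged illustration of that "just run SQL" workflow (and of one way to script the DDL need raised above), here is a sketch using the official google-cloud-bigquery Python client; the project, dataset and table names are placeholders, and it assumes the dataset's INFORMATION_SCHEMA views are available to you:

```python
# Sketch: running ordinary SQL against BigQuery from Python.
# Project, dataset and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # uses Application Default Credentials

# INFORMATION_SCHEMA.TABLES exposes a generated DDL statement per table,
# which covers the "give me a simple DDL to recreate the table" case above.
query = """
    SELECT ddl
    FROM `my-project.my_dataset.INFORMATION_SCHEMA.TABLES`
    WHERE table_name = 'my_wide_table'
"""
for row in client.query(query).result():
    print(row.ddl)
```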
I recommend you have a look at the book Google BigQuery: The Definitive Guide: Data Warehousing, Analytics, and Machine Learning at Scale, which includes a walkthrough of how to use the service and a deep dive into how it works.
Beyond that, I found a really interesting article on Medium for data engineers new to BigQuery, where you can find considerations regarding DDL, the UI and best practices.
I hope you find the above pieces of information useful.

Simple, editable data in AWS

I am working on a project that deals with executing several models one after the other. For this, users need to upload a lot of files (mostly CSV) for each workflow, and each file has several columns.
Since understanding each file is difficult for users of our application, we want to provide friendly names, short descriptions, help texts, etc. for each file and display them on our website.
These names and descriptions should be editable by people who are not developers (but who will have access to the AWS account). So we would prefer storage that provides some convenient user interface for this.
In the world of AWS, what would you recommend as storage for this use case? Is DynamoDB overkill or inconvenient for this?
Should we build a separate user interface and service to implement this feature?
Your choices of user interface and storage are completely independent.
Storage should be selected based upon the type of data and how it will be accessed. It might be relational if you are querying and joining a lot of data, or it might be NoSQL (DynamoDB or even Amazon S3) if you need fast, predictable performance but no complex querying.
The User Interface should not be impacted by the choice of storage. It should present the data for viewing/editing in a way that is most convenient for users. There is no reason to have UI drive the storage choice (unless you simply want to use Google Sheets as your frontend).
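If DynamoDB does turn out to be a fit for the storage side, a minimal sketch with boto3 might look like the following; the table name, key and attribute names are made up for illustration, and a non-developer with console access could maintain these items through the DynamoDB item editor:

```python
# Sketch: per-file friendly names, descriptions and help texts in DynamoDB.
# Table, key and attribute names are placeholders, not from the question.
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("workflow_file_metadata")  # partition key: file_key (string)

table.put_item(
    Item={
        "file_key": "inputs/demand_forecast.csv",
        "friendly_name": "Demand forecast",
        "description": "Monthly demand per SKU consumed by the forecasting model.",
        "help_text": "One row per SKU and month; dates in YYYY-MM format.",
    }
)

item = table.get_item(Key={"file_key": "inputs/demand_forecast.csv"}).get("Item")
print(item["friendly_name"], "-", item["description"])
```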

Redshift as a Web App Backend?

I am building an application (using Django's ORM) that will ingest a lot of events, let's say 50/s (1-2 KB per message). Initially, some "real time" processing and monitoring of the events is in scope, so I'll be using Redis to keep some of that data for making decisions, expunging it when that makes sense. I was going to persist all of the entities, including events, in Postgres for "at rest" storage for now.
In the future I will need "analytical" capability for dashboards and other features. I want to use Amazon Redshift for this. I considered just going straight for Redshift and skipping Postgres. But I also see folks say that it should play more of a passive role. Maybe I could keep a window of data in the SQL backend and archive to Redshift regularly.
My questions are: Is it even normal to use something like Redshift as a backend for web applications, or does it typically play more of a passive role? If not, is it realistic to think I can scale Postgres enough for the event data and start with only that? Also, if not, does the "window of data and archival" method make sense?
EDIT: Here are some things I had seen before writing this post:
Some say "yes, go for it" regarding the "should I use Redshift for this?" question.
Others say "eh, not performant enough for most web apps" and support the "front it with a PostgreSQL database" camp.
Redshift (ParAccel) is an OLAP-optimised DB, based on a fork of a very old version of PostgreSQL.
It's good at parallelised read-mostly queries across lots of data. It's bad at many small transactions, especially many small write transactions as seen in typical OLTP workloads.
You're partway in between. If you don't mind a data loss window, then you could reasonably accumulate data points and have a writer thread or two write batches of them to Redshift in decent sized transactions.
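A minimal sketch of that accumulate-and-flush pattern, assuming psycopg2 (Redshift speaks the PostgreSQL wire protocol) and made-up table, column and connection details:

```python
# Sketch: buffer events in memory and flush them to Redshift in large batches,
# accepting a small data-loss window. All names and the DSN are placeholders.
import queue
import threading
import psycopg2
from psycopg2.extras import execute_values

events = queue.Queue()  # producers put (occurred_at, payload) tuples here

def writer(batch_size=5000, flush_seconds=30):
    conn = psycopg2.connect("host=my-cluster.example dbname=events user=loader")
    while True:
        batch = []
        try:
            batch.append(events.get(timeout=flush_seconds))
            while len(batch) < batch_size:
                batch.append(events.get_nowait())
        except queue.Empty:
            pass
        if batch:
            with conn, conn.cursor() as cur:  # one decent-sized transaction per flush
                execute_values(
                    cur,
                    "INSERT INTO raw_events (occurred_at, payload) VALUES %s",
                    batch,
                )

threading.Thread(target=writer, daemon=True).start()
```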
If you can't afford any data loss window and expect to be processing 50+ TPS, then don't consider using Redshift directly. The round-trip costs alone would be horrifying. Use a local database - or even a file based append-only journal that you periodically rotate. Then periodically upload new data to Redshift for analysis.
A few other good reasons you probably shouldn't use Redshift directly:
OLAP DBs with column store designs often work best with star schemas or similar structures. Such schemas are slow and inefficient for OLTP workloads as inserts and updates touch many tables, but they make querying the data along various axes for analysis much more efficient.
Using an ORM to talk to an OLAP DB is asking for trouble. ORMs are bad enough on OLTP-optimised DBs, with their unfortunate tendency toward n+1 SELECTs and/or wasteful chained left joins, their habit of doing many small inserts instead of a few big ones, etc. This will be even worse on most OLAP-optimised DBs (see the small illustration after this list).
Redshift is based on a painfully old version of PostgreSQL, with a bunch of limitations and incompatibilities. Code written for normal PostgreSQL may not work with it.
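To make the n+1 point concrete, here is a tiny illustration with made-up Django models; it assumes a normal Django project/app context (settings, migrations, etc.) and nothing here comes from the original question:

```python
# Made-up models purely to illustrate the n+1 SELECT pattern.
from django.db import models

class Device(models.Model):
    name = models.CharField(max_length=100)

class Event(models.Model):
    device = models.ForeignKey(Device, on_delete=models.CASCADE)
    payload = models.TextField()

# n+1: one query for all events, then one additional query per event's device.
for event in Event.objects.all():
    print(event.device.name)

# One joined query instead; the difference is even more painful on an OLAP store.
for event in Event.objects.select_related("device"):
    print(event.device.name)
```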
Personally I'd avoid an ORM entirely for this - I'd just accumulate data locally in SQLite or a local PostgreSQL instance or something, sending multi-valued INSERTs or using PostgreSQL's COPY to load chunks of data from an in-memory buffer as I received them. Then I'd use appropriate ETL tools to periodically transform the data from the local DB and merge it with what was already on the analytics server.
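A sketch of that local-buffer idea with psycopg2, COPYing a CSV chunk straight from memory into a made-up staging table; the table, columns and DSN are placeholders:

```python
# Sketch: build a CSV chunk in memory and COPY it into a local PostgreSQL
# staging table. Table/column names and the DSN are placeholders.
import csv
import io
import psycopg2

rows = [
    ("2014-06-01T12:00:00Z", '{"sensor": 7, "value": 3.2}'),
    ("2014-06-01T12:00:01Z", '{"sensor": 9, "value": 1.8}'),
]  # whatever has accumulated in the in-memory buffer

buf = io.StringIO()
csv.writer(buf).writerows(rows)
buf.seek(0)

conn = psycopg2.connect("dbname=staging user=loader")
with conn, conn.cursor() as cur:
    cur.copy_expert(
        "COPY raw_events (occurred_at, payload) FROM STDIN WITH (FORMAT csv)",
        buf,
    )
# A periodic ETL job then transforms raw_events and merges it into the analytics server.
```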
Now forget everything I just said and go do some benchmarks with a simulation of your app's workload. That's the only really useful way to tell.
In addition to Redshift's slow transaction processing (by modern DB standards), there's another big challenge:
Redshift only supports serializable transaction isolation, most likely as a compromise to support ACID transactions while also optimizing for parallel, mostly-read OLAP workloads.
That can result in all kinds of concurrency-related failures that would not have been failures on a typical DB that supports read-committed isolation by default.
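If you do end up talking to Redshift (or any serializable-only database) directly, expect to retry aborted transactions. A rough sketch with psycopg2 follows; the DSN and SQL are placeholders, and the exact error code your Redshift cluster reports should be verified before relying on this check:

```python
# Sketch: retry a whole transaction when a serialization conflict aborts it.
# DSN and SQL are placeholders; verify the error code your Redshift cluster
# actually reports (PostgreSQL itself uses SQLSTATE 40001 for this).
import time
import psycopg2

def is_serialization_conflict(exc):
    return exc.pgcode == "40001" or "serializable isolation" in str(exc).lower()

def run_with_retry(dsn, sql, params=None, attempts=5):
    conn = psycopg2.connect(dsn)
    for attempt in range(attempts):
        try:
            with conn, conn.cursor() as cur:  # commit on success, rollback on error
                cur.execute(sql, params)
            return
        except psycopg2.Error as exc:
            if not is_serialization_conflict(exc) or attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)  # back off, then retry the whole transaction
```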

Efficient way to transfer data from one django application to another

Currently, I'm working on a project where I have a server-client relationship between two Django applications running on separate hosts.
The server has to store and provide a large amount of relational data, e.g. Suppliers, Companies, Products, etc.
The client downloads data on request from the server and adds it to its own database. Clients can also upload from their station to the server database to expand it.
The previous person who developed this used XML-RPC to transfer the vast (typically 13 MB) XML file from server to client. Really, all we're sending are database-agnostic objects to be stored in a database, so I wondered if there was a more efficient way of doing it.
Please ask for more details if you need them; I wasn't really sure what you'd need to know.
EDIT: Efficient in terms of networking and server-side processing. Clients can do the heavy lifting.
A shared database design seems more suitable. But of course there may be security, political or organisational reasons ruling that out. Plus there would be significant re-design required.
To reduce network bandwidth, first check that HTTP gzip compression is enabled.
If it's just a dumb data transfer, JSON would generally be a lot more compact than XML-RPC. Does the data look amenable to a straight translation to JSON? This would still require some server-side processing.
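For instance, a plain JSON endpoint in Django could be sketched like this; the Supplier model and its fields are invented placeholders, and gzip_page compresses the response when the client sends an Accept-Encoding: gzip header:

```python
# Sketch: serve the objects as gzip-compressed JSON instead of an XML-RPC blob.
# The Supplier model, its fields and the app name are placeholders.
from django.http import JsonResponse
from django.views.decorators.gzip import gzip_page

from myapp.models import Supplier  # placeholder app/model


@gzip_page
def export_suppliers(request):
    rows = list(Supplier.objects.values("id", "name", "country"))
    return JsonResponse(rows, safe=False)  # safe=False permits a top-level list
```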
For minimal server-side processing (if the database tables are relatively similar), it may be very efficient to just send the client a dump of the relevant DB query. Of course, unless the tables have the same schema, you would have to do some client-side processing of the raw SQL, which is not ideal.

Data Warehouse and Django

This is more of an architectural question than a technological one per se.
I am currently building a business website/social network that needs to store large volumes of data and use that data to draw analytics (consumer behavior).
I am using Django and a PostgreSQL database.
Now, here is my question: I want to expand this architecture to include a data warehouse. Ideally, the operational DB would remain the current Django PostgreSQL database, and the data warehouse would be something additional, preferably with a multidimensional model.
We are still in a very early phase; we are going to test with 50 users, so something primitive, such as a one-column table, would be enough for starters.
I would like to know if somebody has experience with this situation and could recommend a framework for creating a data warehouse, all while maintaining the operational DB with the Django models for ease of use (if possible).
Thank you in advance!
Here are some cool Open Source tools I used recently:
Kettle - great ETL tool, you can use this to extract the data from your operational database into your warehouse. Supports any database with a JDBC driver and makes it very easy to build e.g. a star schema.
Saiku - nice Web 2.0 frontend built on Pentaho Mondrian (MDX implementation). This allows your users to easily build complex aggregation queries (think Pivot table in Excel), and the Mondrian layer provides caching etc. to make things go fast. Try the demo here.
My answer does not necessarily apply to data warehousing. In your case I see the possibility of implementing a NoSQL database solution alongside your OLTP relational storage, which in this case is PostgreSQL.
Why consider NoSQL? In addition to the obvious scalability benefits, NoSQL databases offer a number of advantages that will probably apply to your scenario, for instance the flexibility of having records with different sets of fields, and key-based access.
Since you're still in the "trial" stage, you might find it easier to choose a NoSQL database solution based on your hosting provider. For instance, AWS has SimpleDB, Google App Engine provides its own Datastore, etc. However, there are plenty of other NoSQL solutions you can go for that have nice Python bindings.