How to subtract 2 columns with dtype = object within a data frame to form a new column of the difference - pandas - python-2.7

I have a merged data frame (mdf) built from 2 data frames retrieved from SQL. I wish to create a new column within mdf which will be the subtraction of 2 existing columns.

I'm not sure what you mean by a "merged data frame," but here's a sketch of what you might be after. Please elaborate on your question a little so it will be more useful to others.
df = pd.read_sql('select ....', some_sql_connection)
df['difference'] = df['some column name'] - df['another column name']
Also, referring to the title of your question where you mention dtype=object, data extracted from a SQL database sometimes defaults to the generic object datatype, even if it is actually numeric. (This is not ideal, and better handling of datatypes to and from SQL databases is being actively improved for a future release of pandas.)
For now, before manipulating your data, you might want to run df.convert_objects(convert_numeric=True) if you have all numerical data. See documentation.
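For reference, a minimal sketch of the convert-then-subtract flow (the column names col_a and col_b are hypothetical, and pd.to_numeric is shown as a newer alternative to convert_objects, which was later deprecated):
import pandas as pd
df = pd.read_sql('select col_a, col_b from some_table', some_sql_connection)
# coerce the object columns to numeric; anything unparseable becomes NaN
df['col_a'] = pd.to_numeric(df['col_a'], errors='coerce')
df['col_b'] = pd.to_numeric(df['col_b'], errors='coerce')
df['difference'] = df['col_a'] - df['col_b']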

Related

Superset partition graph type order by

We are creating several charts in Superset, and with the partition type chart the ORDER BY seems to be hard coded and we cannot change it. The goal is to have the months on the left in the correct order (the column in this case is Month). When run in SQL Lab it works in the correct order, but in the chart view we cannot change the ordering.
Any suggestions?
I assume you mean the dates on the right here?
I work with Superset and I have experienced this limitation, which does appear to be hard-coded into the ordering once a chart is made. If it isn't too much hassle, I would suggest adding another column to your database alongside the text value and following a pattern like this (with your_table standing in for your actual table):
WHERE "Month" = 'January' SET "OrderingColumn" = 'A'
WHERE "Month" = 'February' SET "OrderingColumn" = 'B'
etc etc
Then in your charts you can try: ORDER BY "OrderingColumn"
It is a bit of an inconvenience, but if you are able to manipulate your data by changing tables or views, this seems to be a solution you could use.
I hope this is useful, even if only as a different way of approaching the problem.

Alternatives to dynamically creating model fields

I'm trying to build a web application where users can upload a file (specifically the MDF file format) and view the data in the form of various charts. The files can contain any number of time-based signals (various numeric data types) and users may name the signals however they like.
My thought on saving the data involves 2 steps:
Maintain a master table as an index, to save such meta information as file names, who uploaded it, when, etc. Records (rows) are added each time a new file is uploaded.
Create a new table (I'll refer to these as data tables) for each uploaded file; within the table, each column will be one signal (the first column being the timestamps).
This brings the problem that I can't pre-define the Model for the data tables because the number, name, and datatype of the fields will differ among virtually all uploaded files.
I'm aware of some libs that help to build runtime dynamic models but they're all dated and questions about them on SO basically get zero answers. So despite the effort to make it work, I'm not even sure my approach is the optimal way to do what I want to do.
I also came across this Postgres-specific model field which can take nested arrays (which I believe fits the 2-D time-based signal lists). In theory I could parse the raw uploaded file, construct such an array, and basically save all the data in one field. Not knowing the limit on the size of the data, this could also be a nightmare for queries later on, since building the charts usually requires only a few signal columns at a time, out of a total of up to hundreds of signals.
So my question is:
Is there a better way to organize the storage of data? And how?
Any insight is greatly appreciated!
If the names, number, and datatypes of the fields will differ for each user, then you do not need an ORM. What you need is a query builder or SQL string composition like Psycopg's. You will be programmatically creating a table for each combination of user and uploaded file (if they are different) and programmatically inserting the records.
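A minimal sketch of that approach with psycopg2's sql module (the table name, signal names, and types below are hypothetical; in practice they would be parsed from the uploaded MDF file, and the type strings should come from a whitelist you control):
import psycopg2
from psycopg2 import sql
def create_data_table(conn, table_name, signal_columns):
    # signal_columns: list of (column_name, postgres_type) pairs parsed from the uploaded file
    column_defs = [sql.SQL("{} {}").format(sql.Identifier(name), sql.SQL(col_type))
                   for name, col_type in signal_columns]
    query = sql.SQL("CREATE TABLE {} (ts double precision, {})").format(
        sql.Identifier(table_name), sql.SQL(", ").join(column_defs))
    with conn.cursor() as cur:
        cur.execute(query)
    conn.commit()
conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
create_data_table(conn, "upload_42_data", [("engine_rpm", "real"), ("vehicle_speed", "real")])
Column names go through sql.Identifier, so arbitrary user-chosen signal names are quoted safely; only the type strings need to be restricted to values you trust.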
Using PostgreSQL might be a good choice; you might also create a GIN index on the arrays to speed up queries.
However, if you are primarily working with time-series data, then using a time-series database like InfluxDB or Prometheus makes more sense.

Google Big Query splitting an ingestion time partitioned table

I have an ingestion time partitioned table that's getting a little large. I wanted to group by the values in one of the columns and use that to split it into multiple tables. Is there an easy way to do that while retaining the original _PARTITIONTIME values in the set of new ingestion time partitioned tables?
Also I'm hoping for something that's relatively simple/cheap. I could do something like copy my table a bunch of times and then delete the data for all but one value on each copy, but I'd get charged a huge amount for all those DELETE operations.
Also I have enough unique values in the column I want to split on that saving a "WHERE column = value" query result to a table for every value would be cost prohibitive. I'm not finding any documentation that mentions whether this approach would even preserve the partitions, so even if it weren't cost prohibitive it may not work.
The case you describe requires two-level partitioning, which is not supported yet.
You can create a column-partitioned table: https://cloud.google.com/bigquery/docs/creating-column-partitions
Then populate the column used for partitioning as needed before insert - but in this case you lose the _PARTITIONTIME value.
Based on the additional clarification - I had a similar problem - and my solution was to write a Python application that reads the source table (reading is important here - not querying - so it is free), splits the data based on your criteria, and then either streams the data (simple, but not free) or generates json/csv files and loads them into the target tables (also free, but with some limits on the number of such operations). The second route requires more coding/exception handling.
You can also do it via Dataflow - it will definitely be more expensive than a custom solution, but potentially more robust.
Example using the google-cloud-bigquery Python library:
from google.cloud import bigquery
client = bigquery.Client(project="PROJECT_NAME")
t1 = client.get_table(source_table_ref)  # reference to the source table, defined elsewhere
target_schema = t1.schema[1:]  # drop the first column, which is the key to split on
ds_target = client.dataset(project=target_project, dataset_id=target_dataset)
# read a page of rows directly from the table (a free read, not a query)
rows_to_process_iter = client.list_rows(t1, start_index=start_index, max_results=max_results)
rows_to_process = list(rows_to_process_iter)  # convert to list
# ... split the records by the key column here ...
# stream the records for one key to its destination table
errors = client.create_rows(target_table, records_to_stream)
BigQuery now supports clustered partitioned tables, which allow you to specify additional columns that the data should be split by.
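A minimal sketch of creating a partitioned and clustered table with a recent version of the Python client library (the dataset, table, and column names, and the choice of event_date as the partitioning column, are assumptions for illustration):
from google.cloud import bigquery
client = bigquery.Client(project="PROJECT_NAME")
schema = [
    bigquery.SchemaField("event_date", "DATE"),   # hypothetical partitioning column
    bigquery.SchemaField("category", "STRING"),   # hypothetical column you want to split/cluster by
    bigquery.SchemaField("value", "FLOAT"),
]
table = bigquery.Table("PROJECT_NAME.my_dataset.my_clustered_table", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(field="event_date")
table.clustering_fields = ["category"]
client.create_table(table)
Queries that filter on the clustering column then scan only the matching blocks within each partition, which can achieve much of what a physical split into separate tables would.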

How to work with data different from schema in pandas python

I am currently using pandas (0.22.0) with read_table with names.
How can I address when my underlying data schema changes?
For example, my read_table is reading 5 columns and the data file has 5 columns. How would I handle changes in the data, e.g. when a new column is added? Does that mean I have to update the schema whenever the data format changes? Is there a way to ignore the columns not mentioned via names in pandas read_table?
There is a usecols parameter that you can pass to read_table to read only a subset of the available columns. As long as the 5 columns that you are concerned with are always present, you should be able to name them explicitly in the call.
cols_of_interest = ['col1', 'col2', 'col3', 'col4', 'col5']
df = pd.read_table(file_path, usecols=cols_of_interest)
Documentation for pd.read_table here - https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html
Note that you can also pass a callable that decides which columns to parse, or specify column indices instead of named columns (depending on the underlying data, I guess).
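For example, a minimal sketch of the callable form (reusing the hypothetical column names from above), which silently ignores any extra columns that later appear in the file:
import pandas as pd
cols_of_interest = ['col1', 'col2', 'col3', 'col4', 'col5']
# the callable is evaluated against each column name in the file;
# only columns for which it returns True are parsed
df = pd.read_table(file_path, usecols=lambda col: col in cols_of_interest)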
The problem I have here is that I am iterating over data files with a set schema using read_table and names. I do not want to update the schema every time the underlying data changes.
I found a work-around (more of a hack) at this point: I added a few 'dummy' columns to the names array.

Fetching data from large BigQuery table in python

What I have is a BigQuery table(>5mil rows).
I need to fetch this data in batches and process it inside AppEngine, python.
The only way to fetch from a table that I know of is to run a SELECT query on it and then iterate over the result using the tokens that fetch_data returns.
It looks like this:
query = u"""\
SELECT url FROM %s
""" % (query_table)
query_job = client.run_async_query(str(uuid.uuid4()), query)
query_job.begin()
wait_for_job(query_job, 1)
query_results = query_job.results()
rows, total_rows, next_token = query_results.fetch_data(max_results=per_page, page_token=page_token)
This works on smaller tables, but on larger ones like mine it asks me to allow large requests and specify a target table. That makes no sense to me - just to fetch data from a table, do I have to copy it to another table?
What you are running into is described in this documentation. In summary, apart from the limit on how much data can be fetched at a time, there is a point where your results become "large results." This is when your results are more than 128MB compressed, as described here. When your results are classified as large, you can only store the result of a query in a table in BigQuery.
Unfortunately I'm not sure there's a nice way to do what you want without reducing how many rows you are retrieving at once. What you'll likely need to do is explore the exporting data documentation for BigQuery.
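A minimal sketch of that export route with a recent version of the Python client library (the project, dataset, table, and bucket names are hypothetical):
from google.cloud import bigquery
client = bigquery.Client(project="PROJECT_NAME")
table_ref = client.dataset("my_dataset").table("my_table")
destination_uri = "gs://my-bucket/my_table-*.csv.gz"   # wildcard lets BigQuery shard the output
job_config = bigquery.ExtractJobConfig()
job_config.compression = bigquery.Compression.GZIP
extract_job = client.extract_table(table_ref, destination_uri, job_config=job_config)
extract_job.result()  # block until the export job finishes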
You should use the tabledata.list API for fetching data from a table.
Using the startIndex or pageToken parameters together with maxResults, you can control the size of the page you fetch.
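In the Python client library this corresponds to list_rows, which calls tabledata.list under the hood and is not billed as a query. A sketch, assuming a newer client version than the one in the question and a hypothetical table:
from google.cloud import bigquery
client = bigquery.Client(project="PROJECT_NAME")
table = client.get_table("PROJECT_NAME.my_dataset.my_table")
rows_iter = client.list_rows(table, page_size=10000)   # page size controls the batch per API call
for page in rows_iter.pages:
    for row in page:
        handle_row(row)   # handle_row is a hypothetical per-row processing function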
I think this is exactly what you need (link): as far as I understood it, you can't get a large result from a query directly, but you can get an entire table's data into your app no matter how big it is. That's why you need to put the large result in a table and then fetch that table's data into your app and do whatever you want with it.
Good luck :)