FLATTEN results using MAX value in BigQuery - google-cloud-platform

I need to flatten the probabilities column in my results, keeping only the entry with the max value:
original   predicted   probabilities.label   probabilities.prob
<=50K      >50K        >50K                  0.5377828170971353
                       <=50K                 0.46221718290286473
<=50K      <=50K       >50K                  0.05434716579642335
                       <=50K                 0.9456528342035766
I would like to flatten the result, but with this query I just get the table above, and using the BigQuery Python client I get: [object Object],[object Object]
SELECT
  original,
  predicted,
  probabilities
FROM
  ML.PREDICT(MODEL `my_dataset.my_model`,
    (
      SELECT
        *
      FROM
        `bigquery-public-data.ml_datasets.census_adult_income`
    ))

Your probabilities field is a REPEATED RECORD, i.e., an array of structs. You can use a subquery to iterate over the array and select the max probability, like this:
SELECT
  original,
  predicted,
  (SELECT p
   -- Iterate over the array
   FROM UNNEST(probabilities) AS p
   -- Order by probability and get the first result
   ORDER BY p.prob DESC
   LIMIT 1) AS probabilities
FROM
  ML.PREDICT(MODEL `my_dataset.my_model`,
    (
      SELECT
        *
      FROM
        `bigquery-public-data.ml_datasets.census_adult_income`
    ))
The result will look like this (one row per input row, keeping only the highest-probability entry):
original   predicted   probabilities.label   probabilities.prob
<=50K      >50K        >50K                  0.5377828170971353
<=50K      <=50K       <=50K                 0.9456528342035766
The Python result you got looks more like a JavaScript representation of an object. Here's how I did it in Python:
from google.cloud import bigquery

client = bigquery.Client()

# Perform a query.
sql = ''' SELECT ... '''  # Your query
query_job = client.query(sql)
rows = query_job.result()  # Waits for the query to finish
for row in rows:
    print(row.values())
Output:
(' >50K', ' >50K', {'label': ' >50K', 'prob': 0.5218586871072727})
(' >50K', ' >50K', {'label': ' >50K', 'prob': 0.5907989087876587})
(' >50K', ' >50K', {'label': ' >50K', 'prob': 0.734145221825564})
Note that probabilities is a STRUCT data type in BigQuery SQL, so it's mapped to a Python dict.
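If you want the label and probability as plain values on the Python side, you can index into the row and the dict directly; a small sketch reusing the rows iterator above (the label and prob keys are taken from the output shown):
for row in rows:
    p = row['probabilities']  # dict with 'label' and 'prob' keys, as in the output above
    print(row['original'], row['predicted'], p['label'], p['prob'])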
Check the BigQuery quickstart for more information on client libraries.

Related

Adding constant values at the beginning of a dataframe in pyspark

I am trying to read a CSV file from an HDFS location, and three columns (batchid, load timestamp, and a delete indicator) need to be added at the beginning. I am using Spark 2.3.2 and Python 2.7.5. Sample values for the three columns to be added are given below.
batchid- YYYYMMdd (int)
Load timestamp - current timestamp (timestamp)
delete indicator - blank (string)
Your question is a little bit obscure, but you can do something along these lines. First, create your timestamp using Python:
import time
import datetime
timestamp = datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')
Then, assuming you use the DataFrame API, you plug that into your columns:
import pyspark.sql.functions as psf
df = (df
      .withColumn('time',
                  psf.unix_timestamp(
                      psf.lit(timestamp), 'yyyy-MM-dd HH:mm:ss'
                  ).cast('timestamp'))
      .withColumn('batchid', psf.date_format('time', 'yyyyMMdd').cast('int'))  # YYYYMMdd as int
      .withColumn('delete', psf.lit('')))
To reorder your columns:
cols = ["time", "batchid", "delete"]
df = df.select(*cols + [k for k in df.columns if k not in cols])
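Putting it all together, a minimal end-to-end sketch (the HDFS path and read options here are placeholders, not taken from your file):
import datetime
import pyspark.sql.functions as psf
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical HDFS path; adjust header/schema options to match your file
df = spark.read.csv('hdfs:///path/to/input.csv', header=True)

load_ts = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')

df = (df
      .withColumn('time', psf.unix_timestamp(psf.lit(load_ts), 'yyyy-MM-dd HH:mm:ss').cast('timestamp'))
      .withColumn('batchid', psf.date_format('time', 'yyyyMMdd').cast('int'))
      .withColumn('delete', psf.lit('')))

# move the three new columns to the front
front = ['batchid', 'time', 'delete']
df = df.select(*front + [c for c in df.columns if c not in front])
df.show(truncate=False)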

Remove repeated substring in column and only return words in between

I have the following dataframe:
Column1 Column2
0 .com<br><br>Finance<br><br><br><br><br><br><br><br><br><br><br><br> .comFinance
1 .com<br><br>Finance<br><br><br><br><br>DO<br><br><br><br><br><br><br> .comFinanceDO
2 <br><br>Finance<br><br><br>ISV<br><br>DO<br>DO Prem<br><br><br><br><br><br> FinanceISVDODO Prem
3 <br><br>Finance<br><br><br><br><br><br><br><br><br><br><br><br> Finance
4 <br><br>Finance<br><br><br>TTY<br><br><br><br><br><br><br><br><br> ConsultingTTY
I used the following line of code to get Column2:
df['Column2'] = df['Column1'].str.replace('<br>', '', regex=True)
I want to remove all instances of "<br>", so I want the column to look like this:
Column2
.com, Finance
.com, Finance, DO
Finance, ISV, DO, DO Prem
Finance
Consulting, TTY
Given the following dataframe:
Column1
.com<br><br>Finance<br><br><br><br><br><br><br><br><br><br><br><br>
.com<br><br>Finance<br><br><br><br><br>DO<br><br><br><br><br><br><br>
<br><br>Finance<br><br><br>ISV<br><br>DO<br>DO Prem<br><br><br><br><br><br>
<br><br>Finance<br><br><br><br><br><br><br><br><br><br><br><br>
<br><br>Finance<br><br><br>TTY<br><br><br><br><br><br><br><br><br>
df['Column2'] = df['Column1'].str.replace('<br>', ' ', regex=True).str.strip().replace('\\s+', ', ', regex=True) doesn't work because of sections like <br>DO Prem<br>, which will end up like DO, Prem, not DO Prem.
Split on <br> to make a list, then use a list comprehension to remove the '' empty strings.
This will preserve spaces where they're supposed to be.
Join the list values back into a string with (', ').join([...])
import pandas as pd
df['Column2'] = df['Column1'].str.split('<br>').apply(lambda x: (', ').join([y for y in x if y != '']))
# output
Column1 Column2
.com<br><br>Finance<br><br><br><br><br><br><br><br><br><br><br><br> .com, Finance
.com<br><br>Finance<br><br><br><br><br>DO<br><br><br><br><br><br><br> .com, Finance, DO
<br><br>Finance<br><br><br>ISV<br><br>DO<br>DO Prem<br><br><br><br><br><br> Finance, ISV, DO, DO Prem
<br><br>Finance<br><br><br><br><br><br><br><br><br><br><br><br> Finance
<br><br>Finance<br><br><br>TTY<br><br><br><br><br><br><br><br><br> Finance, TTY
### Replace <br> with a space
df['Column2'] = df['Column1'].str.replace('<br>', ' ')
### Get rid of spaces before and after the string
df['Column2'] = df['Column2'].str.strip()
### Replace the remaining whitespace with ,
df['Column2'] = df['Column2'].str.replace('\\s+', ',', regex=True)
As pointed out by TrentonMcKinney, his solution is better. This one doesn't handle the case where there is a space inside a value in Column1 (e.g. DO Prem).
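If you do want a single regex-based line that keeps spaces inside values like DO Prem intact, a possible sketch (not from either answer) is to collapse each run of <br> tags into one separator and then trim the edges:
import pandas as pd

df = pd.DataFrame({'Column1': [
    '.com<br><br>Finance<br><br><br>',
    '<br><br>Finance<br><br><br>ISV<br><br>DO<br>DO Prem<br><br>',
]})

df['Column2'] = (df['Column1']
                 .str.replace(r'(?:<br>)+', ', ', regex=True)  # each run of <br> -> one ', '
                 .str.strip(', '))                             # drop leading/trailing separators
print(df['Column2'].tolist())
# ['.com, Finance', 'Finance, ISV, DO, DO Prem']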

Python: How to print specific columns with cut some strings on one of the column reading csv

I am new to Python, so apologies for the basic question.
I've a csv file in the below mentioned format.
##cat temp.csv
Id,Info,TimeStamp,Version,Numitems,speed,Path
18699504331,NA/NA/NA,2017:01:01:13:40:31,3.16,6,781.2kHz,/home/user1
31287345804,NA/NA/NA,2017:01:03:14:35:04,3.16,2,111.5MHz,/home/user2
16360534162,NA/NA/NA,2017:01:02:21:39:51,3.16,3,230MHz,/home/user3
I want to read the csv and print only specific columns of interest, cutting some strings in one of the columns into a readable form so I can use it.
Here is the Python code:
cat temp.py
import csv

with open('temp.csv') as cvsfile:
    readcsv = csv.reader(cvsfile, delimiter=',')
    Id = []
    Info = []
    Timestamp = []
    Version = []
    Numitems = []
    Speed = []
    Path = []
    for row in readcsv:
        lsfid = row[0]
        modelinfo = row[1]
        timestamp = row[2]
        compilever = row[3]
        numofavb = row[4]
        frequency = row[5]
        designpath = row[6]
        Id.append(lsfid)
        Info.append(modelinfo)
        Timestamp.append(timestamp)
        Version.append(compilever)
        Numitems.append(numofavb)
        Speed.append(frequency)
        Path.append(designpath)

print(Id)
print(Info)
print(Timestamp)
print(Version)
print(Numitems)
print(Speed)
print(Path)
Output:
python temp.py
['Id', '18699504331', '31287345804', '16360534162', '18772620814', '18699504331', '31287345804', '16360534162']
['Info', 'NA/NA/NA', 'NA/NA/NA', 'NA/NA/NA', 'NA/NA/NA', 'NA/NA/NA', 'NA/NA/NA', 'NA/NA/NA']
['TimeStamp', '2017:01:01:13:40:31', '2017:01:03:14:35:04', '2017:01:02:21:39:51', '2017:01:03:14:40:47', '2017:01:01:13:40:31', '2017:01:03:14:35:04', '2017:01:02:21:39:51']
['Version', '3.16', '3.16', '3.16', '3.16', '3.16', '3.16', '3.16']
['Numitems', '6', '2', '3', '2', '6', '2', '3']
['speed', '781.2kHz', '111.5MHz', '230MHz', '100MHz', '781.2kHz', '111.5MHz', '230MHz']
['Path', '/home/user1', '/home/user2', '/home/user3', '/home/user4', '/home/user5', '/home/user6', '/home/user7']
But what I want is a well-organized look with my choice of columns printed, something like below...
Id Info TimeStamp Version Numitems speed Path
18699504331 NA/NA/NA 2017:01:01:13:40:31 3.16 6 781.2kHz user1
31287345804 NA/NA/NA 2017:01:02:21:39:51 3.16 2 111.5MHz user2
31287345804 NA/NA/NA 2017:01:02:21:39:51 3.16 2 111.5MHz user3
Any help could be greatly appreciated!
Thanks in Advance
Velu.V
Check out numpy's genfromtxt function. You can use the usecols keyword argument to specify that you only want to read certain columns, see also https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html . For example, let's say we have the following csv sheet:
col1 , col2 , col3
0.5, test, 0.3
0.7, test2, 0.1
Then,
import numpy as np

# f is the path to your csv file
table = np.genfromtxt(f, delimiter=',', skip_header=0, dtype='S', usecols=[0, 1])
will load the first two columns. You can then use the tabulate package (https://pypi.python.org/pypi/tabulate) to nicely print out your table.
from tabulate import tabulate

print(tabulate(table, headers='firstrow'))
Will look like:
col1 col2
------- --------
0.5 test
0.7 test2
Hope that answers your question
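If you'd rather stay with the standard library, a small sketch with csv.DictReader (assuming the header names from your temp.csv, and trimming /home/userN down to userN) would be:
import csv

with open('temp.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    fmt = '{:<13} {:<10} {:<21} {:<9} {:<9} {:<10} {}'
    print(fmt.format('Id', 'Info', 'TimeStamp', 'Version', 'Numitems', 'speed', 'Path'))
    for row in reader:
        user = row['Path'].rsplit('/', 1)[-1]  # /home/user1 -> user1
        print(fmt.format(row['Id'], row['Info'], row['TimeStamp'],
                         row['Version'], row['Numitems'], row['speed'], user))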

Python - creating a dictionary from large text file where the key matches regex pattern

My question: how do I create a dictionary from a list by assigning dictionary keys based on a regex pattern match ('^--L-[0-9]{8}'), and assigning the values by using all lines between each key?
Example excerpt from the raw file:
SQL> --L-93752133
SQL> --SELECT table_name, tablespace_name from dba_tables where upper(table_name) like &tablename_from_developer;
SQL>
SQL> --L-52852243
SQL>
SQL> SELECT log_mode FROM v$database;
LOG_MODE
------------
NOARCHIVELOG
SQL>
SQL> archive log list
Database log mode No Archive Mode
Automatic archival Disabled
Archive destination USE_DB_RECOVERY_FILE_DEST
Oldest online log sequence 3
Current log sequence 5
SQL>
SQL> --L-42127143
SQL>
SQL> SELECT t.name "TSName", e.encryptionalg "Algorithm", d.file_name "File Name"
2 FROM v$tablespace t
3 , v$encrypted_tablespaces e
4 , dba_data_files d
5 WHERE t.ts# = e.ts#
6 AND t.name = d.tablespace_name;
no rows selected
Some additional detail: The raw file can be large (at least 80K+ lines, but often much larger) and I need to preserve the original spacing so the output is still easy to read. Here's how I'm reading the file in and removing "SQL>" from the beginning of each line:
import re

with open(rawFile, 'r') as inFile:
    content = inFile.read()
rawList = content.splitlines()
for line in rawList:
    cleanLine = re.sub('^SQL> ', '', line)
Finding the dictionary keys I'm looking for is easy:
pattern = re.compile(r'^--L-[0-9]{8}')
if pattern.search(cleanLine) is not None:
    itemID = pattern.search(cleanLine)
    print(itemID.group(0))
But how do I assign all lines between each key as the value belonging to the most recent key preceding them? I've been playing around with new lists, tuples, and dictionaries but everything I do is returning garbage. The goal is to have the data and keys linked to each other so that I can return them as needed later in my script.
I spent a while searching for a similar question, but in most other cases the source file was already in a dictionary-like format so creating the new dictionary was a less complicated problem. Maybe a dictionary or tuple isn't the right answer, but any help would be appreciated! Thanks!
In general, you should question why you would read the entire file, split the lines into a list, and then iterate over the list. This is a Python anti-pattern.
For line oriented text files, just do:
with open(fn) as f:
    for line in f:
        # process a line
It sounds, however, like you have multi-line, block-oriented patterns. If so, with smaller files, read the entire file into a single string and use a regex on that. Then you would use group 1 and group 2 as the key and value in your dict:
pat = re.compile(pattern, flags)
with open(file_name) as f:
    di = {m.group(1): m.group(2) for m in pat.finditer(f.read())}
With a larger file, use a mmap:
import re, mmap

pat = re.compile(pattern, flags)
with open(file_name, 'r+') as f:
    mm = mmap.mmap(f.fileno(), 0)
    for i, m in enumerate(pat.finditer(mm)):
        # process each block accordingly...
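Note that on Python 3 a regex run against an mmap has to be a bytes pattern and the file has to be opened in binary mode; a minimal sketch of the same idea:
import re, mmap

pat = re.compile(rb'--L-[0-9]{8}')                  # bytes pattern for use with mmap
with open(file_name, 'rb') as f:                    # file_name as above
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for m in pat.finditer(mm):
        print(m.group(0))                           # e.g. b'--L-93752133'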
As far as the regex goes, I am a little unclear on what you are trying to capture. I think this regex is what you want:
^SQL> (--L-[0-9]{8})(.*?)(?=SQL> --L-[0-9]{8}|\Z)
In either case, running that regex with the example string yields:
>>> pat=re.compile(r'^SQL> (--L-[0-9]{8})\s*(.*?)\s*(?=SQL> --L-[0-9]{8}|\Z)', re.S | re.M)
>>> with open(file_name) as f:
... di={m.group(1):m.group(2) for m in pat.finditer(f.read())}
...
>>> di
{'--L-52852243': 'SQL> \nSQL> SELECT log_mode FROM v;\n\n LOG_MODE\n ------------\n NOARCHIVELOG\n\nSQL> \nSQL> archive log list\n Database log mode No Archive Mode\n Automatic archival Disabled\n Archive destination USE_DB_RECOVERY_FILE_DEST\n Oldest online log sequence 3\n Current log sequence 5\nSQL>',
'--L-93752133': 'SQL> --SELECT table_name, tablespace_name from dba_tables where upper(table_name) like &tablename_from_developer;\nSQL>',
'--L-42127143': 'SQL> \nSQL> SELECT t.name TSName, e.encryptionalg Algorithm, d.file_name File Name\n 2 FROM v t\n 3 , v e\n 4 , dba_data_files d\n 5 WHERE t.ts# = e.ts#\n 6 AND t.name = d.tablespace_name;\n\n no rows selected'}
Something like this?
with open(rawFile, 'r') as inFile:
    content = inFile.read()
rawList = content.splitlines()

keyed_dict = {}
in_between_lines = ""
last_key = 0
pattern = re.compile(r'^--L-[0-9]{8}')

for line in rawList:
    cleanLine = re.sub('^SQL> ', '', line)
    if pattern.search(cleanLine) is not None:
        itemID = pattern.search(cleanLine)
        # a new key starts here: store the block collected for the previous key
        if last_key:
            keyed_dict[last_key] = in_between_lines
        last_key = itemID.group(0)
        in_between_lines = ""
    else:
        in_between_lines += cleanLine + "\n"  # keep line breaks so the original spacing survives

# store the block collected after the last key
if last_key:
    keyed_dict[last_key] = in_between_lines

Python replace string function throws asterisk wildcard error

When I use * I receive the error:
raise error, v # invalid expression
error: nothing to repeat
Other wildcard characters such as ^ work fine.
The line of code:
df.columns = df.columns.str.replace('*agriculture', 'agri')
I am using pandas and Python.
edit:
When I try using / to escape, the wildcard does not work as I intend:
In [44]: df = pd.DataFrame(columns=['agriculture', 'dfad agriculture df'])
In [45]: df
Out[45]:
Empty DataFrame
Columns: [agriculture, dfad agriculture df]
Index: []
In [46]: df.columns.str.replace('/*agriculture*', 'agri')
Out[46]: Index([u'agri', u'dfad agri df'], dtype='object')
I thought the wildcard should output Index([u'agri', u'agri'], dtype='object').
edit:
I am currently using hierarchical columns and would like to only replace agri for that specific level (level = 2).
original:
df.columns[0] = ('grand total', '2005', 'agriculture')
df.columns[1] = ('grand total', '2005', 'other')
desired:
df.columns[0] = ('grand total', '2005', 'agri')
df.columns[1] = ('grand total', '2005', 'other')
I'm looking at this link right now: Changing columns names in Pandas with hierarchical columns
and that author says it will get easier at 0.15.0 so I am hoping there are more recent updated solutions
You need to put the asterisk * at the end, after something it can repeat, so that it matches the preceding character 0 or more times; see the docs:
In [287]:
df = pd.DataFrame(columns=['agriculture'])
df
Out[287]:
Empty DataFrame
Columns: [agriculture]
Index: []
In [289]:
df.columns.str.replace('agriculture*', 'agri')
Out[289]:
Index(['agri'], dtype='object')
EDIT
Based on your new and actual requirements, you can use str.contains to find the matching columns, build a dict mapping the old names to the new ones, and then call rename:
In [307]:
matching_cols = df.columns[df.columns.str.contains('agriculture')]
df.rename(columns = dict(zip(matching_cols, ['agri'] * len(matching_cols))))
Out[307]:
Empty DataFrame
Columns: [agri, agri]
Index: []
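For the hierarchical (MultiIndex) columns in your edit, a minimal sketch (assuming the category names sit in the third level, i.e. level=2) could be:
import pandas as pd

df = pd.DataFrame(columns=pd.MultiIndex.from_tuples([
    ('grand total', '2005', 'agriculture'),
    ('grand total', '2005', 'other'),
]))

# map every level-2 label containing 'agriculture' to 'agri', then rename only that level
mapping = {name: 'agri' for name in df.columns.levels[2] if 'agriculture' in name}
df = df.rename(columns=mapping, level=2)
print(df.columns.tolist())
# [('grand total', '2005', 'agri'), ('grand total', '2005', 'other')]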