{
  "completed_in": 0.031,
  "max_id": 122078461840982016,
  "max_id_str": "122078461840982016",
  "indices": [
    37,
    57
  ]
}
How do I extract the indices values into a dataset?
Thanks in advance
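One way (not from the thread) is to parse the payload with the standard json module and hand the indices list to pandas. A minimal sketch, assuming the JSON above is held in a string named raw:
import json
import pandas as pd

# Hypothetical setup: raw holds the JSON payload shown above.
raw = '''{"completed_in": 0.031,
          "max_id": 122078461840982016,
          "max_id_str": "122078461840982016",
          "indices": [37, 57]}'''

parsed = json.loads(raw)            # dict with completed_in, max_id, indices, ...
indices = parsed['indices']         # -> [37, 57]
df = pd.DataFrame({'indices': indices})
print(df)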
I'm trying to recode a list of values using Pyspark to create a new column. I've set my mapping up with nested dictionaries, but can't get the mapping syntax figured out. The original data has several string values that need to get recoded to a new value, then I want to give the column a new name. The original column values will get grouped several different ways to create different new columns.
The df will have several thousand columns, so I need the code to be as efficient as possible.
I have a different scenario with a 1-1 mapping where I was able to create my expression with:
expr = [create_map([lit(x) for x in chain(*values.items())])[orig_df[key]].cast(IntegerType()).alias('new_name')
        for key, values in my_dict.items() if key in orig_df.columns]
I just can't figure out the syntax for mapping the many to one.
Here's what I've tried:
grouping_dict = {'orig_col_n': {'new_col_n_a': {'20': ['011','012','013'], '30': ['014','015','016']},
                                'new_col_n_b': {'25': ['011','013','015'], '35': ['012','014','016']}}}
expr = [f.when(f.col(key) == f.lit(old_val), f.lit(new_value))
         .cast(IntegerType())
         .alias(new_var_name)
        for key, new_var_names_dict in grouping_dict.items()
        for new_var_name, mapping_dict in new_var_names_dict.items()
        for new_value, old_value_list in mapping_dict.items()
        for old_val in old_value_list
        if key in original_df.columns]
new_df = original_df.select(*expr)
This expression isn't quite right: it creates multiple columns with the same name as it loops through the values that need to be mapped.
Any suggestions for restructuring my dictionary or how to fix my syntax would be greatly appreciated!
The desired mapping result (shown in the attached screenshots) is:
orig_col_n new_col_n_a new_col_n_b
011 20 25
012 20 35
013 20 25
014 30 35
015 30 25
016 30 35
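For what it's worth, here is a hedged sketch (untested, assuming the grouping_dict layout above) that builds one chained when() expression per new column, using isin() for the many-to-one lookup, so the select yields exactly one column per new name:
import pyspark.sql.functions as f
from pyspark.sql.types import IntegerType

exprs = []
for key, new_var_names_dict in grouping_dict.items():
    if key not in original_df.columns:
        continue
    for new_var_name, mapping_dict in new_var_names_dict.items():
        col_expr = None  # fold the many-to-one rules into a single expression
        for new_value, old_value_list in mapping_dict.items():
            cond = f.col(key).isin(old_value_list)
            col_expr = (f.when(cond, f.lit(new_value)) if col_expr is None
                        else col_expr.when(cond, f.lit(new_value)))
        exprs.append(col_expr.cast(IntegerType()).alias(new_var_name))

new_df = original_df.select('*', *exprs)
Because each new column is assembled as one expression before it reaches select, the duplicate-column problem from the flat comprehension goes away.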
I am using the Python API of SAS, and have uploaded a table by:
s.upload("./data/hmeq.csv", casout=dict(name=tbl_name, replace=True))
I can see the details of the table by s.tableinfo().
TableInfo
Name Rows Columns IndexedColumns Encoding CreateTimeFormatted ModTimeFormatted AccessTimeFormatted JavaCharSet CreateTime ... Repeated View MultiPart SourceName SourceCaslib Compressed Creator Modifier SourceModTimeFormatted SourceModTime
0 HMEQ 5960 13 0 utf-8 2020-02-10T16:48:02-05:00 2020-02-10T16:48:02-05:00 2020-02-10T21:10:34-05:00 UTF8 1.896990e+09 ... 0 0 0 0 aforoo 2020-02-10T16:48:02-05:00 1.896990e+09
1 rows × 23 columns
But I cannot access any value of the table in Python. For example, assume I want to get the number of rows and columns as Python scalars. I know that I can get SAS tables into pandas tables by using pd.DataFrame, but it does not work for this table and I get:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
346 dtype=dtype, copy=copy)
347 elif isinstance(data, dict):
--> 348 mgr = self._init_dict(data, index, columns, dtype=dtype)
349 elif isinstance(data, ma.MaskedArray):
350 import numpy.ma.mrecords as mrecords
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in _init_dict(self, data, index, columns, dtype)
457 arrays = [data[k] for k in keys]
458
--> 459 return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
460
461 def _init_ndarray(self, values, index, columns, dtype=None, copy=False):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
7354 # figure out the index, if necessary
7355 if index is None:
-> 7356 index = extract_index(arrays)
7357
7358 # don't force copy because getting jammed in an ndarray anyway
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in extract_index(data)
7391
7392 if not indexes and not raw_lengths:
-> 7393 raise ValueError('If using all scalar values, you must pass'
7394 ' an index')
7395
ValueError: If using all scalar values, you must pass an index
I have the same issue with any other casout table in SAS. I appreciate any help or comments.
I would suggest you use pandas directly to read from SAS.
Reference from another answer: Read SAS file with pandas
Here is another example
https://www.marsja.se/how-to-read-sas-files-in-python-with-pandas/
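For example, a minimal sketch with pd.read_sas, assuming you have a local sas7bdat file (the file name below is hypothetical; for the CSV in the question, pd.read_csv would apply instead):
import pandas as pd

df = pd.read_sas('hmeq.sas7bdat', format='sas7bdat', encoding='utf-8')  # hypothetical file
n_rows, n_cols = df.shape  # plain Python scalars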
I found the solution below and it works fine. For example, here I have used dataSciencePilot.exploreData action and I can get the results by:
# run an action that writes its results to a CAS output table
casout = dict(name='out1', replace=True)
s.dataSciencePilot.exploreData(table=tbl_name, target='bad', casout=casout)

# fetch the output table back to the client and wrap it in pandas
fetch_opts = dict(maxrows=100000000, to=1000000)
df = s.fetch(table='out1', **fetch_opts)['Fetch']
features = pd.DataFrame(df)
type(features)
which returns pandas.core.frame.DataFrame.
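As a hedged follow-up (untested sketch), the tableinfo result shown earlier is also a results dict whose 'TableInfo' entry behaves like a DataFrame, so the row and column counts can be pulled out as Python scalars the same way:
ti = s.tableinfo(name=tbl_name)['TableInfo']
n_rows = int(ti['Rows'][0])     # 5960 for the HMEQ table above
n_cols = int(ti['Columns'][0])  # 13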
I have a Pandas Dataframe with data as below
id, name, date
[101],[test_name],[2019-06-13T13:45:00.000Z]
[103],[test_name3],[2019-06-14T13:45:00.000Z, 2019-06-14T17:45:00.000Z]
[104],[],[]
I am trying to convert it to a format as below with no square brackets
Expected output:
id, name, date
101,test_name,2019-06-13T13:45:00.000Z
103,test_name3,2019-06-14T13:45:00.000Z, 2019-06-14T17:45:00.000Z
104,,
I tried using regex as below, but it gave me an error: TypeError: expected string or bytes-like object
re.search(r"\[([A-Za-z0-9_]+)\]", df['id'])
I figured out that I am able to extract the data using:
df['id'].str.get(0)
Loop through the data frame to access each string, then use:
newstring = oldstring[1:len(oldstring)-1]
to replace the cell in the dataframe.
Try looping through columns:
for col in df.columns:
    df[col] = df[col].str[1:-1]
Or use apply if duplicating your data is not a problem:
df = df.apply(lambda x: x.str[1:-1])
Output:
id name date
0 101 test_name 2019-06-13T13:45:00.000Z
1 103 test_name3 2019-06-14T13:45:00.000Z, 2019-06-14T17:45:00....
2 104
Or if you want to use regex, you need the str accessor and extract:
df.apply(lambda x: x.str.extract(r'\[([A-Za-z0-9_]+)\]'))
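A self-contained sketch of the slicing approach, assuming the cells are plain strings that literally contain the brackets:
import pandas as pd

df = pd.DataFrame({
    'id':   ['[101]', '[103]', '[104]'],
    'name': ['[test_name]', '[test_name3]', '[]'],
    'date': ['[2019-06-13T13:45:00.000Z]',
             '[2019-06-14T13:45:00.000Z, 2019-06-14T17:45:00.000Z]',
             '[]'],
})
df = df.apply(lambda x: x.str[1:-1])  # drop the leading and trailing bracket
print(df)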
I'm trying to read the discharge data of 346 US rivers stored online in textfiles. The files are more or less in this format:
Measurement_number Date Gage_height Discharge_value
1 2017-01-01 10 1000
2 2017-01-20 15 2000
# etc.
I only want to read the gage height and discharge value columns.
The problem is that in most files additional columns with metadata are added in front of the 'Gage_height' column, so I cannot simply read the 3rd and 4th columns because their index varies.
I'm trying to find a way to say 'read the columns with the name 'Gage_height' and 'Discharge_value'', but I haven't succeeded yet.
I hope anyone can help. I'm currently trying to load the text files with numpy.genfromtxt so it would be great to find a solution with that package but other solutions are also more than welcome.
This is my code so far
import urllib2
import numpy as np
data_url = urllib2.urlopen(url)  # url points to this specific site
data = np.genfromtxt(data_url, skip_header=1, comments='#', usecols=[2, 3])
You can use the names=True option to genfromtxt, and then use the column names to select which columns you want to read with usecols.
For example, to read 'Gage_height' and 'Discharge_value' from your data file:
data = np.genfromtxt(filename, names=True, usecols=['Gage_height', 'Discharge_value'])
Note that you don't need to set skip_header=1 if you use names=True.
You can then access the columns using their names:
gage_height = data['Gage_height'] # == array([ 10., 15.])
discharge_value = data['Discharge_value'] # == array([ 1000., 2000.])
See the numpy.genfromtxt docs for more information.
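Putting it together for many files, a hedged sketch (urls is a hypothetical list of the 346 station file URLs; urllib.request is the Python 3 counterpart of the urllib2 call in the question):
import numpy as np
from urllib.request import urlopen

for url in urls:  # urls supplied elsewhere
    data = np.genfromtxt(urlopen(url), names=True, comments='#',
                         usecols=['Gage_height', 'Discharge_value'])
    gage_height = data['Gage_height']
    discharge_value = data['Discharge_value']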
Suppose I have a tabular column as below. I want to extract the data column-wise. I tried extracting the data by creating a list, but it only extracts the first row correctly; from the second row onwards there is leading space (i.e., under CEN/4), so my code treats -5.000000E-01 as the zeroth column from the second row and starts reading from there. How do I extract the data correctly column-wise? The output is scrambled.
0 1 25 CEN/4 -5.000000E-01 -3.607026E+04 -5.747796E+03 -8.912796E+02 -88.3178
5.000000E-01 3.607026E+04 5.747796E+03 8.912796E+02 1.6822
27 -5.000000E-01 -3.641444E+04 -5.783247E+03 -8.912796E+02 -88.3347
5.000000E-01 3.641444E+04 5.783247E+03 8.912796E+02 1.6653
28 -5.000000E-01 -3.641444E+04 -5.712346E+03 -8.912796E+02 -88.3386
5.000000E-01 3.641444E+04 5.712346E+03 8.912796E+02
My code is:
f1 = open('newdata1.txt', 'w')
L = []
for index, line in enumerate(open('Trial_1.txt', 'r')):
    if index < 5:  # skip the first 5 header lines
        continue
    line = line.split()
    L.append('%s\t%s\t%s\n' % (line[0], line[1], line[2]))
f1.writelines(L)
f1.close()
My output looks like this:
0 1 CEN/4 -5.000000E-01 -5.120107E+04
5.000000E-01 5.120107E+04 1.028093E+04 5.979930E+03 8.1461
I want columnar data as it is in the file. How do I do that? I am a beginner.
It's hard to tell from the way the input data is presented in your question, but I'm guessing your file uses tabs to separate columns. In any case, consider using Python's csv module with the relevant delimiter, like:
import csv

with open('input.csv') as f_in, open('newdata1', 'w') as f_out:
    reader = csv.reader(f_in, delimiter='\t')
    writer = csv.writer(f_out, delimiter='\t')
    for row in reader:
        writer.writerow(row)
See the Python csv module documentation for further details.
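If the file is actually whitespace-delimited with a variable number of leading label columns, as the sample suggests, another hedged option (not from the thread) is to realign from the right, assuming the value columns always come last:
with open('Trial_1.txt') as f_in, open('newdata1.txt', 'w') as f_out:
    for line in f_in:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or short lines
        f_out.write('\t'.join(fields[-4:]) + '\n')  # keep the last four fields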