Error when casting pyarrow table to custom schema

I am trying to write a dataframe to a pyarrow table and then cast that table to a custom schema. I can read the dataframe into a pyarrow table, but when I cast it to the custom schema I run into an error.
Some context: filesCsv is a list of CSV filenames (with absolute paths).
new_schema = pa.schema([
    pa.field('termination_date', pa.timestamp("us")),
    pa.field('Employee_ID', pa.string()),
    pa.field('Full_Legal_Name', pa.string()),
    pa.field('First_Name', pa.string()),
    pa.field('Middle_Name', pa.string()),
    pa.field('Last_Name', pa.string()),
    pa.field('businessTitle', pa.string()),
    pa.field('primaryWorkEmail', pa.string()),
    pa.field('Supervisory_Organization', pa.string()),
    pa.field('Primary_Work_Mobile_Phone', pa.string()),
    pa.field('Cost_Center', pa.string()),
    pa.field('date_extracted', pa.timestamp("us")),
])
timestmp = '00:00:00'
for file in filesCsv:
    filenamep = file.split('_')  ## reformat the date and drop the timestamp from the file name; this date will drive the hive partitioning of the dataset
    filenamep.pop(-1)
    hiveDate = filenamep[-1]
    ##print(hiveDate)
    hiveDate = hiveDate[:4] + '-' + hiveDate[4:6] + '-' + hiveDate[6:] + ' ' + timestmp  ### builds a date string as "YYYY-MM-DD HH:MM:SS"
    ##print(hiveDate)
    hiveDate = datetime.datetime.strptime(hiveDate, '%Y-%m-%d %H:%M:%S')
    ##print(type(hiveDate))
    df_csv = pd.read_csv(file)
    ##print(type(df_csv))
    date_extracted = pd.Series(hiveDate)  ### dtype='datetime64[ns]'; pd.Series takes a list, and we use it because the list is shorter than the dataframe
    ##print(date_extracted)
    ##print(type(date_extracted))
    df_csv.loc[:, 'date_extracted'] = pd.Series(date_extracted)
    df_csv['date_extracted'].ffill(inplace=True)
    df_csv.columns
    print(df_csv.dtypes)  ## list columns and data types
    ##print(type(df_csv))
    pq_tbl2 = pa.Table.from_pandas(df_csv)
    pq_tbl2.schema
    pq_tbl2 = pq_tbl2.cast(target_schema=new_schema)
    ##pq_file_name = ''.join(filenamep) + '.parquet'
    ##print(pq_file_name)
    ##pq.write_table(pq_tbl2, pq_file_name, use_deprecated_int96_timestamps=True)
I am getting this strange error. I have tried the same cast on a pyarrow table read from a parquet file and saw no error. I want to do this step directly: CSV file to dataframe to pyarrow table.
ValueError: Target schema's field names are not matching the table's field names: ['401 : invalid username or password', 'date_extracted'], ['termination_date', 'Employee_ID', 'Full_Legal_Name', 'First_Name', 'Middle_Name', 'Last_Name', 'businessTitle', 'primaryWorkEmail', 'Supervisory_Organization', 'Primary_Work_Mobile_Phone', 'Cost_Center', 'date_extracted']
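Judging by the table's field names in the error, at least one CSV appears to contain an HTTP error response ('401 : invalid username or password') rather than the expected export, so pandas builds a one-column dataframe and the cast fails on mismatched field names. A minimal sketch of a pre-cast check, assuming the new_schema defined above is in scope (the helper name and the ValueError handling are my own illustration, not part of the original code):

import pandas as pd
import pyarrow as pa

EXPECTED_COLUMNS = [name for name in new_schema.names if name != 'date_extracted']

def table_from_csv(csv_path, extracted_ts):
    """Read one CSV, check its columns, attach date_extracted and cast to new_schema."""
    df = pd.read_csv(csv_path)
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    if missing:
        # Likely an error page or a partial download rather than a real export.
        raise ValueError(f"{csv_path}: missing columns {sorted(missing)}, got {list(df.columns)}")
    df['date_extracted'] = extracted_ts
    # Reorder the columns so they line up with the target schema before casting.
    df = df[list(new_schema.names)]
    tbl = pa.Table.from_pandas(df, preserve_index=False)
    return tbl.cast(new_schema)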

Related

Informatica - SQL query to extract data from JSON format value

My source is the Oracle table 'SAMPLE', which contains the JSON value in the column REF_VALUE; this becomes the source in the Informatica mapping. The data looks as below:
PROPERTY | REF_VALUE
CTMappings | {CTMappings": [
{"CTId":"ABCDEFGHI_T2","EG":"ILTS","Type":"K1J10200"},
{"CTId":"JKLMNOPQR_T1","EG":"LAM","Type":"K1J10200"}
"}]}"
I have the following SQL query to explode the JSON data into rows:
select
    substr(CTId, 1, 9) as ID,
    substr(CTId, 9, 11) as VERSION,
    type as INTR_TYP,
    eg as ENTY_ID
from
(
    select jt.*
    from SAMPLE,
    JSON_TABLE (ref_value, '$.Mappings[*]'
        columns (
            CTId varchar2(50) path '$.formId',
            eg varchar2(50) path '$.eg',
            type varchar2(50) path '$.Type'
        )) jt
    where trim(pref_property) = 'Ellipse.Integration.FSM.FormMappings')
The resulting table is as below:
CTiD VERSION EG Typ
======== ======= ==== ======
ABCDEFGHI T2 ILTS K1J102001
KLMNOPQR T1 LAM K1J102000
which is the required output (JSON into rows).
I'm new to Informatica, so I have used 'SAMPLE' as the source table, and I now want to use this query to extract the data into row format in Informatica, but I don't know how to proceed. Please refer to the attached image.
If anyone answers quickly, it will be a great help.
Declare the fields in Source Qualifier and use SQL Override property to place your query - that should do the trick.

AWS athena query gzip format, missing column name in first row

I am generating a gzipped CSV file using the query below:
CREATE TABLE newgziptable4
WITH (
format = 'TEXTFILE',
write_compression = 'GZIP',
field_delimiter = ',',
external_location = 's3://bucket1/reporting/gzipoutputs4'
) AS
select name , birthdate from "myathena_table";
I am following this link on the AWS website.
The issue is that if I just generate a CSV, I see the column names as the first row of the output CSV. But when I use the above method, I do not see the column names name and birthdate. How can I ensure that I get those in the gz output as well?

Parse timestamp in Hive during table creation

I have a file that looks like this:
33.49.147.163 20140416123526 https://news.google.com/topstories?hl=en-US&gl=US&ceid=US:en 29 409 Firefox/5.0
I want to load it into a hive table. I do it this way:
create external table Logs (
ip string,
ts timestamp,
request string,
page_size smallint,
status_code smallint,
info string
)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties (
"timestamp.formats" = "yyyyMMddHHmmss",
"input.regex" = '^(\\S*)\\t{3}(\\d{14})\\t(\\S*)\\t(\\S*)\\t(\\S*)\\t(\\S*).*$'
)
stored as textfile
location '/data/user_logs/user_logs_M';
And
select * from Logs limit 10;
results in
33.49.147.16 NULL https://news.google.com/topstories?hl=en-US&gl=US&ceid=US:en 29 409 Firefox/5.0
How do I parse the timestamp correctly to avoid these NULLs?
"timestamp.formats" SerDe property works only with LazySimpleSerDe (STORED AS TEXTFILE), it does not work with RegexSerDe. If you are using RegexSerDe, then parse timestamp in a query.
Define ts column as STRING data type in CREATE TABLE and in the query transform it like this:
select timestamp(regexp_replace(ts,'(\\d{4})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})','$1-$2-$3 $4:$5:$6.0')) as ts
Of course, you can extract each part of the timestamp using SerDe as separate columns and properly concatenate them with delimiters in the query to get correct timestamp format, but it will not give you any improvement because anyway you will need additional transformation in the query.
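Just to illustrate what the regexp_replace above produces, here is the same reformatting sketched in Python, using the sample value from the log line in the question:

import re

raw = '20140416123526'
ts = re.sub(r'(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})',
            r'\1-\2-\3 \4:\5:\6.0', raw)
print(ts)  # 2014-04-16 12:35:26.0 -- a format Hive's timestamp() accepts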

AWS Glue dynamic frame - no column headers if no data

I read the Glue catalog table, convert it to a dataframe, and print the schema using the code below (Spark with Python):
dyf = glueContext.create_dynamic_frame.from_catalog(database='database_name',
                                                    table_name='table_name',
                                                    redshift_tmp_dir=args['TempDir'])
df = dyf.toDF()
df.printSchema()
It works fine when the table has data.
But it doesn't print the schema if the table is empty (it is unable to get the schema of an empty table). As a result, the downstream joins fail.
Is there a way to make the dynamic frame pick up the table schema from the catalog even for an empty table, or is there any other alternative?
I found a solution. It is not ideal but it works. If you call apply_mapping() on your DynamicFrame, it will preserve the schema in the DataFrame. For example, if your table has column last_name, you can do:
dyf = glueContext.create_dynamic_frame.from_catalog(database='database_name',
                                                    table_name='table_name',
                                                    redshift_tmp_dir=args['TempDir'])
df = dyf.apply_mapping([
    ("last_name", "string", "last_name", "string")
]).toDF()
df.printSchema()
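If the table has more columns, each one needs its own (source_name, source_type, target_name, target_type) tuple in the mapping list; a quick sketch with made-up column names, just to show the shape of the call:

df = dyf.apply_mapping([
    ("first_name", "string", "first_name", "string"),
    ("last_name",  "string", "last_name",  "string"),
    ("birthdate",  "date",   "birthdate",  "date"),
]).toDF()
df.printSchema()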

Import .csv file to PostgreSQL and add an autoincrementing ID in the first column

I have downloaded a CSV file for testing purposes and would like to upload all of the data to a PostgreSQL database. However, I need to have an autoincrementing ID as the first column of the table. Initially, I created the table with this SQL query:
CREATE TABLE pps3
(
    id integer NOT NULL DEFAULT nextval('products_product_id_seq'::regclass),
    "brandname" character varying(25),
    "type1" integer,
    "type2" integer,
    "type3" integer,
    "Total" integer
)
CSV data:
"brandname","type1","type2","type3","Total"
"brand1","0","0","32","32"
"brand1","0","12","0","12"
I tried to move the data from the CSV with this code:
import csv
import psycopg2

conn = psycopg2.connect("host=localhost dbname=my_django_db user=postgres")
cur = conn.cursor()
with open('PPS-Sep.csv', 'r') as f:
    reader = csv.reader(f)
    next(reader)  # Skip the header row.
    for row in reader:
        cur.execute(
            "INSERT INTO pps3 VALUES (%s, %s, %s, %s, %s)", row)
conn.commit()
This works fine if I do not create the initial ID column.
However, if I run it as is, I get an error message that I am trying to insert the brandname into the ID column.
Any ideas on how to work around this?
Try changing:
INSERT INTO pps3 VALUES (%s, %s, %s, %s, %s)
to
INSERT INTO pps3 (brandname, type1, type2, type3, "Total") VALUES (%s, %s, %s, %s, %s)
When you use INSERT INTO without naming the columns, Postgres expects values for all of the table's columns in their original order, which here includes id.
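For completeness, here is a sketch of the full loop with the column list spelled out; the quoted "Total" matches the case-sensitive identifier from the CREATE TABLE above, and id is left for the DEFAULT nextval(...) to fill:

import csv
import psycopg2

conn = psycopg2.connect("host=localhost dbname=my_django_db user=postgres")
cur = conn.cursor()

with open('PPS-Sep.csv', 'r') as f:
    reader = csv.reader(f)
    next(reader)  # Skip the header row.
    for row in reader:
        # id is omitted, so PostgreSQL fills it from the sequence default.
        cur.execute(
            'INSERT INTO pps3 (brandname, type1, type2, type3, "Total") '
            'VALUES (%s, %s, %s, %s, %s)',
            row,
        )
conn.commit()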