Change the delimiter in AWS Glue Pyspark

from datetime import datetime
from pyspark.sql.functions import lit

abv_data = glueContext.create_dynamic_frame_from_options(
    "s3",
    {"paths": ["s3://{}/{}".format(bucket, prefix)],
     "recurse": True, "groupFiles": "inPartition"},
    "csv",
    {"withHeader": True},
    separator="\t")
abv_df_1 = abv_data.toDF()
abv_df_2 = abv_df_1.withColumn("save_date", lit(datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S")))
conparms_r = glueContext.extract_jdbc_conf("reporting", catalog_id = None)
abv_df_2.write\
.format("com.databricks.spark.redshift")\
.option("url", "jdbc:redshift://rs_cluster:8192/rptg")\
.option("dbtable", redshift_schema_table_output)\
.option("user", conparms_r['user'])\
.option("password", conparms_r['password'])\
.option("aws_iam_role", "arn:aws:iam::123456789:role/redshift_admin_role")\
.option("tempdir", args["TempDir"])\
.option("extracopyoptions","DELIMITER '\t' IGNOREHEADER 1 DATEFORMAT AS 'YYYY-MM-DD'")\
.mode("append")\
.save()
The CSV has a tab delimiter on read, but when I add the column to the DataFrame it uses a comma delimiter, and that is causing the Redshift load to fail.
Is there a way to add the column with a tab delimiter, OR to change the delimiter on the entire data frame?

This isn't necessarily the way to do this, but here is what I ended up doing:
Bring the CSV in with a ',' separator:
glueContext.create_dynamic_frame_from_options(
    "s3",
    {"paths": ["s3://{}/{}".format(bucket, prefix)],
     "recurse": True, "groupFiles": "inPartition"},
    "csv",
    {"withHeader": True},
    separator=",")
Then split the first column on the tab character, add each piece as its own column, and add the extra column at the same time.
Drop the first column, because it is still the combined column.
This gives you a comma-separated DataFrame to load; a sketch of that step follows.
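A minimal sketch of that split step, assuming the combined column arrives as col0 and using hypothetical field names (neither name comes from the original job):

from datetime import datetime
from pyspark.sql.functions import split, col, lit

# the whole tab-delimited record landed in one column; split it on the tab character
parts = split(col("col0"), "\t")                # "col0" is a placeholder column name
field_names = ["id", "name", "amount"]          # hypothetical field names
abv_df_2 = abv_df_1
for i, field in enumerate(field_names):
    abv_df_2 = abv_df_2.withColumn(field, parts.getItem(i))
abv_df_2 = abv_df_2 \
    .withColumn("save_date", lit(datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S"))) \
    .drop("col0")                               # the first column is still the combined one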

Use spark.read.option("delimiter", "\t").csv(file), or sep instead of delimiter.
For the special character, use a double backslash: spark.read.option("delimiter", "\\t").csv(file)
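For example, a minimal sketch reading the same prefix with the plain Spark CSV reader inside the Glue job (bucket and prefix are the variables from the question; treating the first row as a header is an assumption):

spark = glueContext.spark_session

# "sep" (or "delimiter") tells the CSV reader to split columns on tabs
spark_df = spark.read \
    .option("header", True) \
    .option("sep", "\t") \
    .csv("s3://{}/{}".format(bucket, prefix))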


Redshift Copy with Newline Embedded in Quotes

Trying to copy data from S3 to Redshift with a newline within quotes.
Example CSV file:
Line 1 --> ID,Description,flag
Line 2 --> "1111","this is a test", "FALSE"
Line 3 --> "2222","I hope someone
could help", "TRUE"
Line 4 --> "3333", "NA", "FALSE"
Sample Table:
TEST_TABLE:
ID VARCHAR(100)
DESCRIPTION VARCHAR(100)
FLAG VARCHAR(100)
If you notice, in line 2 there is a linefeed, and I get the error "Delimited value missing end quote" when using the COPY command.
This is the Copy command I use:
copy table_name
from sample.csv
credentials aws_access_key_id= blah; aws_secret_access_key=blah
DELIMITER ','
removequotes
trimblanks
ESCAPE ACCEPTINVCHARS
EMPTYASNULL
IGNOREHEADER 1
COMPUPDATE OFF;
commit;
I've also tried the CSV option, but get "Extra column(s) found":
copy table_name
from sample.csv
credentials aws_access_key_id= blah; aws_secret_access_key=blah
CSV
IGNOREHEADER 1
COMPUPDATE OFF;
commit;
I would expect the description column in Line 2 to be loaded with the linefeed.
Since the field is delimited by quotes, use the CSV option.
Note: CSV cannot be used with FIXEDWIDTH, REMOVEQUOTES, or ESCAPE.

Using AWS Athena to query one line from csv file in s3 to query and export list

I need to select only one line, the last line, from many multi-line CSV files and add them to a table in AWS Athena, and then export them to a CSV as a whole list.
I am trying to collect data from many sources and the CSV files are updated weekly, but I only need one line from each file. I have used the standard import to Athena and it imports all lines from the selected CSVs in the bucket, but I need only the last line of each, so that I have the most recent data from that file.
CREATE EXTERNAL TABLE IF NOT EXISTS inventory.laptops (
`date` string,
`serialnum` string,
`biosver` string,
`machine` string,
`manufacturer` string,
`model` string,
`win` string,
`winver` string,
`driveletter` string,
`size` string,
`macaddr` string,
`domain` string,
`ram` string,
`processor` string,
`users` string,
`fullname` string,
`location` string,
`lastconnected` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'quoteChar' = '"',
'field.delim' = ','
) LOCATION 's3://my-s3-bucket/'
TBLPROPERTIES ('has_encrypted_data'='false', 'skip.header.line.count'='1');
I need the last line from each CSV file in S3, but I get every line using this creation query.
Yes, CREATE TABLE defines how to read the file. You will need to craft a SELECT statement to retrieve the desired line. You will need to use some identifier in the file that can indicate the last line, such as having the latest date.
For example, if the last line always has the most recent date, you could use:
SELECT *
FROM inventory.laptops
ORDER BY date DESC
LIMIT 1
If there is no field that can be used to identify the last line, you might need to cheat by finding out the number of lines in the file, then skipping over all but the last line using skip.header.line.count.
Normally, the order of rows in a file is unimportant.
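If the goal is also to export the selected rows as a CSV, one hedged option (not part of the answer above) is to run the query with boto3: Athena writes each query's result set as a CSV named <QueryExecutionId>.csv under the configured output location. The database name, query, and S3 paths below are placeholders:

import time
import boto3

athena = boto3.client('athena')

# start the query; Athena will drop the result CSV under OutputLocation
resp = athena.start_query_execution(
    QueryString='SELECT * FROM inventory.laptops ORDER BY date DESC LIMIT 1',
    QueryExecutionContext={'Database': 'inventory'},
    ResultConfiguration={'OutputLocation': 's3://my-s3-bucket/athena-results/'})
query_id = resp['QueryExecutionId']

# poll until the query finishes; the CSV then sits at athena-results/<query_id>.csv
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)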
So this is impossible directly, but you can create a Lambda function that concatenates the last line of each CSV file in a bucket prefix into a single CSV, and then import that into Athena for querying. I used Python to solve this.
import logging
import os
import boto3

logger = logging.getLogger()
logger.setLevel(logging.INFO)

s3 = boto3.client('s3')

def lambda_handler(event, context):
    data = ''
    header_written = False
    # the bucket name comes from a Lambda environment variable
    bucket_name = os.environ['s3_bucket']
    # list the objects under the prefix
    obj_list = s3.list_objects_v2(Bucket=bucket_name, Prefix='bucket prefix')
    for item in obj_list['Contents']:
        obj = s3.get_object(Bucket=bucket_name, Key=item['Key'])
        # read the csv and keep only non-empty lines
        lines = [l for l in obj['Body'].read().split(b'\n') if l.strip()]
        if not lines:
            continue
        if not header_written:
            # keep the header row from the first file only
            data += lines[0].decode() + '\n'
            header_written = True
        # append the last data line of this file
        data += lines[-1].decode() + '\n'
    s3.put_object(Bucket=bucket_name, Key='Concat.csv', Body=data)

Load array field in csv data file into Athena table

This is a sample row in the input data file, with two fields: dept and names.
dept,names
Mathematics,[foo,bar,alice,bob]
Here, 'names' is an array of strings, and I want to load it into Athena as an array of strings.
Any suggestions?
To have a valid CSV file, make sure you put quotes around your array:
Mathematics,"[foo,bar,alice,bob]"
If you can remove the "[" and "]" the solution below becomes even easier and you can just split without the regex.
Better: Mathematics,"foo,bar,alice,bob"
First create a simple table from CSV with just strings:
CREATE EXTERNAL TABLE IF NOT EXISTS test.mydataset (
`dept` string,
`names` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ',',
'quoteChar' = '"',
"separatorChar" = ',',
'collection.delim' = ',',
'mapkey.delim' = ':'
) LOCATION 's3://<your location>'
TBLPROPERTIES ('has_encrypted_data'='false')
Then create a view which uses a regex to remove your '[' and ']' characters, then splits the rest by ',' into an array.
CREATE OR REPLACE VIEW mydataview AS
SELECT dept,
split(regexp_extract(names, '^\[(.*)\]$', 1), ',') as names
FROM mydataset
Then use the view for your queries. I am not 100% sure as I've only spent like 12 hours using Athena.
--
Note that in order to use the quotes, you need to use OpenCSVSerde; the 'lazyserde' won't work, as it does not support quotes. lazyserde DOES support internal arrays, but you can't use ',' as the separator in that case. If you want to try that, your data would look like:
Better: Mathematics,foo|bar|alice|bob
In that case this MIGHT work directly:
CREATE EXTERNAL TABLE IF NOT EXISTS test.mydataset (
`dept` string,
`names` array<string>
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ',',
'collection.delim' = '|',
'mapkey.delim' = ':'
) LOCATION 's3://<your location>'
TBLPROPERTIES ('has_encrypted_data'='false')
Note how collection.delim = '|', which should translate your field directly to an array.
Sorry, I don't have time to test this; I'll be happy to update my answer if you can confirm what works. Hopefully this gets you started.

Redshift - Load data which has newline in field

I am trying to load data that includes a newline within a field:
001|myname|fav\
movie | myaddress| myphone|
There is a blank line between fav\movie.
I am loading the data with this command:
COPY catdemo
FROM 's3://tickit/catego.csv'
IAM_ROLE 'arn:aws:iam::<aws-account-id>:role/<role-name>'
REGION 'ap-south-1'
DELIMITER '|'
ESCAPE
ACCEPTINVCHARS
IGNOREBLANKLINES
NULL AS '\0';
I want to ignore this blank line; can anyone help me?
It is showing "Delimiter not found" between fav\ and movie, but it is actually a single line:
fav\
movie

Updating rrdtool database

My first post here so I hope I have not been too verbose.
I found I was losing datapoints due to only having 10 rows in my rrdtool config and wanted to update from a backup source file with older data.
After fixing the row count, the config was created with:
rrdtool create dailySolax.rrd \
--start 1451606400 \
--step 21600 \
DS:toGrid:GAUGE:172800:0:100000 \
DS:fromGrid:GAUGE:172800:0:100000 \
DS:totalEnerg:GAUGE:172800:0:100000 \
DS:BattNow:GAUGE:1200:0:300 \
RRA:LAST:0.5:1d:1010 \
RRA:MAX:0.5:1d:1010 \
RRA:MAX:0.5:1M:1010
and the update line in python is
newline = ToGrid + ':' + FromGrid + ':' + TotalEnergy + ':' + battNow
UpdateE = 'N:'+ (newline)
print UpdateE
try:
    rrdtool.update(
        "%s/dailySolax.rrd" % (os.path.dirname(os.path.abspath(__file__))),
        UpdateE)
This all worked fine for inputting the original data (from a crontabbed website scrape) but as I said I lost data and wanted to add back the earlier datapoints.
From my backup source I had a plain text file with lines looking like
1509386401:10876.9:3446.22:18489.2:19.0
1509408001:10879.76:3446.99:18495.7:100.0
where the first field is the timestamp. And then used this code to read in the lines for the updates:
with open("rrdRecovery.txt","r") as fp:
for line in fp:
print line
## newline = ToGrid + ':' + FromGrid + ':' + TotalEnergy + ':' + battNow
UpdateE = line
try:
rrdtool.updatev(
"%s/dailySolax.rrd" % (os.path.dirname(os.path.abspath(__file__))),
UpdateE)
When it did not work correctly with a copy of the current version of the database I tried again on an empty database created using the same config.
In each case the update results only in the timestamp data in the database and no data from the other fields.
Python is not complaining and I expected
1509386401:10876.9:3446.22:18489.2:19.0
would update the same as does
N:10876.9:3446.22:18489.2:19.0
The dump shows the lastupdate data for all fields but then this for the rra database
<!-- 2017-10-31 11:00:00 AEDT / 1509408000 --> <row><v>NaN</v><v>NaN</v><v>NaN</v><v>NaN</v></row>
Not sure if I have a python issue - more likely a rrdtool understanding problem. Thanks for any pointers.
The problem you have is that RRDTool timestamps must be increasing. This means that, if you increase the length of your RRAs (back into the past), you cannot put data directly into these points - only add new data onto the end as time increases. Also, when you create a new RRD, the 'last update' time defaults to NOW.
If you have a log of your previous timestamp, then you should be able to add this history, as long as you don't do any 'now' updates before you finish doing so.
First, create the RRD, with a 'start' time earlier than the first historical update.
Then, process all of the historical updates in chronological order, with the appropriate timestamps.
Finally, you can start doing your regular 'now' updates.
I suspect what has happened is that you had your regular cronjob adding in new data before you have run all of your historical data input - or else you created the RRD with a start time after your historical timestamps.
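A minimal sketch of that recovery sequence, assuming the DS/RRA layout from the question and a rrdRecovery.txt already sorted by timestamp (the --start value is an assumption: anything earlier than the first backup timestamp works):

import rrdtool

# 1. Re-create the RRD with a start time earlier than the first historical sample.
rrdtool.create(
    "dailySolax.rrd",
    "--start", "1509386400",
    "--step", "21600",
    "DS:toGrid:GAUGE:172800:0:100000",
    "DS:fromGrid:GAUGE:172800:0:100000",
    "DS:totalEnerg:GAUGE:172800:0:100000",
    "DS:BattNow:GAUGE:1200:0:300",
    "RRA:LAST:0.5:1d:1010",
    "RRA:MAX:0.5:1d:1010",
    "RRA:MAX:0.5:1M:1010")

# 2. Replay the historical samples in chronological order, timestamps and all.
with open("rrdRecovery.txt") as fp:
    for line in fp:
        line = line.strip()        # drop the trailing newline before handing it to rrdtool
        if line:
            rrdtool.update("dailySolax.rrd", line)

# 3. Only after the replay is finished, resume the regular 'N:...' cron updates.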