I am trying to use this Django reference, [Streaming large CSV files](https://docs.djangoproject.com/en/2.2/howto/outputting-csv/#streaming-large-csv-files), to download a pandas dataframe as a CSV file.
It requires a generator.
# Generate a sequence of rows. The range is based on the maximum number of
# rows that can be handled by a single sheet in most spreadsheet
# applications.
rows = (["Row {}".format(idx), str(idx)] for idx in range(65536))
If I have a dataframe called my_df (with 20 columns and 10,000 rows), how do I revise this logic to use my_df instead of generating numbers as in the example?
Something like:
response = HttpResponse(content_type='text/csv') # Format response as a CSV
filename = 'some_file_name.csv'
response['Content-Disposition'] = 'attachment; filename="' + filename + '"'# Name the CSV response
my_df.to_csv(response, encoding='utf-8', index=False)
return response
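For the streaming approach from the linked reference, a minimal sketch could look like the following. It assumes my_df is available inside the view (for example, built there), reuses the Echo pseudo-buffer from the Django docs, and iterates over the dataframe rows with itertuples; the view name export_csv is made up.

    import csv
    from itertools import chain
    from django.http import StreamingHttpResponse

    class Echo:
        # Pseudo-buffer from the Django docs: write() just returns the value.
        def write(self, value):
            return value

    def export_csv(request):
        # my_df is assumed to be available here (e.g. built earlier in the view)
        writer = csv.writer(Echo())
        # Yield the header row first, then each dataframe row as a plain list.
        rows = chain([my_df.columns.tolist()],
                     (list(row) for row in my_df.itertuples(index=False)))
        response = StreamingHttpResponse((writer.writerow(r) for r in rows),
                                         content_type='text/csv')
        response['Content-Disposition'] = 'attachment; filename="some_file_name.csv"'
        return response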
I have two CSV files, test1.csv and test2.csv, each containing two columns of values (altitude, time).
test1.csv is quite a bit larger than test2.csv.
I want to compare the altitudes that share the same time.
I have found this piece of code that runs on Python 2:
import csv

with open('test1.csv', 'rb') as master:
    master_indices = dict((r[0], i) for i, r in enumerate(csv.reader(master)))

with open('test2.csv', 'rb') as hosts:
    with open('results.csv', 'wb') as results:
        reader = csv.reader(hosts)
        writer = csv.writer(results)
        writer.writerow(next(reader, []) + ['result'])
        for row in reader:
            index = master_indices.get(row[0])
            if index is not None:
                message = 'Same time is found (row {})'.format(index)
            else:
                message = 'No same time is found'
            writer.writerow(row + [message])
and it works fine: it writes the row index from test1.csv where the same time was found.
The resulting CSV contains the time and altitude from test2.csv plus a message column that shows whether or not there is a match on the time value.
Since I'm quite new to Python, I'm trying to find a way so that the results.csv file also contains the altitude column from test1.csv.
I tried to replicate the above code for the test1.csv file in order to add that column, by adding the following to the existing code:
with open('test1.csv', 'rb') as master:
    with open('results.csv', 'wb') as results:
        writer = csv.writer(results)
        reader2 = csv.reader(master)
        writer.writerow(next(reader2, []) + ['altitude'])
        for row in reader2:
            writer.writerow(row)
But I got a CSV file without the previous result column and with a new but empty altitude column.
So eventually results.csv should contain the following columns:
time, altitude (from test2.csv), altitude (from test1.csv), result
How can this be achieved?
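Not a definitive answer, but a minimal Python 3 sketch of one way to do it, assuming the first column in both files is time and the second is altitude (matching the existing code, which keys on r[0]): store test1.csv's altitude keyed by time, then write it next to each test2.csv row together with the result message. The column name altitude_test1 is made up.

    import csv

    # Map time -> altitude from test1.csv (assumes column order time, altitude).
    with open('test1.csv', 'r', newline='') as master:
        master_altitudes = {r[0]: r[1] for r in csv.reader(master)}

    with open('test2.csv', 'r', newline='') as hosts, \
            open('results.csv', 'w', newline='') as results:
        reader = csv.reader(hosts)
        writer = csv.writer(results)
        writer.writerow(next(reader, []) + ['altitude_test1', 'result'])
        for row in reader:
            altitude = master_altitudes.get(row[0], '')
            message = 'Same time is found' if altitude else 'No same time is found'
            writer.writerow(row + [altitude, message])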
I am trying to read a csv file that is in my S3 bucket. I would like to do some manipulations and then finally convert to a dynamic dataframe and write it back to S3.
This is what I have tried so far:
Pure Python:
import csv

Val1=""
Val2=""
cols=[]
width=[]
with open('s3://demo-ETL/read/data.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    for row in readCSV:
        print(row)
        if ((Val1=="" ) & (Val2=="")):
            Val1=row[0]
            Val2=row[0]
            cols.append(row[1])
            width.append(int(row[4]))
        else:
            continues...
Here I get an error that says it cannot find the file in the directory at all.
Boto3:
import boto3
s3 = boto3.client('s3')
data = s3.get_object(Bucket='demo-ETL', Key='read/data.csv')
contents = data['Body'].read()
print(contents)
for row in content:
if ((Val1=="" ) & (Val2=="")):
Val1=row[0]
Val2=row[0]
cols.append(row[1])
width.append(int(row[4]))
else:
continues...
Here it says the index is out of range, which is strange because I have 4 comma-separated values in the CSV file. When I look at the results from print(contents), I see that it's putting each character into a list instead of putting each comma-separated value into a list.
Is there a better way to read the CSV from S3?
I ended up solving this by reading it as a pandas dataframe. I first created an object with boto3, then read the whole object into a pandas dataframe, which I then converted into a list.
import boto3
import pandas as pd

s3 = boto3.resource('s3')
bucket = s3.Bucket('demo-ETL')
obj = bucket.Object(key='read/data.csv')
dataFrame = pd.read_csv(obj.get()['Body'])
l = dataFrame.values.tolist()
for i in l:
    print(i)
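Since the original goal also included writing the result back to S3, a small sketch of that step with the same boto3 resource might look like this; the destination key write/data_out.csv is only an example.

    import io

    # Serialize the (possibly modified) dataframe to an in-memory CSV buffer
    # and upload it back to the bucket.
    csv_buffer = io.StringIO()
    dataFrame.to_csv(csv_buffer, index=False)
    s3.Object('demo-ETL', 'write/data_out.csv').put(Body=csv_buffer.getvalue())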
get_object returns the Body response value which is of type StreamingBody. Per the docs, if you're trying to go line-by-line you probably want to use iter_lines.
For example:
import boto3
s3 = boto3.client('s3')
data = s3.get_object(Bucket='demo-ETL', Key='read/data.csv')
file_lines = data['Body'].iter_lines()
print(file_lines)
This probably does more of what you want.
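Note that print(file_lines) only shows the generator object; iterating it yields each line as bytes. A small follow-up sketch of turning those lines back into comma-separated values with the csv module:

    import csv

    # Decode each bytes line and let csv.reader split it into fields.
    decoded_lines = (line.decode('utf-8') for line in file_lines)
    for row in csv.reader(decoded_lines):
        print(row)  # each row is a list of the comma-separated values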
You can use Spark to read the file like this:
df = spark.read.\
    format("csv").\
    option("header", "true").\
    load("s3://bucket-name/file-name.csv")
You can find more options here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv
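If the end goal is the dynamic dataframe mentioned in the question, a hedged sketch of converting the Spark dataframe to a Glue DynamicFrame and writing it back to S3 could look like this; it assumes the code runs inside a Glue job that already has a glueContext, and the output path is made up.

    from awsglue.dynamicframe import DynamicFrame

    # Wrap the Spark dataframe in a Glue DynamicFrame and write it out as CSV.
    dyf = DynamicFrame.fromDF(df, glueContext, "dyf")
    glueContext.write_dynamic_frame.from_options(
        frame=dyf,
        connection_type="s3",
        connection_options={"path": "s3://bucket-name/output/"},
        format="csv")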
I need to select only one line, the last line, from many multi-line CSV files, add them to a table in AWS Athena, and then export them to a CSV as a whole list.
I am trying to collect data from many sources; the CSV files are updated weekly, but I only need one line from each file. I have used the standard import to Athena and it imports all lines from the selected CSVs in the bucket, but I need only the last line of each, so that I have the most recent data from that file.
CREATE EXTERNAL TABLE IF NOT EXISTS inventory.laptops (
`date` string,
`serialnum` string,
`biosver` string,
`machine` string,
`manufacturer` string,
`model` string,
`win` string,
`winver` string,
`driveletter` string,
`size` string,
`macaddr` string,
`domain` string,
`ram` string,
`processor` string,
`users` string,
`fullname` string,
`location` string,
`lastconnected` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'quoteChar' = '"',
'field.delim' = ','
) LOCATION 's3://my-s3-bucket/'
TBLPROPERTIES ('has_encrypted_data'='false',"skip.header.line.count"="1");
I need the last line from each CSV file in S3, but I get every line using this creation query.
Yes, CREATE TABLE only defines how to read the file. You will need to craft a SELECT statement that retrieves the desired line, using some identifier in the file that can indicate the last line, such as having the latest date.
For example, if the last line always has the most recent date, you could use:
SELECT *
FROM inventory.laptops
ORDER BY date DESC
LIMIT 1
If there is no field that can be used to identify the last line, you might need to cheat by finding out the number of lines in the file, then skipping over all but the last line using skip.header.line.count.
Normally, the order of rows in a file is unimportant.
So this is impossible directly, but you can create a Lambda function to concatenate the last line of multiple CSV files in a bucket directory, write them to a single CSV, and then import that into Athena for querying. I used Python to solve this.
import logging
import os

import boto3

logger = logging.getLogger()
logger.setLevel(logging.INFO)
s3 = boto3.client('s3')

def lambda_handler(event, context):
    data = ''
    # retrieve the bucket name from the Lambda environment
    bucket_name = os.environ['s3_bucket']
    # list every object under the prefix
    obj_list = s3.list_objects_v2(Bucket=bucket_name, Prefix='bucket prefix')
    x = 0
    for item in obj_list['Contents']:
        # get the object
        obj = s3.get_object(Bucket=bucket_name, Key=item['Key'])
        # split the object body into lines
        lines = obj['Body'].read().split(b'\n')
        f = len(lines)  # number of lines in the file
        b = 0
        for r in lines:
            if x < 1:
                x += 1
                if b == 0:
                    # take the header row from the first file only
                    data += r.decode() + '\n'
            b += 1
            if b == f - 1:
                # last data line (the final split element is empty because
                # the file ends with a newline)
                data += r.decode() + '\n'
    s3.put_object(Bucket=bucket_name, Key='Concat.csv', Body=data)
I have a folder with several 20-million-record tab-delimited files in it. I would like to create a pandas dataframe where I randomly sample, say, 20 thousand records from each file and then append them together in the dataframe. Does anyone know how to do that?
You could read in all the text files in a particular folder and then make use of pandas DataFrame.sample (link to docs).
I've provided a fully reproducible example with two example .txt files, each created with 200 rows. I then take a random sample of ten rows from each file and append the sample to a final dataframe.
import pandas as pd
import numpy as np
import glob

# Change the path for the directory
directory = r'C:\some\example\folder'

# I create two test .txt files for demonstration purposes with 200 rows each
df_test = pd.DataFrame(np.random.randn(200, 2), columns=list('AB'))
df_test.to_csv(directory + r'\test_1.txt', sep='\t', index=False)
df_test.to_csv(directory + r'\test_2.txt', sep='\t', index=False)

df = pd.DataFrame()
for filename in glob.glob(directory + r'\*.txt'):
    df_full = pd.read_csv(filename, sep='\t')
    df_sample = df_full.sample(n=10)
    df = df.append(df_sample)
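With files of 20 million records, reading each file in full just to keep 20 thousand rows can be heavy on memory. As an alternative (a sketch, not part of the answer above), the df_full = pd.read_csv(...) line inside the loop could be replaced by a read that picks the rows to keep up front and skips the rest. It assumes the real files have at least 20,000 data rows.

    import random

    # Count the data rows (excluding the header), choose 20,000 row numbers
    # to keep, and tell read_csv to skip everything else.
    n_rows = sum(1 for _ in open(filename)) - 1
    keep = set(random.sample(range(1, n_rows + 1), 20000))
    df_sample = pd.read_csv(filename, sep='\t',
                            skiprows=lambda i: i > 0 and i not in keep)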
I have two parameters, filename and time, and I want to write them as columns in a CSV file. These two parameters are set in a for loop, so their values change on each iteration.
My current Python code is below, but the resulting CSV is not what I want:
import csv
import os

with open("txt/scalable_decoding_time.csv", "wb") as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    filename = ["one","two", "three"]
    time = ["1","2", "3"]
    zipped_lists = zip(filename,time)
    for row in zipped_lists:
        print row
        writer.writerow(row)
My CSV file must look like the example below. The , must be the delimiter, so I must get two columns.
one, 1
two, 2
three, 3
Right now my CSV file stores all the data in one column.
Do you know how to fix this?
Well, the issue here is that you are using writerows instead of writerow.
import csv
import os

with open("scalable_decoding_time.csv", "wb") as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    level_counter = 0
    max_levels = 3
    filename = ["one","two", "three"]
    time = ["1","2", "3"]
    while level_counter < max_levels:
        writer.writerow((filename[level_counter], time[level_counter]))
        level_counter = level_counter + 1
This gave me the result:
one,1
two,2
three,3
This is another solution.
Put the following code into a Python script that we will call sc-123.py:
filename = ["one","two", "three"]
time = ["1","2", "3"]

for a,b in zip(filename,time):
    print('{}{}{}'.format(a,',',b))
Once the script is ready, run it like this:
python2 sc-123.py > scalable_decoding_time.csv
You will have the results formatted the way you want
one,1
two,2
three,3