How to compress / decompress a serialized Pandas Dataframe with PyArrow?

I am using Redis to store a Pandas dataframe. I am using PyArrow for serialization and would like to add compression.
I can serialize/deserialize dataframes with no problems. I can also compress the serialized dataframe. However, I cannot seem to decompress it.
When I try to decompress, I get: ValueError: Must pass decompressed_size for lz4 codec
So, I add the size of the object and get: ArrowIOError: Corrupt Lz4 compressed data.
Thinking it might be a problem with Pandas dataframes, I tried a simple text string and got the same result. I also suspected the lz4 codec, but the same errors occur with 'gzip'. Any help would be very much appreciated.
import pandas
import pyarrow
import sys

df = pandas.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
ser = pyarrow.serialize(df).to_buffer()
comp = pyarrow.compress(ser, asbytes=True)
dec = pyarrow.decompress(comp)
# Gives ValueError: Must pass decompressed_size for lz4 codec
siz = sys.getsizeof(ser)  # siz = 56
dec = pyarrow.decompress(comp, decompressed_size=siz)
# Gives ArrowIOError: Corrupt Lz4 compressed data.

sys.getsizeof provides the wrong size: it reports the size of the Python Buffer wrapper object (56 bytes here), not the length of the serialized data it points to. Use len(ser) to get the actual number of bytes. The following code round trips:
import pandas
import pyarrow

df = pandas.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
ser = pyarrow.serialize(df).to_buffer()
comp = pyarrow.compress(ser, asbytes=True)
siz = len(ser)  # siz = 3912
dec = pyarrow.decompress(comp, decompressed_size=siz)
pyarrow.deserialize(dec)
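Since the original goal is to keep the dataframe in Redis, a minimal sketch of the full round trip might look like the following. The Redis connection details and the key names ('df_lz4', 'df_size') are assumptions for illustration, and note that pyarrow.serialize/deserialize are deprecated in newer PyArrow releases; this follows the API used in the question.
import pandas
import pyarrow
import redis

r = redis.Redis(host='localhost', port=6379)  # assumed connection details

df = pandas.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Serialize, remember the uncompressed length, then compress and store both.
ser = pyarrow.serialize(df).to_buffer()
r.set('df_size', len(ser))                      # decompressed size kept alongside the data
r.set('df_lz4', pyarrow.compress(ser, asbytes=True))

# Later: read back, decompress with the stored size, deserialize.
siz = int(r.get('df_size'))
dec = pyarrow.decompress(r.get('df_lz4'), decompressed_size=siz)
df2 = pyarrow.deserialize(dec)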

Related

Unable to import numpy array properly from a text file

I want to convert the string data into a 2-D numpy array.
I'm importing a .txt file from a directory which contains:
[[18,1,2018,12,15],
[07,1,2018,12,15],
[03,1,2018,12,15]]
and the code is:
import numpy as np
f = open("/home/pi/timer_database.txt","r")
read = f.read()
x = np.array(list(read))
print(x.size)
print(type(x))
print(x.ndim)
The output is :
47
type <numpy.ndarray>
1
Please help me in this issue.
Use this code:
import numpy as np

# Read the raw text and strip the brackets so only comma-separated numbers remain.
with open("/home/pi/timer_database.txt", "r") as f:
    read = f.read()
read = read.replace("[", "")
read = read.replace("]", "")
read = read.replace(",\n", "\n")

# Write the cleaned text out and load it as a 2-D array.
with open("New_Array.txt", "w+") as f:
    f.write(read)
Array = np.loadtxt("New_Array.txt", delimiter=',')
print(Array)
You can use ast to evaluate your string, which is much easier than parsing the whole thing:
import ast
x=np.array(ast.literal_eval(read))
Or simply eval:
x=np.array(eval(read))
But this will raise an error because of the leading zeros you have, so first simply remove them:
import re
read=re.sub(r'\b0','',read)
Also, if you are the one writing the file, it is much more advisable to use a different approach; the simplest would be to use pickle.
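For example, a minimal sketch of writing and reading the same data with pickle (the file name timer_database.pkl is just an illustration):
import pickle
import numpy as np

data = [[18, 1, 2018, 12, 15],
        [7, 1, 2018, 12, 15],
        [3, 1, 2018, 12, 15]]

# Write the nested list in pickle's binary format...
with open("timer_database.pkl", "wb") as f:
    pickle.dump(data, f)

# ...then read it back and convert it to a 2-D numpy array.
with open("timer_database.pkl", "rb") as f:
    x = np.array(pickle.load(f))
print(x.shape)   # (3, 5)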

How to generate unigram, bigram and trigram from a large csv file and count their frequencies using nltk or pure python

I used this code and it generates unigrams, bigrams and trigrams from the given text. But I want to extract unigrams, bigrams and trigrams from a specific column of a large csv file. Kindly help me with how I should proceed.
Firstly, some fancy code to produce the DataFrame.
from io import StringIO
import pandas as pd
sio = StringIO("""I am just going to type up something because you inserted an image instead ctr+c and ctr+v the code to Stackoverflow.
Actually, it's unclear what you want to do with the ngram counts.
Perhaps, it might be better to use the `nltk.everygrams()` if you want a global count.
And if you're going to build some sort of ngram language model, then it might not be efficient to do it as you have done too.""")
with sio as fin:
    texts = [line for line in fin]

df = pd.DataFrame({'text': texts})
Then you can easily use DataFrame.apply to extract the ngrams, e.g.
from collections import Counter
from functools import partial
from nltk import ngrams, word_tokenize
for i in range(1, 4):
    _ngrams = partial(ngrams, n=i)
    df['{}-grams'.format(i)] = df['text'].apply(lambda x: Counter(_ngrams(word_tokenize(x))))
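To do the same on a specific column of a large csv file, a minimal sketch might look like this. The file name tweets.csv and the column name text are assumptions; reading with chunksize keeps memory use bounded for a large file, and the counts are accumulated globally.
from collections import Counter
from nltk import ngrams, word_tokenize
import pandas as pd

totals = {n: Counter() for n in range(1, 4)}

# Stream the csv in chunks and accumulate global ngram counts for the 'text' column.
for chunk in pd.read_csv('tweets.csv', usecols=['text'], chunksize=10000):
    for line in chunk['text'].astype(str):
        tokens = word_tokenize(line)
        for n in range(1, 4):
            totals[n].update(ngrams(tokens, n))

print(totals[2].most_common(10))   # ten most frequent bigrams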

How to create a bag of words from csv file in python?

I am new to Python. I have a csv file of cleaned tweets, and I want to create a bag of words from these tweets.
I have the following code, but it's not working correctly.
import pandas as pd
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer
data = pd.read_csv(open("Twidb11.csv"), sep=' ')
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(data.Text)
count_vect.vocabulary_
Error:
.ParserError: Error tokenizing data. C error: Expected 19 fields in
line 5, saw 22
It's a duplicate, I think. You can see the answer here; there are a lot of answers and comments.
So, the solution can be:
data = pd.read_csv('Twidb11.csv', error_bad_lines=False)
Or:
df = pandas.read_csv(fileName, sep='delimiter', header=None)
"In the code above, sep defines your delimiter and header=None tells pandas that your source data has no row for headers / column titles. Thus saith the docs: "If file contains no header row, then you should explicitly pass header=None". In this instance, pandas automatically creates whole-number indeces for each field {0,1,2,...}."

Faster approach to transform a bunch of .csv file into HDF dataframe

1. Background
HDF is a superb file format for data storage and management.
I have source data (365 .csv files) containing the air quality data (1 h time resolution) for all monitoring sites (more than 1500) in China. Each file consists of many features (particulate matter, SO2, etc.) and their corresponding times.
I have uploaded some template files here for anyone interested.
My goal ==> Merge all the files into one dataframe for efficient management
2. My code
# -*- coding: utf-8 -*-
import os
import pandas as pd
from pandas import HDFStore

### CREATE AN EMPTY HDF5 FILE
hdf = HDFStore("site_2016_whole_year.h5")

### READ THE CSV FILES AND SAVE THEM INTO HDF5 FORMAT
os.chdir("./site_2016/")
files = os.listdir("./")
files.sort()

### Read a template file to get the names of the columns
test_file = "china_sites_20160101.csv"
test_f = pd.read_csv(test_file, encoding='utf_8')
site_columns = list(test_f.columns[3:])
print(site_columns[1])

feature = ['pm25', 'pm10', 'O3', 'O3_8h', 'CO', "NO2", 'SO2', "aqi"]
fe_dict = {"pm25": 1, "aqi": 0, 'pm10': 3, 'SO2': 5, 'NO2': 7, 'O3': 9, "O3_8h": 11, "CO": 13}

for k in range(len(feature)):
    # One table per feature: a date column, an hour column and one column per site
    data_2016 = {"date": [], 'hour': []}
    for i in range(len(site_columns)):
        data_2016[site_columns[i]] = []
    for file in files:
        filename, extname = os.path.splitext(file)
        if extname == ".csv":
            f_day = pd.read_csv(file, encoding='utf_8')
            site_columns = list(f_day.columns[3:])
            # Rows come in blocks of 15 (one row per feature), so step through 15 rows at a time
            for i in range(0, len(f_day), 15):
                datetime = str(f_day["date"].iloc[i])
                hour = "%02d" % f_day["hour"].iloc[i]
                data_2016["date"].append(datetime)
                data_2016["hour"].append(hour)
                for t in range(len(site_columns)):
                    data_2016[site_columns[t]].append(
                        f_day[site_columns[t]].iloc[i + fe_dict[feature[k]]])
    data_2016 = pd.DataFrame(data_2016)
    hdf.put(feature[k], data_2016, format='table', encoding="utf-8")
3. My problem
Using my code above, the HDF5 file can be created, but it is slow.
My lab has a Linux cluster with a 32-core CPU. Is there any way to turn my program into a multiprocessing one?
Maybe I do not understand your problem properly, but I would use something like this:
import os
import pandas as pd

indir = <'folder with downloaded 12 csv files'>
indata = []
for i in os.listdir(indir):
    indata.append(pd.read_csv(indir + i))

out_table = pd.concat(indata)
hdf = pd.HDFStore("site_2016_whole_year.h5", complevel=9, complib='blosc')
hdf.put('table1', out_table)
hdf.close()
For 12 input files it takes 2.5 s on my laptop, so even for 365 files it should be done in a minute or so. I do not see a need for parallelisation in this case.
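If you do want to use the cluster's cores anyway, a minimal sketch with multiprocessing might look like this: the expensive part (parsing each csv) is farmed out to a pool of workers, while the concatenation and the HDF write stay in the parent process. The folder path and the worker count are assumptions.
import os
from multiprocessing import Pool
import pandas as pd

indir = "./site_2016/"                      # assumed folder with the csv files

def read_one(name):
    # Each worker parses one csv file and returns the resulting dataframe.
    return pd.read_csv(os.path.join(indir, name), encoding='utf_8')

if __name__ == "__main__":
    csv_files = sorted(f for f in os.listdir(indir) if f.endswith(".csv"))
    with Pool(processes=8) as pool:         # 8 workers; tune to your machine
        frames = pool.map(read_one, csv_files)
    out_table = pd.concat(frames, ignore_index=True)
    hdf = pd.HDFStore("site_2016_whole_year.h5", complevel=9, complib='blosc')
    hdf.put('table1', out_table, format='table')
    hdf.close()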

How do I import a CSV file?

I am new to using Python and Pandas, and I am trying to import a CSV or text file into a quoted list of tickers, like
sp500 = ['appl', 'ibm', 'csco']
df = pd.read_csv('C:\\data\\stock.txt', index_col=[0])
df
which gets me:
Out[20]:
Empty DataFrame
Columns: []
Index: [AAPL, IBM, CSCO]
Any help would be great.
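A minimal sketch, assuming the file holds the three tickers in a single column with no header row: read it without treating the first row as a header, then convert the column to a lowercase list. The column index 0 and the str.lower() call are assumptions about the desired output.
import pandas as pd

# Read the single-column file without a header row, then build the quoted list.
df = pd.read_csv('C:\\data\\stock.txt', header=None)
sp500 = df[0].astype(str).str.lower().tolist()
print(sp500)   # e.g. ['aapl', 'ibm', 'csco']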