Hive ORC compression with Snappy - amazon-web-services

Using: Amazon AWS Hive (0.13)
Trying to: output ORC files with Snappy compression.
create external table output(
col1 string)
partitioned by (col2 string)
stored as orc
location 's3://mybucket'
tblproperties ("orc.compress"="SNAPPY");
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.compress.output = true;
set mapred.output.compression.type = BLOCK;
set mapred.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;
insert into table output
partition(col2)
select col1,col2 from input;
The problem is that when I look at the output in the mybucket directory, it does not have a SNAPPY extension; it is just a binary file. What setting am I missing to get these ORC files compressed and written out with a SNAPPY extension?

ORC files are binary files in a specialized format. When you specify orc.compress = SNAPPY, the contents of the file are compressed using Snappy. ORC is a semi-columnar file format.
Take a look at the ORC documentation for more information about how the data is laid out.
Streams are compressed using a codec, which is specified as a table property for all streams in that table. To optimize memory use, compression is done incrementally as each block is produced. Compressed blocks can be jumped over without first having to be decompressed for scanning. Positions in the stream are represented by a block start location and an offset into the block.
In short, your files are compressed using the Snappy codec; you just can't tell, because it is the blocks inside the file that are actually compressed, so no extension is ever added to the file name. (Note also that the hive.exec.compress.output and mapred.output.compression.* settings apply to formats such as text and SequenceFile; for ORC, compression is controlled by the orc.compress table property.)

Additionally, you can use hive --orcfiledump /apps/hive/warehouse/orc/000000_0 to see the details of your file. The output will look like:
Reading ORC rows from /apps/hive/warehouse/orc/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
Rows: 6
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:string,_col1:int>
Stripe Statistics:
Stripe 1:
Column 0: count: 6
Column 1: count: 6 min: Beth max: Owen sum: 29
Column 2: count: 6 min: 1 max: 6 sum: 21
File Statistics:
Column 0: count: 6
Column 1: count: 6 min: Beth max: Owen sum: 29
Column 2: count: 6 min: 1 max: 6 sum: 21
....


Scikit-learn LabelEncoder: how to preserve mappings between batches?

I have 185 million samples, at about 3.8 MB per sample. To prepare my dataset, I need to one-hot encode many of the features, after which I end up with over 15,000 features.
But I need to prepare the dataset in batches, since the memory footprint exceeds 100 GB for the features alone when one-hot encoding only 3 million samples.
The question is how to preserve the encodings/mappings/labels between batches?
The batches are not going to have all the levels of a category necessarily. That is, batch #1 may have: Paris, Tokyo, Rome.
Batch #2 may have Paris, London.
But in the end I need to have Paris, Tokyo, Rome, London all mapped to one encoding all at once.
Assuming that I cannot determine the levels of my Cities column across all 185 million samples at once, since they won't fit in RAM, what should I do?
If I apply the same LabelEncoder instance to different batches, will the mappings remain the same?
I will also need to do the one-hot encoding in batches afterwards, either with scikit-learn or Keras' np_utils.to_categorical. So, same question: how do I use those three methods in batches, or apply them at once to a file format stored on disk?
I suggest using Pandas' get_dummies() for this, since sklearn's OneHotEncoder() needs to see all possible categorical values at .fit() time; otherwise it will throw an error when it encounters a new one during .transform().
# Create toy dataset and split to batches
data_column = pd.Series(['Paris', 'Tokyo', 'Rome', 'London', 'Chicago', 'Paris'])
batch_1 = data_column[:3]
batch_2 = data_column[3:]
# Convert categorical feature column to matrix of dummy variables
batch_1_encoded = pd.get_dummies(batch_1, prefix='City')
batch_2_encoded = pd.get_dummies(batch_2, prefix='City')
# Row-bind (append) Encoded Data Back Together
final_encoded = pd.concat([batch_1_encoded, batch_2_encoded], axis=0)
# Final wrap-up. Replace nans with 0, and convert flags from float to int
final_encoded = final_encoded.fillna(0)
final_encoded[final_encoded.columns] = final_encoded[final_encoded.columns].astype(int)
final_encoded
output
City_Chicago City_London City_Paris City_Rome City_Tokyo
0 0 0 1 0 0
1 0 0 0 0 1
2 0 0 0 1 0
3 0 1 0 0 0
4 1 0 0 0 0
5 0 0 1 0 0
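If the full set of category levels can be collected in a cheap first pass over the file (only the unique values need to fit in RAM, not the rows), another option is to cast each batch to a Categorical with that fixed set of levels; get_dummies() then emits the same columns for every batch, and the fillna step is no longer needed. A minimal sketch, assuming the city list is known up front:

```python
import pandas as pd

# Assumed to have been collected in a first pass over the data
all_cities = ['Chicago', 'London', 'Paris', 'Rome', 'Tokyo']

batch_1 = pd.Series(['Paris', 'Tokyo', 'Rome'])
batch_2 = pd.Series(['London', 'Chicago', 'Paris'])

# Casting to Categorical fixes the set of levels, so every batch
# produces an identical column layout, including unseen cities
enc_1 = pd.get_dummies(pd.Categorical(batch_1, categories=all_cities), prefix='City')
enc_2 = pd.get_dummies(pd.Categorical(batch_2, categories=all_cities), prefix='City')

assert list(enc_1.columns) == list(enc_2.columns)
```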

Efficient storage of large string column in pandas dataframe

I have a large pandas dataframe with a string column that is highly skewed in string length. Most of the rows have strings of length < 20, but some rows have strings longer than 2000 characters.
I store this dataframe on disk using pandas.HDFStore.append with min_itemsize = 4000. However, this approach is highly inefficient: the HDF5 file is very large, and we know that most of it is empty.
Is it possible to assign different sizes to the rows of this string column? That is, assign a small min_itemsize to rows whose string is short, and a large min_itemsize to rows whose string is long.
When using HDFStore to store strings, the maximum length of a string in the column is the width used for that particular column; this can be customized, see here.
Several options are available to handle various cases. Compression can help a lot.
In [6]: df = DataFrame({'A' : ['too']*10000})
In [7]: df.iloc[-1] = 'A'*4000
In [8]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Data columns (total 1 columns):
A 10000 non-null object
dtypes: object(1)
memory usage: 156.2+ KB
These are fixed stores; the strings are stored as object types, so it's not particularly performant, nor are these stores available for querying / appending.
In [9]: df.to_hdf('test_no_compression_fixed.h5','df',mode='w',format='fixed')
In [10]: df.to_hdf('test_no_compression_table.h5','df',mode='w',format='table')
Table stores are quite flexible, but force a fixed size on the storage.
In [11]: df.to_hdf('test_compression_fixed.h5','df',mode='w',format='fixed',complib='blosc')
In [12]: df.to_hdf('test_compression_table.h5','df',mode='w',format='table',complib='blosc')
Generally, using a categorical representation provides run-time and storage efficiency.
In [13]: df['A'] = df['A'].astype('category')
In [14]: df.to_hdf('test_categorical_table.h5','df',mode='w',format='table')
In [15]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Data columns (total 1 columns):
A 10000 non-null category
dtypes: category(1)
memory usage: 87.9 KB
In [18]: ls -ltr *.h5
-rw-rw-r-- 1162080 Aug 31 06:36 test_no_compression_fixed.h5
-rw-rw-r-- 1088361 Aug 31 06:39 test_compression_fixed.h5
-rw-rw-r-- 40179679 Aug 31 06:36 test_no_compression_table.h5
-rw-rw-r-- 259058 Aug 31 06:39 test_compression_table.h5
-rw-rw-r-- 339281 Aug 31 06:37 test_categorical_table.h5
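The in-memory saving from the categorical cast can be checked directly with memory_usage(deep=True). A small sketch mirroring the frame above (the exact byte counts vary by pandas version, so only the direction of the comparison matters):

```python
import pandas as pd

# Same shape of data as above: one highly repetitive string column
# with a single very long outlier
df = pd.DataFrame({'A': ['too'] * 10000})
df.iloc[-1] = 'A' * 4000

obj_bytes = df['A'].memory_usage(deep=True)
cat_bytes = df['A'].astype('category').memory_usage(deep=True)

# The categorical column stores each distinct string once,
# plus a small integer code per row
assert cat_bytes < obj_bytes
```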

CSV parsing and manipulation using Python

I have a CSV file which I need to parse using Python.
triggerid,timestamp,hw0,hw1,hw2,hw3
1,234,343,434,78,56
2,454,22,90,44,76
I need to read the file line by line and slice out the triggerid, timestamp and hw3 columns. But the column sequence may change from run to run, so I need to match the field names, work out the column positions, and then print the output file as:
triggerid,timestamp,hw3
1,234,56
2,454,76
Also, is there a way to generate a hash table (like we have in Perl) so that I can store the entire column for hw0 (hw0 as key and the values in the column as values) for other modifications?
I'm unsure what you mean by "count the column".
An easy way to read the data in would use pandas, which was designed for just this sort of manipulation. This creates a pandas DataFrame from your data using the first row as titles.
In [374]: import pandas as pd
In [375]: d = pd.read_csv("30735293.csv")
In [376]: d
Out[376]:
triggerid timestamp hw0 hw1 hw2 hw3
0 1 234 343 434 78 56
1 2 454 22 90 44 76
You can select one of the columns using a single column name, and multiple columns using a list of names:
In [377]: d[["triggerid", "timestamp", "hw3"]]
Out[377]:
triggerid timestamp hw3
0 1 234 56
1 2 454 76
You can also adjust the indexing so that one or more of the data columns are used as index values:
In [378]: d1 = d.set_index("hw0"); d1
Out[378]:
triggerid timestamp hw1 hw2 hw3
hw0
343 1 234 434 78 56
22 2 454 90 44 76
Using the .loc attribute you can retrieve a series for any indexed row:
In [390]: d1.loc[343]
Out[390]:
triggerid 1
timestamp 234
hw1 434
hw2 78
hw3 56
Name: 343, dtype: int64
You can use the column names to retrieve the individual column values from that one-row series:
In [393]: d1.loc[343]["triggerid"]
Out[393]: 1
Since you already have a solution for the slices, here's something for the hash-table part of the question:
import csv

with open('/path/to/file.csv', 'rb') as fin:
    ht = {}
    cr = csv.reader(fin)
    k = cr.next()[2]  # third header field, e.g. 'hw0'
    ht[k] = list()
    for line in cr:
        ht[k].append(line[2])
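The snippet above keeps a single column; here is a sketch of the generalization, building a Perl-style hash for every column keyed by its header name (an in-memory buffer stands in for the real file):

```python
import csv
import io

# Stand-in for open('/path/to/file.csv'); replace with the real file handle
src = io.StringIO(
    "triggerid,timestamp,hw0,hw1,hw2,hw3\n"
    "1,234,343,434,78,56\n"
    "2,454,22,90,44,76\n")

cr = csv.reader(src)
header = next(cr)
columns = {name: [] for name in header}   # one list per column
for row in cr:
    for name, value in zip(header, row):
        columns[name].append(value)

print(columns['hw0'])   # ['343', '22']
```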
I used a different approach (using the .index function):
bpt_mode = ["bpt_mode_64", "bpt_mode_128"]

with open('StripValues.csv') as file:
    stats = next(file).strip().split(",")  # header row gives the column order
    for line in file:
        stat_values = line.split(",")
        print stat_values[stats.index('trigger_id')], ',',
        for j in range(len(bpt_mode)):
            print stat_values[stats.index('hw.gpu.s0.ss0.dg.' + bpt_mode[j])], ',',
        print
@holdenweb Though I am unable to figure out how to print the output to a file; currently I am redirecting while running the script.
Can you provide a solution for writing to a file? There will be multiple writes to a single file.
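One way to write the selected columns to a file is csv.DictReader / csv.DictWriter, which match on field names so the column order in the input no longer matters. A Python 3 sketch, with in-memory buffers standing in for the input and output files (the filenames would be placeholders in any case):

```python
import csv
import io

# Stand-ins for open('input.csv') and open('output.csv', 'w', newline='')
src = io.StringIO(
    "triggerid,timestamp,hw0,hw1,hw2,hw3\n"
    "1,234,343,434,78,56\n"
    "2,454,22,90,44,76\n")
dst = io.StringIO()

wanted = ['triggerid', 'timestamp', 'hw3']
reader = csv.DictReader(src)                  # keys each row by header name
writer = csv.DictWriter(dst, fieldnames=wanted, extrasaction='ignore')
writer.writeheader()
for row in reader:
    writer.writerow(row)                      # extra columns are silently dropped

print(dst.getvalue())
```

Keeping the writer open across many writerow() calls covers the "multiple writes to a single file" requirement.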

Python Pandas read_csv issue

I have simple CSV file that looks like this:
inches,12,3,56,80,45
tempF,60,45,32,80,52
I read in the CSV using this command:
import pandas as pd
pd_obj = pd.read_csv('test_csv.csv', header=None, index_col=0)
Which results in this structure:
1 2 3 4 5
0
inches 12 3 56 80 45
tempF 60 45 32 80 52
But I want this (unnamed index column):
0 1 2 3 4
inches 12 3 56 80 45
tempF 60 45 32 80 52
EDIT: As @joris pointed out, additional methods can be run on the resulting DataFrame to achieve the wanted structure. My question is specifically about whether or not this structure can be achieved through read_csv arguments.
From the documentation of the function:
names : array-like
List of column names to use. If file contains no header row, then you
should explicitly pass header=None
so, apparently:
pd_obj = pd.read_csv('test_csv.csv', header=None, index_col=0, names=range(5))
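For completeness, a sketch of the post-processing route the edit mentions, relabelling after the read rather than through read_csv arguments (an in-memory buffer stands in for test_csv.csv):

```python
import io

import pandas as pd

# Stand-in for 'test_csv.csv'
csv_text = "inches,12,3,56,80,45\ntempF,60,45,32,80,52\n"

df = pd.read_csv(io.StringIO(csv_text), header=None, index_col=0)
df.columns = range(df.shape[1])   # relabel the columns 0..4
df.index.name = None              # drop the index name for an unnamed index column

assert list(df.columns) == [0, 1, 2, 3, 4]
```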

RRD graph configuration query

I am updating my RRD file with some counts...
For example:
time:   value:
12:00   120
12:05   135
12:10   154
12:20   144
12:25   0
12:30   23
12:35   36
Here my RRD updates using the logic below:
((current value) - (previous value)) / ((current time) - (previous time))
e.g. (135 - 120)/5 = 3
But my problem is that when a 0 comes in, the reading becomes negative:
(0 - 144)/5
Here the "0" value only occurs on a system failure (at the source the data is fetched from), and this reading must not be displayed in the graph.
How can I configure things so that when a 0 comes in, it does not update the RRD graph (i.e. skip the reading (0 - 144)/5), and the next time the reading is taken as (23 - 0)/5 and not (23 - 144)/10?
When specifying the data sources at RRD creation time, you can specify which range of values is acceptable.
DS:data_source:GAUGE:10:1:U will only accept values of 1 or above (the fields after GAUGE are the heartbeat, the minimum and the maximum, with U meaning unbounded).
So if you get a 0 during an update, rrdtool will replace it with unknown, and unknown values show up as gaps in the graph rather than as data points.