Notes:
I have a Django model on PostgreSQL with:
raw_data = models.TextField(_("raw_data"), default='')
It just stores some raw data, which can be anywhere from 1 KB to 200 KB per row.
I have 50 million rows.
Requirement:
I need to decrease the size of the data in the database.
Questions:
1. How can I tell what consumes the most space across the whole database? (See the size-check sketch after these questions.)
2. Should I compress the string before storing the data?
2.1 I saw here: Text compression in PostgreSQL that it gets compressed anyway. Is that true?
2.2 I did some compression in Python, but I am not sure whether converting the bytes to a string type can lose data:
def shrink_raw_data(username):
    follower_data = get_string_from_database()
    text = json.dumps(follower_data).encode('utf-8')  # outputs bytes
    # Checking size of text
    text_size = sys.getsizeof(text)
    print("\nsize of original text", text_size)
    # Compressing text
    compressed = str(zlib.compress(text, 9))
    # store string in database
    # Checking size of text after compression
    csize = sys.getsizeof(compressed)
    print("\nsize of compressed text", csize)
    # Decompressing text
    decompressed = zlib.decompress(compressed)
    # Checking size of text after decompression
    dsize = sys.getsizeof(decompressed)
    print("\nsize of decompressed text", dsize)
    print("\nDifference of size= ", text_size - csize)
    follower_data_reload = json.loads(decompressed)
    print(follower_data_reload == follower_data)
2.3 Since my data is stored as a string in the DB, is this line, str(zlib.compress(text, 9)), valid? (See the sketch below.)
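For question 1, a rough sketch (assuming the default Django database connection; pg_statio_user_tables, pg_total_relation_size and pg_size_pretty are standard PostgreSQL catalog objects) that lists the tables taking the most space, including indexes and TOAST data:

from django.db import connection

def largest_tables(limit=10):
    # List the tables that consume the most space (heap + indexes + TOAST).
    with connection.cursor() as cursor:
        cursor.execute(
            """
            SELECT relname,
                   pg_size_pretty(pg_total_relation_size(relid)) AS total_size
            FROM pg_catalog.pg_statio_user_tables
            ORDER BY pg_total_relation_size(relid) DESC
            LIMIT %s
            """,
            [limit],
        )
        return cursor.fetchall()

For 2.2/2.3: zlib.compress() already returns bytes, and wrapping the result in str() only produces the "b'...'" text representation, which zlib.decompress() cannot read back, so that line is not a safe way to store the data. A minimal sketch of a lossless round trip, assuming the column could be switched to a BinaryField (or that the bytes are base64-encoded before going into the existing TextField):

import json
import zlib

def pack_raw_data(obj):
    # JSON-serialisable object -> compressed bytes (suitable for a BinaryField)
    return zlib.compress(json.dumps(obj).encode('utf-8'), 9)

def unpack_raw_data(blob):
    # compressed bytes -> original object
    return json.loads(zlib.decompress(blob).decode('utf-8'))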
I am trying to use this Django reference, [Streaming large CSV files][1], to download a pandas DataFrame as a CSV file.
[1]: https://docs.djangoproject.com/en/2.2/howto/outputting-csv/#streaming-large-csv-files
It requires a generator.
# Generate a sequence of rows. The range is based on the maximum number of
# rows that can be handled by a single sheet in most spreadsheet
# applications.
rows = (["Row {}".format(idx), str(idx)] for idx in range(65536))
If I have a DataFrame called my_df (with 20 columns and 10,000 rows), how do I revise this logic to use my_df instead of generating numbers as in the example?
Something like:
response = HttpResponse(content_type='text/csv') # Format response as a CSV
filename = 'some_file_name.csv'
response['Content-Disposition'] = 'attachment; filename="' + filename + '"'  # Name the CSV response
my_df.to_csv(response, encoding='utf-8', index=False)
return response
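For the streaming approach from the linked documentation, a possible sketch (the view name download_my_df is made up, my_df is assumed to be available inside the view, and the Echo pseudo-buffer is the pattern from the Django docs):

import csv
import itertools

from django.http import StreamingHttpResponse

class Echo:
    # Pseudo-buffer from the Django docs: write() just hands the value back.
    def write(self, value):
        return value

def download_my_df(request):
    # my_df is assumed to already exist here (e.g. built earlier in the view)
    writer = csv.writer(Echo())
    header = [list(my_df.columns)]
    data_rows = (list(row) for row in my_df.itertuples(index=False, name=None))
    response = StreamingHttpResponse(
        (writer.writerow(row) for row in itertools.chain(header, data_rows)),
        content_type='text/csv',
    )
    response['Content-Disposition'] = 'attachment; filename="some_file_name.csv"'
    return response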
What happens if I save an image twice using PIL with the same image quality?
from PIL import Image
quality = 85
# Open original image and save
img = Image.open('image.jpg')
img.save('test1.jpg', format='JPEG', quality=quality)
# Open the saved image and save again with same quality
img = Image.open('test1.jpg')
img.save('test2.jpg', format='JPEG', quality=quality)
There is almost no difference in the image size or the image quality.
Can I assume that saving an image multiple times with the same quality does not affect the actual image quality, and that it is safe to do so?
Also, if I save an image with 85% quality and then open it and save it with 95% quality, the image size becomes much larger. Does that mean PIL decompresses the image and compresses it again?
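One quick way to check this empirically (a sketch, assuming the test1.jpg and test2.jpg files saved above; ImageChops is part of Pillow) is to diff the decoded pixels of the two files:

from PIL import Image, ImageChops

# Compare the decoded pixel data of the two re-saved files
im1 = Image.open('test1.jpg')
im2 = Image.open('test2.jpg')
diff = ImageChops.difference(im1, im2)
# getbbox() returns None when the two images decode to identical pixels
print('identical pixels:', diff.getbbox() is None)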
In most cases your test1.jpg and test2.jpg images will be slightly different. That is, some of the information stored in test1.jpg will be lost after you open (decompress) and save it (compress again) with lossy JPEG compression.
In some cases, however, opening and saving a JPEG image with the same software will not introduce any changes.
Take a look at this example:
from PIL import Image
import os
import hashlib

def md5sum(fn):
    hasher = hashlib.md5()
    with open(fn, 'rb') as f:
        hasher.update(f.read())
    return hasher.hexdigest()

TMP_FILENAME = 'tmp.jpg'
orig = Image.open(INPUT_IMAGE_FILENAME)
orig.save(TMP_FILENAME)  # first JPG compression, standard quality

d = set()
for i in range(10000):
    # Compute file statistics
    file_size = os.stat(TMP_FILENAME).st_size
    md5 = md5sum(TMP_FILENAME)
    print('Step {}, file size = {}, md5sum = {}'.format(i, file_size, md5))
    if md5 in d:
        break
    d.add(md5)
    # Decompress / compress
    im = Image.open(TMP_FILENAME)
    im.save(TMP_FILENAME, quality=95)
It will open and save a JPG file repeatedly until a cycle is found (meaning an opened image has exactly the same data as one produced before).
In my testing, it takes anywhere from 50 to 700 cycles to reach a steady state (where opening and saving the image no longer produces any loss). However, the final "steady" image is noticeably different from the original.
Image after first JPG compression:
Resulting "steady" image after 115 compress/decompress cycles:
Sample output:
Step 0, file size = 38103, md5sum = ea28705015fe6e12b927296c53b6d147
Step 1, file size = 71707, md5sum = f5366050780be7e9c52dd490e9e69316
...
Step 113, file size = 70050, md5sum = 966aabe454aa8ec4fd57875bab7733da
Step 114, file size = 70050, md5sum = 585ecdd66b138f76ffe58fe9db919ad7
Step 115, file size = 70050, md5sum = 585ecdd66b138f76ffe58fe9db919ad7
So even though I used a relatively high quality setting of 95, as you can see, repeated compression/decompression cycles made the image lose its colors and sharpness. Even for a quality setting of 100 the result will be very similar, despite an almost twice-as-large file size.
I am new to coding and have a lot of big data to deal with. Currently I am trying to merge 26 TSV files (each has two columns without a header: a contig_number and a count).
If a TSV did not have a count for a particular contig_number, it does not have that row - so I am attempting to use how='outer' and fill in the missing values with 0 afterwards.
I have been successful with the TSVs I subsetted to run the initial tests, but when I run the script on the actual data, which is large (~40,000 rows, two columns), more and more memory is used...
I got to 500 GB of RAM on the server and called it a day.
This is the code that works on the subsetted TSVs:
import glob
import logging
from functools import reduce

import pandas as pd

files = glob.glob('*_count.tsv')
data_frames = []

logging.info("Reading in sample files and adding to list")
for fp in files:
    # read in the files and put them into dataframes
    df = pd.read_csv(fp, sep='\t', header=None, index_col=0)
    # rename the columns so we know what file they came from
    df = df.rename(columns={1: str(fp)}).reset_index()
    df = df.rename(columns={0: "contig"})
    # append the dataframes to a list
    data_frames.append(df)

logging.info("Merging the tables on contig, and fill in samples with no counts for contigs")
# merge the tables on contig with how='outer', which includes all rows but leaves NaN where there is no data
df = reduce(lambda left, right: pd.merge(left, right, how='outer', on="contig"), data_frames)
# this bit is important to fill missing data with a 0
df.fillna(0, inplace=True)

logging.info("Writing concatenated count table to file")
# write the dataframe to file
df.to_csv("combined_bamm_filter_count_file.tsv",
          sep='\t', index=False, header=True)
I would appreciate any advice or suggestions! Maybe there is just too much to hold in memory, and I should be trying something else.
Thank you!
I usually do these types of operations with pd.concat. I don't know the exact details of why it's more efficient, but pandas has some optimizations for combining indices.
I would do
data_frames = []
for fp in files:
    # read in the files and put them into dataframes
    df = pd.read_csv(fp, sep='\t', header=None, index_col=0)
    # rename the count column so we know what file it came from,
    # and just keep the contig as the index
    df = df.rename(columns={1: str(fp)})
    data_frames.append(df)

df_full = pd.concat(data_frames, axis=1)
and then df_full = df_full.fillna(0) if you want to.
In fact, since each of your files has only one column (plus an index), you may do even better by treating them as Series instead of DataFrames; a sketch of that follows below.
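A rough sketch of that Series-based variant (assuming the same *_count.tsv files; .iloc[:, 0] pulls the single count column out of each one-column DataFrame):

import glob

import pandas as pd

series_list = []
for fp in glob.glob('*_count.tsv'):
    df = pd.read_csv(fp, sep='\t', header=None, index_col=0)
    s = df.iloc[:, 0]        # the single count column as a Series
    s.name = str(fp)         # label it with the file it came from
    s.index.name = "contig"
    series_list.append(s)

# concat aligns on the contig index; missing counts become NaN, then 0
df_full = pd.concat(series_list, axis=1).fillna(0)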
I have recently produced several thousand shapefile outputs and accompanying .dbf files from an atmospheric model (HYSPLIT) on a Unix system. The converter txt2dbf is used to convert the shapefile attribute tables (text files) to .dbf.
Unfortunately, something has gone wrong (probably a separator/field-length error), because there are 2 problems with the output .dbf files, as follows:
1. Some fields of the dbf contain data that should not be there. This data has "spilled over" from neighbouring fields.
2. An additional field has been added that should not be there (it actually comes from a section of the first record of the text file, "1000 201").
This is an example of the first record in the output dbf (retrieved using dbview unix package):
Trajnum : 1001 2
Yyyymmdd : 0111231 2
Time : 300
Level : 0.
1000 201:
Here's what I expected:
Trajnum : 1000
Yyyymmdd : 20111231
Time : 2300
Level : 0.
Separately, I'm looking at how to prevent this from happening again, but ideally I'd like to be able to repair the existing .dbf files. Unfortunately the text files are removed for each model run, so "fixing" the .dbf files is the only option.
My approaches to the above problems are:
Extract the information from the fields that do exist to a new variable using dbf.add_fields and dbf.write (python package dbf), then delete the old incorrect fields using dbf.delete_fields.
Delete the unwanted additional field.
This is what I've tried:
with dbf.Table(db) as db:
    db.add_fields("TRAJNUMc C(4)")  # create new fields
    db.add_fields("YYYYMMDDc C(8)")
    db.add_fields("TIMEc C(4)")
    for record in db:  # extract data from fields
        dbf.write(TRAJNUMc=int(str(record.Trajnum)[:4]))
        dbf.write(YYYYMMDDc=int(str(record.Trajnum)[-1:] + str(record.Yyyymmdd)[:7]))
        dbf.write(TIMEc=record.Yyyymmdd[-1:] + record.Time[:])
    db.delete_fields('Trajnum')  # delete the incorrect fields
    db.delete_fields('Yyyymmdd')
    db.delete_fields('Time')
    db.delete_fields('1000 201')  # delete the unwanted field
    db.pack()
But this produces the following error:
dbf.ver_2.BadDataError: record data is not the correct length (should be 31, not 30)
Given the apparent problem with the txt2dbf conversion, I'm not surprised to find an error in the record data length. However, does this mean that the file is completely corrupted and that I can't extract the information I need (frustrating, because I can see that it exists)?
EDIT:
Rather than attempting to edit the 'bad' .dbf files, it seems a better approach to 1. extract the required data from the bad files to a text file and then 2. write that text to a new dbf. (See Ethan Furman's comments/answer below.)
EDIT:
An example of a faulty .dbf file that I need to fix/recover data from can be found here:
https://www.dropbox.com/s/9y92f7m88a8g5y4/p0001120110.dbf?dl=0
An example .txt file from which the faulty dbf files were created can be found here:
https://www.dropbox.com/s/d0f2c0zehsyy8ab/attTEST.txt?dl=0
To fix the data and recreate the original text file, this snippet should help:
import dbf

table = dbf.Table('/path/to/scramble/table.dbf')
with table:
    fixed_data = []
    for record in table:
        # convert to str/bytes while skipping the delete flag
        data = record._data[1:].tostring()
        trajnum = data[:4]
        ymd = data[4:12]
        time = data[12:16]
        level = data[16:].strip()
        # keep one list of fields per record (extend would flatten them)
        fixed_data.append([trajnum, ymd, time, level])

with open('repaired_data.txt', 'w') as new_file:
    for line in fixed_data:
        new_file.write(','.join(line) + '\n')
Assuming all your data files look like your sample (the big IF being the data has no embedded commas), then this rough code should help translate your text files into dbfs:
raw_data = open('some_text_file.txt').read().split('\n')
final_table = dbf.Table(
    'dest_table.dbf',
    'trajnum C(4); yyyymmdd C(8); time C(4); level C(9)',
)
with final_table:
    for line in raw_data:
        fields = line.split(',')
        final_table.append(tuple(fields))
# table has been populated and closed
Of course, you could get fancier and use actual date and number fields if you want to:
# the dbf field spec becomes
'trajnum N; yyyymmdd D; time C(4); level N'

# the appending-data loop becomes
for line in raw_data:
    trajnum, ymd, time, level = line.split(',')
    trajnum = int(trajnum)
    ymd = dbf.Date(ymd[:4], ymd[4:6], ymd[6:])
    level = int(level)
    final_table.append((trajnum, ymd, time, level))
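A quick sanity check on the rebuilt table (a sketch, assuming the dest_table.dbf created above) is to reopen it and print the first few records:

import dbf

check = dbf.Table('dest_table.dbf')
with check:
    for i, record in enumerate(check):
        print(record)
        if i >= 4:  # only look at the first few records
            break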
With the following Python code I want to parse an XML file; an extract of the XML file is shown below the code. I need to extract everything that comes after "inv: name =", which in this case is "'datasource roof height' and (value = 1000 or value = 2000 or value = 3000 or value = 4000 or value = 5000 or value = 6000)". Any ideas?
My Python code (so far):
from lxml import etree

doc = etree.parse("data.xml")
for con in doc.xpath("//specification"):
    for cons in con.xpath("./@body"):
        with open("output.txt", "w") as cons_out:
            cons_out.write(cons)
Part of the xml file:
<ownedRule xmi:type="uml:Constraint" xmi:id="EAID_OR000004_EE68_4efa_8E1B_8DDFA8F95FB8" name="datasource roof height">
  <constrainedElement xmi:idref="EAID_94F3B0A6_EE68_4efa_8E1B_8DDFA8F95FB8"/>
  <specification xmi:type="uml:OpaqueExpression" xmi:id="EAID_COE000004_EE68_4efa_8E1B_8DDFA8F95FB8" body="inv: name = 'datasource roof height' and (value = 1000 or value = 2000 or value = 3000 or value = 4000 or value = 5000 or value = 6000)"/>
</ownedRule>
XML parsers understand attributes and elements; what is present within those attributes or elements (the textual content) is of no concern to the parser.
In order to solve your problem you would need to split the string retrieved from the body attribute. Of course, I am assuming that the body attribute of all such elements has the same format, i.e. "inv: name = some content".
from lxml import etree

doc = etree.parse("data.xml")
with open("output.txt", "w") as cons_out:
    for con in doc.xpath("//specification"):
        for cons in con.xpath("./@body"):
            # keep only what follows "inv: name ="
            content = cons.split("inv: name =")[1]
            cons_out.write(content + "\n")