I have a page that uploads a file using apex_data_parser when the user clicks a button.
The SQL below, which uses apex_data_parser to upload a file with 550K rows (52 MB), takes about 180 minutes to finish.
The same file on the same page, but with some rows deleted so that it has 500K rows (48 MB), takes about 8 minutes to finish.
I also tested the same file with 750K rows and 40 MB (I deleted a description column to stay below 50 MB); it took 15 minutes.
Does anybody have any idea why the time is so different for files above 50 MB?
Reading the API documentation, the limits are 2 GB and 4 GB for the file.
I took the SQL from the sample upload app in the App Gallery:
FOR r_temp IN (SELECT line_number, col001, col002, col003, col004, col005, col006, col007, col008
                 FROM apex_application_temp_files f,
                      TABLE( apex_data_parser.parse(
                                 p_content   => f.blob_content,
                                 p_file_type => 2,
                                 p_skip_rows => 1,
                                 p_file_name => f.filename ) ) p
                WHERE f.name = p_name)
LOOP
...
It's a .csv file
Column Position   Column Name     Data Type        Format Mask
1                 LINE            NUMBER           -
2                 ACCOUNT         NUMBER           -
3                 DATETIME_CALL   DATE             YYYY"-"MM"-"DD" "HH24":"MI":"SS
4                 TYPE_CALL       VARCHAR2(255)    -
5                 CALL            NUMBER           -
6                 DURATION        NUMBER           -
7                 UNIT            VARCHAR2(50)     -
8                 PRICE           NUMBER           -
What I did next:
To simplify the problem, I changed the SQL statement to a simple count(*).
I created a demo account at Oracle Cloud and started an Autonomous Transaction Processing instance, using the same file and the same application to test. The results: for the file greater than 50 MB, it took 6 hours to execute the SQL count statement (see attachment below); for the 48 MB file, 3 minutes to execute the same SQL count statement. Maybe an apex_data_parser limit?
The chart below is interesting: User I/O goes way up, but only with files > 50 MB in my tests.
I took the 50 MB file that processed OK in 3 minutes and copied some rows to grow it to 70 MB (so the file is not corrupt).
I believe the answer can be found in this question: Oracle APEX apex_data_parser execution time explodes for big files
I don't think it's a function of the file size; it's more a function of the shape and width of the data.
180 minutes is a super long time. If you're able to reproduce this, can you examine the active database session and determine the active SQL statement and any associated wait event?
Also - what file format is this, and what database version are you using?
Related
"failureReason": "Job validation failed: Request field config is
invalid, expected an estimated total output size of at most 400 GB
(current value is 1194622697155 bytes).",
The actual input file was only 8 seconds long. It was created using the Safari MediaRecorder API on macOS.
"failureReason": "Job validation failed: Request field
config.editList[0].startTimeOffset is 0s, expected start time less
than the minimum duration of all inputs for this atom (0s).",
The actual input file was 8 seconds long. It was created using the desktop Chrome MediaRecorder API, with mimeType "webm; codecs=vp9", on macOS.
Note that Stack Overflow wouldn't allow me to include the tag google-cloud-transcoder suggested by "Getting Support" https://cloud.google.com/transcoder/docs/getting-support?hl=sr
Like Faniel mentioned, your first issue is that your video, at less than 10 seconds, is below the API's 10-second minimum.
Your second issue is that the "Duration" information is likely missing from the EBML headers of your .webm file. When you record with MediaRecorder, the duration of your video is set to N/A in the file headers because it is not known in advance. This means the Transcoder API will treat the length of your video as Infinity / 0. Some consider this a bug in Chromium.
To confirm this is your issue you can use ts-ebml or ffprobe to inspect the headers of your video. You can also use these tools to repair the headers. Read more about this here and here
Also, just try running the Transcoder API with this demo .webm, which has its duration information set correctly.
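As a quick way to confirm whether the Duration header is missing, something like the following works (a minimal sketch; it only shells out to ffprobe, and recording.webm is a placeholder file name):

import subprocess

def container_reports_duration(path: str) -> bool:
    # -show_format prints container-level metadata, including a "duration=" line;
    # a MediaRecorder .webm with broken headers typically shows duration=N/A.
    output = subprocess.run(
        ["ffprobe", "-v", "error", "-show_format", path],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in output.splitlines():
        if line.startswith("duration="):
            return line.split("=", 1)[1] not in ("N/A", "")
    return False

print(container_reports_duration("recording.webm"))  # hypothetical file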
This Google documentation states that the input file's length must be at least 5 seconds in duration and should be stored in Cloud Storage (for example, gs://bucket/inputs/file.mp4). A job validation error can occur when the inputs are not properly packaged and don't contain duration metadata, or contain incorrect duration metadata. When the inputs are not properly packaged, we can explicitly specify startTimeOffset and endTimeOffset in the job config to set the correct duration. If the estimated total output size (based on the duration, in seconds, reported by ffprobe) is more than 400 GB, it can result in a job validation error. We can use the following formula to estimate the output size.
estimatedTotalOutputSizeInBytes = bitrateBps * outputDurationInSec / 8;
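To get a feel for the numbers, a quick calculation along those lines (the bitrate and duration values here are made up for illustration):

def estimated_output_size_bytes(bitrate_bps: int, output_duration_sec: float) -> float:
    # estimatedTotalOutputSizeInBytes = bitrateBps * outputDurationInSec / 8
    return bitrate_bps * output_duration_sec / 8

# A hypothetical 8-second clip at 2.5 Mbps comes out to about 2.5 MB, far below
# the 400 GB cap -- which is why a missing or bogus duration (treated as
# Infinity) makes the estimate explode, as in the error message above.
print(estimated_output_size_bytes(2_500_000, 8))  # 2500000.0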
Thanks for the question and feedback. The Transcoder API currently has a minimum duration of 10 seconds which may be why the job wasn't successful.
My question comes directly from this one, although I'm only interested in UPDATE, and only that.
I have an application written in C/C++ which makes heavy use of SQLite, mostly SELECT/UPDATE, at a very frequent interval (about 20 queries every 0.5 to 1 second).
My database is not big, about 2500 records at the moment; here is the table structure:
CREATE TABLE player (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name VARCHAR(64) UNIQUE,
stats VARBINARY,
rules VARBINARY
);
Up to this point I did not use transactions because I was improving the code and wanted stability rather than performance.
Then I measured my database performance by merely executing 10 update queries like the following (in a loop with different values):
// 10 times execution of this
UPDATE player SET stats = ? WHERE (name = ?)
where stats is a JSON string of exactly 150 characters and name is 5-10 characters long.
Without transactions, the result is unacceptable: about 1 full second (0.096 s each).
With transactions, the time drops by about 7.5x: about 0.11 - 0.16 seconds (0.013 s each).
I tried deleting a large part of the database and/or re-ordering / deleting columns to see if that changes anything, but it did not. I get the above numbers even if the database contains just 100 records (tested).
I then tried playing with PRAGMA options:
PRAGMA synchronous = NORMAL
PRAGMA journal_mode = MEMORY
Gave me smaller times but not always, more like about 0.08 - 0.14 seconds
PRAGMA synchronous = OFF
PRAGMA journal_mode = MEMORY
Finally gave me extremely small times, about 0.002 - 0.003 seconds, but I don't want to use it since my application saves the database every second and there's a high chance of a corrupted database on OS / power failure.
My C SQLite code for queries is: (comments/error handling/unrelated parts omitted)
// start transaction
sqlite3_exec(db, "BEGIN TRANSACTION", NULL, NULL, NULL);
// prepare the query
sqlite3_stmt *statement = NULL;
int out = sqlite3_prepare_v2(db, query.c_str(), -1, &statement, NULL);
// bindings
for (size_t x = 0, sz = bindings.size(); x < sz; x++) {
    out = sqlite3_bind_text(statement, x + 1, bindings[x].text_value.c_str(),
                            bindings[x].text_value.size(), SQLITE_TRANSIENT);
    ...
}
// execute (a completed UPDATE returns SQLITE_DONE, not SQLITE_OK)
out = sqlite3_step(statement);
// finalize the statement regardless of the result
if (statement != NULL) {
    sqlite3_finalize(statement);
}
// end the transaction
sqlite3_exec(db, "END TRANSACTION", NULL, NULL, NULL);
As you can see, it's a pretty typical table, the number of records is small, and I'm doing a plain simple UPDATE exactly 10 times. Is there anything else I could do to decrease my UPDATE times? I'm using the latest SQLite 3.16.2.
NOTE: The timings above come directly from a single END TRANSACTION query. The queries are executed within a single transaction and I'm
using a prepared statement.
UPDATE:
I performed some tests with transactions enabled and disabled and with various update counts. I used the following settings:
VACUUM;
PRAGMA synchronous = NORMAL; -- def: FULL
PRAGMA journal_mode = WAL; -- def: DELETE
PRAGMA page_size = 4096; -- def: 1024
The results follows:
no transactions (10 updates)
0.30800 secs (0.0308 per update)
0.30200 secs
0.36200 secs
0.28600 secs
no transactions (100 updates)
2.64400 secs (0.02644 each update)
2.61200 secs
2.76400 secs
2.68700 secs
no transactions (1000 updates)
28.02800 secs (0.028 each update)
27.73700 secs
..
with transactions (10 updates)
0.12800 secs (0.0128 each update)
0.08100 secs
0.16400 secs
0.10400 secs
with transactions (100 updates)
0.088 secs (0.00088 each update)
0.091 secs
0.052 secs
0.101 secs
with transactions (1000 updates)
0.08900 secs (0.000089 each update)
0.15000 secs
0.11000 secs
0.09100 secs
My conclusion is that with transactions there's effectively no time cost per query. Perhaps the times get bigger with a colossal number of updates, but I'm not interested in those numbers. There's literally no time cost difference between 10 and 1000 updates in a single transaction. However, I'm wondering if this is a hardware limit on my machine that I can't do much about. It seems I cannot go below ~100 milliseconds using a single transaction across 10-1000 updates, even by using WAL.
Without transactions there's a fixed time cost of around 0.025 seconds per update.
With such small amounts of data, the time for the database operation itself is insignificant; what you're measuring is the transaction overhead (the time needed to force the write to the disk), which depends on the OS, the file system, and the hardware.
If you can live with its restrictions (mostly, no network), you can use asynchronous writes by enabling WAL mode.
You may still be limited by the time it takes to commit a transaction. In your first example each transaction took about 0.10 seconds to complete, which is pretty close to the transaction time for inserting 10 records. What kind of results do you get if you batch 100 or 1000 updates in a single transaction (see the sketch below)?
Also, SQLite expects around 60 transactions per second on an average hard drive, while you're only getting about 10. Could your disk performance be the issue here?
https://sqlite.org/faq.html#q19
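To illustrate the batching idea above (the question's code is C, but the pattern is the same in any binding), here is a minimal sketch using Python's built-in sqlite3 module against a hypothetical copy of the player table; players.db and the sample rows are placeholders:

import sqlite3

conn = sqlite3.connect("players.db")               # hypothetical database file
conn.execute("PRAGMA journal_mode = WAL")          # WAL, as suggested above
conn.execute("PRAGMA synchronous = NORMAL")
conn.execute("""CREATE TABLE IF NOT EXISTS player (
                    id    INTEGER PRIMARY KEY AUTOINCREMENT,
                    name  VARCHAR(64) UNIQUE,
                    stats VARBINARY,
                    rules VARBINARY)""")

# Hypothetical batch of (stats, name) pairs -- in the real application these
# would be the ~150-character JSON blobs from the question.
updates = [('{"hp": 100}', 'alice'), ('{"hp": 80}', 'bob')]

with conn:  # one BEGIN ... COMMIT around the whole batch: one sync to disk, not one per UPDATE
    conn.executemany("UPDATE player SET stats = ? WHERE name = ?", updates)

conn.close()

The point is that all 10 (or 1000) UPDATEs share a single commit, so the fixed per-transaction cost is paid only once.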
Try adding an index to your database:
CREATE INDEX IDXname ON player (name)
Hello there,
I'm looking for a way to limit the lifetime of a file in Python, that is, to make a file which will be automatically deleted 5 minutes after its creation.
Problem:
I have a Django-based web page with a service that generates plots (from user-submitted input data) which are shown on the page as .png images. The images are stored on disk upon creation.
Image files are created on a per-session basis and should only be available for a limited time after the user has seen them; they should be deleted 5 minutes after they have been created.
Possible solutions:
I've looked at Python's tempfile, but that is not what I need, because the user should be able to return to the page containing the image without waiting for it to be generated again. In other words, it shouldn't be destroyed as soon as it is closed.
The other way that comes to mind is to call some sort of external bash script which would delete files older than 5 minutes.
Does anybody know a preferred way of doing this?
Ideas can also include changing the logic of showing/generating the image files.
You should write a Django custom management command to delete old files that you can then call from cron.
If you want no files older than 5 minutes, then you need to call it every 5 minutes, of course. And yes, it would run unnecessarily when there are no users, but that shouldn't worry you too much.
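A minimal sketch of such a command (the app name, command name, and plots directory are placeholders; adjust them to wherever your images are actually written):

# yourapp/management/commands/delete_old_plots.py  (hypothetical path and name)
import os
import time

from django.conf import settings
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = "Delete generated plot images older than 5 minutes"

    def handle(self, *args, **options):
        plot_dir = os.path.join(settings.MEDIA_ROOT, "plots")  # assumed location
        cutoff = time.time() - 5 * 60
        for name in os.listdir(plot_dir):
            path = os.path.join(plot_dir, name)
            if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
                os.remove(path)

A crontab entry along the lines of */5 * * * * cd /path/to/project && python manage.py delete_old_plots then runs it every 5 minutes.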
OK, that might be a good approach, I guess...
You can write a script that checks your directory, deletes outdated files, and then picks the oldest of the remaining files. Calculate how much time has passed since that file was created, work out the remaining time until it should be deleted, and call sleep with that remaining time. When the sleep ends and another loop begins, there will be (at least) one file to delete. If there are no files in the directory, set the sleep time to 5 minutes.
That way you ensure that each file is deleted roughly 5 minutes after creation, but when lots of files are created simultaneously the sleep time shrinks greatly and your function starts checking the directory more and more often. To avoid that, add some latency to the sleep before starting another loop: for example, if the oldest file is 4 minutes old, you can sleep for 60+30 seconds (adding 30 seconds to all time calculations).
An example:
from datetime import datetime
import time
import os
def clearDirectory():
    directory = '/path/to/your/directory'
    while True:
        _time_list = []
        _now = time.mktime(datetime.now().timetuple())
        for _f in os.listdir(directory):
            _path = os.path.join(directory, _f)
            if os.path.isfile(_path):
                _f_time = os.path.getmtime(_path)  # get file creation/modification time
                if _now - _f_time >= 300:
                    os.remove(_path)  # delete outdated file (older than 5 minutes)
                else:
                    _time_list.append(_f_time)  # remember the times of the files we kept
        # after checking all files, sleep until the oldest remaining file turns 5 minutes old;
        # if the directory is empty, just sleep for the full 5 minutes
        _sleep_time = (300 - (_now - min(_time_list))) if _time_list else 300
        time.sleep(_sleep_time)
But as I said, if files are created often, it is better to add some latency to the sleep time:
time.sleep(_sleep_time + 30)  # sleep 30 seconds more, so other files may become outdated during that time too
Also, it is worth reading about the getmtime function for details.
I have a program that does some math in an SQL query. There are hundreds of thousands of rows (device measurements) in an SQLite table; using this query, the application breaks these measurements into groups of, for example, 10000 records and calculates the average for each group. Then it returns the average value for each of these groups.
The query looks like this:
SELECT strftime('%s',Min(Stamp)) AS DateTimeStamp,
AVG(P) AS MeasuredValue,
((100 * (strftime('%s', [Stamp]) - 1334580095)) /
(1336504574 - 1334580095)) AS SubIntervalNumber
FROM LogValues
WHERE ((DeviceID=1) AND (Stamp >= datetime(1334580095, 'unixepoch')) AND
(Stamp <= datetime(1336504574, 'unixepoch')))
GROUP BY ((100 * (strftime('%s', [Stamp]) - 1334580095)) /
(1336504574 - 1334580095)) ORDER BY MIN(Stamp)
The numbers in this query are substituted by my application with actual values.
I don't know if I can optimize this query any further (if anyone could help me do so, I'd really appreciate it).
This SQL query can be executed using the SQLite command line shell (sqlite3.exe). On my Intel Core i5 machine it takes 4 seconds to complete (there are 100000 records in the database being processed).
Now, if I write a C program using the sqlite.h C interface, I wait 14 seconds for exactly the same query to complete. This C program "waits" during these 14 seconds on the first sqlite3_step() function call (any following sqlite3_step() calls are executed immediately).
From the SQLite download page I downloaded the SQLite command line shell's source code and built it using Visual Studio 2008. I ran it and executed the query. Again, 14 seconds.
So why does the prebuilt command line tool, downloaded from the SQLite website, take only 4 seconds, while the same tool built by me takes about 4 times longer to execute?
I am running 64-bit Windows. The prebuilt tool is an x86 process. It also does not seem to be multicore-optimized: in Task Manager, during query execution, I can see only one core busy, for both my build and the prebuilt SQLite shell.
Is there any way I could make my C program execute this query as fast as the prebuilt command line tool does?
Thanks!
My RRD file is not updating. What is the reason?
The graph shows the legend with: -nanv
I created the RRD file using this syntax:
rrdtool create ups.rrd --step 300
DS:input:GAUGE:600:0:360
DS:output:GAUGE:600:0:360
DS:temp:GAUGE:600:0:100
DS:load:GAUGE:600:0:100
DS:bcharge:GAUGE:600:0:100
DS:battv:GAUGE:600:0:100
RRA:AVERAGE:0.5:12:24
RRA:AVERAGE:0.5:288:31
Then I updated the file with this syntax:
rrdtool update ups.rrd N:$inputv:$outputv:$temp:$load:$bcharge:$battv
And graphed it with this:
rrdtool graph ups-day.png
-t "ups "
-s -1day
-h 120 -w 616
-a PNG
-cBACK#F9F9F9
-cSHADEA#DDDDDD
-cSHADEB#DDDDDD
-cGRID#D0D0D0
-cMGRID#D0D0D0
-cARROW#0033CC
DEF:input=ups.rrd:input:AVERAGE
DEF:output=ups.rrd:output:AVERAGE
DEF:temp=ups.rrd:temp:AVERAGE
DEF:load=ups.rrd:load:AVERAGE
DEF:bcharge=ups.rrd:bcharge:AVERAGE
DEF:battv=ups.rrd:battv:AVERAGE
LINE:input#336600
AREA:input#32CD3260:"Input Voltage"
GPRINT:input:MAX:" Max %lgv"
GPRINT:input:AVERAGE:" Avg %lgv"
GPRINT:input:LAST:"Current %lgv\n"
LINE:output#4169E1:"Output Voltage"
GPRINT:output:MAX:"Max %lgv"
GPRINT:output:AVERAGE:" Avg %lgv"
GPRINT:output:LAST:"Current %lgv\n"
LINE:load#FD570E:"Load"
GPRINT:load:MAX:" Max %lg%%"
GPRINT:load:AVERAGE:" Avg %lg%%"
GPRINT:load:LAST:" Current %lg%%\n"
LINE:temp#000ACE:"Temperature"
GPRINT:temp:MAX:" Max %lgc"
GPRINT:temp:AVERAGE:" Avg %lgc"
GPRINT:temp:LAST:" Current %lgc"
You will need at least 13 updates, each 5 minutes apart (i.e., 12 PDPs (primary data points)), before you can get a single CDP (consolidated data point) written to your RRAs, enabling you to get a data point on the graph. This is because your smallest-resolution RRA has a count of 12, meaning you need 12 PDPs to make one CDP.
Until you have enough data to write a CDP, you have nothing to graph, and your graph will always have unknown data.
Alternatively, add a smaller-resolution RRA (maybe with a count of 1) so that you do not need to collect data for so long before you have a full CDP.
The update script needs to be run at exactly the same interval as defined in your database.
I see it has a step value of 300 so the database should be updated every 5 minutes.
Just place your update script in a cron job (you can also do the same for your graph script).
For example,
sudo crontab -e
If running it for the first time, choose your favorite editor (I usually go with Vim) and add the full path of your script, scheduled to run every 5 minutes. So add this (don't forget to adjust the path):
*/5 * * * * /usr/local/update_script > /dev/null && /usr/local/graph_script > /dev/null
Save it and wait a couple of minutes. I usually redirect the output to /dev/null to suppress anything the script might print; otherwise, whenever the executed script produces output, cron will send a notification with it.