Querying table with >1000 columns fails - questdb

I can create a table with 1100 columns and ingest data into it, but any query against it fails, for example selecting all values:
select * from iot_agg;
The read fails with the following error:
io.questdb.cairo.CairoException: [24] Cannot open file: /root/.questdb/db/table/iot_agg.d
at io.questdb.std.ThreadLocal.initialValue(ThreadLocal.java:36)
at java.lang.ThreadLocal.setInitialValue(ThreadLocal.java:180)
at java.lang.ThreadLocal.get(ThreadLocal.java:170)
at io.questdb.cairo.CairoException.instance(CairoException.java:38)
at io.questdb.cairo.ReadOnlyMemory.of(ReadOnlyMemory.java:135)
at io.questdb.cairo.ReadOnlyMemory.<init>(ReadOnlyMemory.java:44)
at io.questdb.cairo.TableReader.reloadColumnAt(TableReader.java:1031)
at io.questdb.cairo.TableReader.openPartitionColumns(TableReader.java:862)
at io.questdb.cairo.TableReader.openPartition0(TableReader.java:841)
at io.questdb.cairo.TableReader.openPartition(TableReader.java:806)
...

Ouroborus might be right in suggesting that the schema could be revisited, but regarding the actual error from Cairo:
24: OS error, too many open files
This is dependent on the OS that the instance is running on, and is tied to system-wide or user settings, which can be increased if necessary.
It is relatively common to hit limits like this with database engines that handle large numbers of files. The maximum number of open files is usually configured through kernel or user-level settings. On Linux and macOS, you can check the current limit with
ulimit -n
You can also use ulimit to set this to the value you need. For example, to set it to 10,000 for the current shell session:
ulimit -n 10000
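As a rough sanity check before changing the limit, the following Python sketch compares an estimate of the column files a full-table read might open against the limits of the current process. The one-file-per-column-per-partition figure is an assumption for illustration, not a statement about how QuestDB lays out every column type, and the function names are hypothetical.

```python
import resource


def estimate_open_files(columns: int, partitions: int, files_per_column: int = 1) -> int:
    """Rough count of column files a full scan may hold open at once.

    Assumption: one data file per column per partition. Column types that
    keep a second file alongside the data file would need files_per_column=2.
    """
    return columns * partitions * files_per_column


def nofile_limits() -> tuple[int, int]:
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def exceeds_soft_limit(needed: int) -> bool:
    """True if `needed` descriptors would not fit under the soft limit."""
    soft, _hard = nofile_limits()
    # RLIM_INFINITY means "no limit", so nothing can exceed it.
    return soft != resource.RLIM_INFINITY and needed > soft
```

For the table in the question, `estimate_open_files(1100, 1)` already gives 1100 files for a single partition, which is above the soft limit of 1024 that many Linux systems use by default.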
edit: There is official documentation for capacity planning when deploying QuestDB which takes several factors such as CPU, memory, network capacity, and a combination of these elements into consideration. For more information, see the capacity planning guide

How can I aggregate intel amplifier batch results?

I'm solving a number of instances with my code and I need to find the worst hotspots, where "worst" is defined as a hotspot over a wide range of instances. So for every instance I have collected hotspot analysis data in batch mode using amplxe-cl. Now I'd like to aggregate this data and analyze it together. Is there any way to do this with vtune?
Update:
This is not an MPI application. There are a number of different datasets (problems, instances, pick your term :-) that need to be processed by my application. Depending on the data in a single instance the application can take very different turns while processing it, so running the application on different instances can result in different hotspots. The purpose of the aggregation would be, as #ArunJose_Intel guessed, to find hotspots that are common to all runs, that is, present in the processing of all kinds of instances.
I can collect hotspot analysis for every instance easily using batch mode and I can inspect them individually, but I'd like to see an aggregate analysis.
Of course, I could just process them in one run one after the other, but that would take several weeks, while I can process them as individual problems in a few hours on a cluster of identical machines.
In vtune it is not possible to combine multiple GUI reports. You have an option to compare across two different reports to see what has changed but clearly this is not what you are looking for.
A workaround you could try is to create command line reports from the vtune results you have already collected. These command line reports can be in easily parsable data formats like CSV. Once you have reports in these formats, you could write your own scripts/code to aggregate multiple CSV reports, with whatever aggregation logic you wish.
Please find below some samples to create command line reports
1) Generate a Hotspots report from the r001hs result on Linux*, and save it to /home/test/MyReport.txt in text format.
vtune -report hotspots -result-dir r001hs -report-output /home/test/MyReport.txt
2) Generate a hotspots report in the CSV format from the most recent result and save it in the current Linux working directory. Use the format option with the csv argument and the csv-delimiter option to specify a delimiter, such as comma.
vtune -R hotspots -report-output MyReport.csv -format csv -csv-delimiter comma
For more information
https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top/command-line-interface/generating-command-line-reports.html
https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top/command-line-interface/generating-command-line-reports/saving-and-formatting-reports.htm
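Once one CSV report per instance exists, the aggregation itself could look roughly like the Python sketch below. It counts in how many reports each function appears and sums its time, so that functions common to all runs float to the top. The column names "Function" and "CPU Time" are assumptions about the report header and should be adjusted to whatever your reports actually contain; `aggregate_hotspots` is a hypothetical helper, not part of vtune.

```python
import csv
from collections import defaultdict


def aggregate_hotspots(reports, name_col="Function", time_col="CPU Time"):
    """Combine several comma-delimited hotspot reports.

    reports: iterable of open text files (or any iterables of CSV lines).
    Returns (function, runs_seen_in, total_time) tuples, ordered by the
    number of runs a function appears in, then by total time, descending.
    """
    runs = defaultdict(int)
    total = defaultdict(float)
    for report in reports:
        seen = set()
        for row in csv.DictReader(report):
            name = row[name_col]
            total[name] += float(row[time_col])
            seen.add(name)
        # Count each function once per report, however many rows it has.
        for name in seen:
            runs[name] += 1
    return sorted(
        ((name, runs[name], total[name]) for name in runs),
        key=lambda item: (-item[1], -item[2]),
    )
```

A function whose run count equals the number of reports is a hotspot in every instance. Sorting by run count first matches the stated goal of finding hotspots that are present across all runs rather than the single most expensive one.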

Google Compute Engine memory Utilization

I got a "recommendation" to add more memory to my 1 vCPU, 1.75 GB Google Compute Engine instance. I added a GB, and all is quiet.
However it has increased my overall cost about 50% (if I am reading it right - a task in and of itself), and I'd like to know what my memory utilization is.
I see it tracking CPU, Disk, and network, but not memory. I looked at the monitoring options and don't see memory as an option for GCE.
How do I monitor memory over time? I want to make sure I am running efficiently AND cheaply.
(See also this question, which never got answered: Memory usage metric identifier Google Compute Engine)
There are a couple of methods you could use to monitor the memory usage of a Compute Engine instance.
The first involves the use of the Stackdriver Monitoring Agent. This can be installed on the instance, and provides additional metrics including memory usage. For more information on this please see here.
Alternatively you could use a more 'Linux-esque' approach. For example, you could use the watch command to track used/free memory at intervals and output this to a file. The following command would allow you to do this:
watch -n 2 'free -m | tee -a memory.log'
This would append your memory usage (in MB) to an output file ('memory.log') at 2-second intervals. To change the interval, change the number 2 to however many seconds you require.
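If you would rather log a single utilization figure than the full output of free, a small Python sketch along these lines could work. It assumes a Linux guest whose /proc/meminfo exposes the MemTotal and MemAvailable fields; the function names and the log format are illustrative only.

```python
import time


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value}, values in kB."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and rest.split():
            values[name.strip()] = int(rest.split()[0])
    return values


def used_percent(meminfo: dict[str, int]) -> float:
    """Share of memory that is not available to new workloads, in percent."""
    total = meminfo["MemTotal"]
    return 100.0 * (total - meminfo["MemAvailable"]) / total


def log_memory(log_path: str = "memory.log", interval_s: float = 2.0, samples: int = 5) -> None:
    """Append `samples` timestamped utilization readings to log_path."""
    for _ in range(samples):
        with open("/proc/meminfo") as src:
            pct = used_percent(parse_meminfo(src.read()))
        with open(log_path, "a") as log:
            log.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {pct:.1f}\n")
        time.sleep(interval_s)
```

Calling `log_memory(samples=1800)` would cover about an hour at the default 2-second interval, which gives a basis for deciding whether the extra memory is actually being used.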

Could not allocate a new page for database ‘TEMPDB’ because of insufficient disk space in filegroup ‘DEFAULT’

An ETL developer reports that they have been trying to run our weekly and daily processes on ADW consistently. For the most part these execute without exception, but we are now getting this error:
“Could not allocate a new page for database ‘TEMPDB’ because of insufficient disk space in filegroup ‘DEFAULT’. Create the necessary space by dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.”
Is there a limit on TEMPDB space associated with the DWU setting?
The database is limited to 100TB (per the portal) and not full.
Azure SQL Data Warehouse does allocate space for a tempdb, at around 399 GB per 100 DWU. Reference here.
What DWU are you using at the moment? Consider temporarily raising your DWU (aka service objective) and lowering it again when your batch process is finished, or refactoring your job to be less dependent on tempdb.
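To judge how far the DWU would need to be raised, the arithmetic can be sketched in Python as follows. It simply scales the 399 GB per 100 DWU figure quoted above linearly. That linearity and the rounding to multiples of 100 are assumptions, since real service objectives come in fixed tiers, so round the result up to the next tier that is actually offered.

```python
import math

TEMPDB_GB_PER_100_DWU = 399  # figure quoted in the answer above


def tempdb_capacity_gb(dwu: int) -> float:
    """Approximate tempdb allocation for a given DWU setting."""
    return TEMPDB_GB_PER_100_DWU * dwu / 100


def min_dwu_for_tempdb(required_gb: float) -> int:
    """Smallest multiple of 100 DWU whose tempdb allocation covers required_gb."""
    units = math.ceil(required_gb / TEMPDB_GB_PER_100_DWU)
    return max(units, 1) * 100
```

For example, a batch that needs about 1000 GB of tempdb would call for at least 300 DWU under these assumptions.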
It might also be worth checking your workload for anything like cartesian products, excessive sorting, over-dependency on temp tables etc to see if any optimisation can be done.
Have a look at the Explain Plans for your code, and see whether you have a lot more data movement going on than you expect. If you find that one query moves a lot more data into Q tables, you can probably tune it to avoid the data movement (which may mean redesigning tables to distribute on a different key).

Reading many small files from S3 very slow

Loading many small files (>200,000 files of about 4 KB each) from an S3 bucket into HDFS via Hive or Pig on AWS EMR is extremely slow. It seems that only one mapper is used to get the data, though I cannot figure out exactly where the bottleneck is.
Pig Code Sample
data = load 's3://data-bucket/' USING PigStorage(',') AS (line:chararray);
Hive Code Sample
CREATE EXTERNAL TABLE data (value STRING) LOCATION 's3://data-bucket/';
Are there any known settings that speed up the process or increase the number of mappers used to fetch the data?
I tried the following without any noticeable effects:
Increase #Task Nodes
set hive.optimize.s3.query=true
manually set #mappers
Increase instance type from medium up to xlarge
I know that s3distcp would speed up the process, but I could only get better performance by doing a lot of tweaking including setting #workerThreads and would prefer changing parameters directly in my PIG/Hive scripts.
You can either:
use distcp to merge the files before your job starts: http://snowplowanalytics.com/blog/2013/05/30/dealing-with-hadoops-small-files-problem/
have a Pig script that will do it for you, once.
If you want to do it through Pig, you need to know how many mappers are spawned. You can play with the following parameters:
-- with false, small files are combined and the mapper count follows the split size below; set to true for one mapper per file.
SET pig.noSplitCombination false;
-- choose the size so that SUM(size) / X = wanted number of mappers
SET pig.maxCombinedSplitSize 250000000;
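The SUM(size) / X relationship can be worked through with a short Python sketch. It assumes splits are filled up to the configured size and that 4 KB means 4096 bytes; the actual mapper count Pig ends up with may differ, and both function names are hypothetical.

```python
import math


def expected_mappers(total_bytes: int, max_combined_split_size: int) -> int:
    """Approximate mapper count: total input size divided by the split size."""
    return max(1, math.ceil(total_bytes / max_combined_split_size))


def split_size_for(total_bytes: int, wanted_mappers: int) -> int:
    """Value for pig.maxCombinedSplitSize that yields about wanted_mappers."""
    return math.ceil(total_bytes / wanted_mappers)


# 200,000 files of 4 KB each, as in the question (about 819 MB in total).
total = 200_000 * 4096
print(expected_mappers(total, 250_000_000))
print(split_size_for(total, 32))
```

With the 250000000 value shown above, this input would be read by only about 4 mappers under these assumptions, so a smaller split size is needed if more parallelism is wanted.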
Please provide metrics for those cases.

How to increase Mappers and Reducer in Apache TEZ

I know this is a simple question, but I need some help from this community. I created a partitioned table in ORC format. When I load data into it from a non-partitioned table that points to a 2 GB file with 210 columns, I see that the number of mappers is 2 and the number of reducers is 2. Is there a way to increase the number of mappers and reducers? My assumption is that we can't set the number of mappers and reducers as in MR 1.0, and that it is based on settings like YARN container size and mapper minimum and maximum memory. Can anyone explain how Tez calculates the number of mappers and reducers? What is the best value for the memory size settings so that I don't run into Java heap space or Java out-of-memory problems? My file size may grow up to 100 GB. Please help me with this.
You can still set the number of mappers and reducers in Yarn. Have you tried that? If so, please get back here.
Yarn changes the underlying execution mechanism, but #mappers and #reducers describe the job requirements, not the way the job resources are allocated (which is how Yarn and MRv1 differ).
Traditional Map/Reduce has a hard-coded number of map and reduce "slots". As you say, Yarn uses containers, which are per-application. Yarn is thus more flexible. But #mappers and #reducers are inputs of the job in both cases. Also in both cases, the actual number of mappers and reducers may differ from the requested number. Typically the #reducers would be either
(a) precisely the number that was requested
(b) exactly ONE reducer - that is if the job required it such as in total ordering
For the memory settings, if you are using hive with tez, the following 2 settings will be of use to you:
1) hive.tez.container.size - this is the size of the Yarn Container that will be used ( value in MB ).
2) hive.tez.java.opts - this is for the java opts that will be used for each task. If container size is set to 1024 MB, set java opts to say something like "-Xmx800m" and not "-Xmx1024m". YARN kills processes that use more memory than specified container size and given that a java process's memory footprint usually can exceed the specified Xmx value, setting Xmx to be the same value as the container size usually leads to problems.
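The relationship between the two settings can be sketched in Python as below. The 80% heap fraction is an assumption chosen to land near the "-Xmx800m for a 1024 MB container" example above, not a value prescribed by Hive or Tez, and the helper names are hypothetical.

```python
def tez_memory_settings(container_mb: int, heap_fraction: float = 0.8) -> dict[str, str]:
    """Derive a JVM heap size that leaves headroom inside the YARN container."""
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be strictly between 0 and 1")
    heap_mb = int(container_mb * heap_fraction)
    return {
        "hive.tez.container.size": str(container_mb),
        "hive.tez.java.opts": f"-Xmx{heap_mb}m",
    }


def as_hive_set_statements(settings: dict[str, str]) -> list[str]:
    """Render the settings as SET statements for a Hive session."""
    return [f"SET {key}={value};" for key, value in settings.items()]
```

For a 1024 MB container this gives a heap of 819 MB, leaving roughly 200 MB for non-heap memory so the process stays under the limit at which YARN would kill it.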