allocate largest partition with ansible when partition name varies - amazon-web-services

I'm using ansible to configure some AWS servers. All the servers have a 1024 GB, or larger, partition. I want to allocate this partition and assign it to /data.
I have an ansible script that does this on my test machine. However, when I tried running it against all my AWS machines it failed on some of them, complaining that the /dev/nvme1n1 device doesn't exist. The problem is that some of the servers have a 20 GB root partition separate from the larger partition, and some don't. That means sometimes the device I care about is nvme1n1 and sometimes it's nvme0n1.
I don't want to place a variable in the hosts file, since that file is being dynamically loaded from AWS anyway. Given that, what is the easiest way to look up the largest device and get its name in ansible, so I can correctly tell ansible to allocate whichever device is largest?

I assume that when you talk about "partitions" you mean "disks", as a partition will have a name like nvme0n1p1, while the disk would be called nvme0n1.
That said, I have not found an "ansible way" to do that, so I usually parse lsblk and do some grep magic. In your case, this is what you need to run:
- name: find largest disk
  shell: |
    set -euo pipefail
    lsblk -bl -o SIZE,NAME | grep -P 'nvme\dn\d$' | sort -nr | awk '{print $2}' | head -n1
  args:
    executable: /bin/bash
  register: largest_disk

- name: print name of largest disk
  debug:
    msg: "{{ largest_disk.stdout }}"
You can then use the name of the disk in the parted module to do whatever you need with it.
Apart from that, you should add some checks before formatting your disks so that you don't overwrite anything. For example, if your playbook already ran on a host, the disk might already be formatted and contain data; in that case you would not want to format it again, because that would overwrite the existing data.
Explanation:
lsblk -bl -o SIZE,NAME prints the size and name of all block devices
grep -P 'nvme\dn\d$' keeps only whole disks (partitions have a pXX suffix at the end, remember?)
sort -nr sorts the output numerically by the first column, so you get the largest on top
awk '{print $2}' prints only the second column (that is, the name)
head -n1 returns the first line (containing the name of the largest disk)
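If you'd rather do the parsing in a script instead of a shell pipeline (for example via the script module), a rough Python equivalent of the same selection logic might look like this - a sketch only, assuming the same lsblk output format as above:

import re
import subprocess

def largest_nvme_disk():
    # Same data the shell pipeline above uses: size in bytes plus device name.
    out = subprocess.run(
        ["lsblk", "-bl", "-o", "SIZE,NAME"],
        check=True, capture_output=True, text=True,
    ).stdout
    disks = []
    for line in out.splitlines()[1:]:           # skip the SIZE NAME header
        size, name = line.split()
        if re.fullmatch(r"nvme\dn\d", name):    # whole disks only, no pXX partitions
            disks.append((int(size), name))
    return max(disks)[1] if disks else None

print(largest_nvme_disk())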

Related

Querying table with >1000 columns fails

I can create and ingest data into a table with 1100 columns, but when I try to run any kind of query on it, such as selecting all values:
select * from iot_agg;
It looks like I cannot read it; I get the following error:
io.questdb.cairo.CairoException: [24] Cannot open file: /root/.questdb/db/table/iot_agg.d
at io.questdb.std.ThreadLocal.initialValue(ThreadLocal.java:36)
at java.lang.ThreadLocal.setInitialValue(ThreadLocal.java:180)
at java.lang.ThreadLocal.get(ThreadLocal.java:170)
at io.questdb.cairo.CairoException.instance(CairoException.java:38)
at io.questdb.cairo.ReadOnlyMemory.of(ReadOnlyMemory.java:135)
at io.questdb.cairo.ReadOnlyMemory.<init>(ReadOnlyMemory.java:44)
at io.questdb.cairo.TableReader.reloadColumnAt(TableReader.java:1031)
at io.questdb.cairo.TableReader.openPartitionColumns(TableReader.java:862)
at io.questdb.cairo.TableReader.openPartition0(TableReader.java:841)
at io.questdb.cairo.TableReader.openPartition(TableReader.java:806)
...
Ouroborus might be right in suggesting that the schema could be revisited, but regarding the actual error from Cairo:
24: OS error, too many open files
This is dependent on the OS that the instance is running on, and is tied to system-wide or user settings, which can be increased if necessary.
It is relatively common to hit limits like this with database engines that handle large numbers of files. The maximum number of open files is commonly configured with kernel variables or user limits. Checking the current limit for open files can be done on Linux and macOS with
ulimit -n
You can also use ulimit to set this to a value you need. If you need to set it to 10,000, for example, you can do this with:
ulimit -n 10000
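If you want to check the same limit from code rather than the shell, a small illustrative Python sketch (run on the same host as the database) would be:

import resource

# Soft and hard limits on open file descriptors for the current process;
# the soft limit is what ulimit -n reports.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft =", soft, "hard =", hard)

The soft limit can be raised up to the hard limit at runtime; raising the hard limit itself requires system-level configuration.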
edit: There is official documentation on capacity planning when deploying QuestDB, which takes factors such as CPU, memory, and network capacity into consideration. For more information, see the capacity planning guide.

upload greater than 5TB object to google cloud storage bucket

The maximum size for a single uploaded object is 5 TB. How does one back up a larger single workload? Say I have a single file that is 10 TB or more that needs to be backed up to cloud storage.
Also, a related question: if the 10 TB is spread across multiple files (each less than 5 TB) in a single folder, that shouldn't affect anything, correct? A single object can't be greater than 5 TB, but there isn't a limit on the actual bucket size. Say a folder contains 3 objects totalling 10 TB - will that upload be automatically split across multiple buckets (console or gsutil upload)?
Thanks
You are right, the current size limit for individual objects is 5 TB, so you would have to split your file.
As for a limit on total bucket size, there is none documented. In fact, the overview says "Cloud Storage provides worldwide, highly durable object storage that scales to exabytes of data."
You might take a look into the best practices of GCS.
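To illustrate the splitting idea, here is a minimal Python sketch that cuts a large local file into fixed-size chunks, each well under the 5 TB object limit. The chunk size, buffer size, and file name are placeholders; each chunk would then be uploaded as an ordinary object and concatenated again in order on restore.

BUF = 64 * 1024 * 1024                      # copy in 64 MB buffers

def split_file(path, chunk_size=1024**4):   # 1 TB per chunk (placeholder)
    # Writes path.00000, path.00001, ... each at most chunk_size bytes.
    index = 0
    with open(path, "rb") as src:
        while True:
            remaining = chunk_size
            data = src.read(min(BUF, remaining))
            if not data:                    # end of the source file
                break
            with open("%s.%05d" % (path, index), "wb") as dst:
                while data:
                    dst.write(data)
                    remaining -= len(data)
                    if remaining == 0:
                        break
                    data = src.read(min(BUF, remaining))
            index += 1
    return index                            # number of chunks written

split_file("backup.img")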
Maybe look over http://stromberg.dnsalias.org/~dstromberg/chunkup.html ?
I wrote it to back up to SRB, which is a sort of precursor to Google and Amazon bucketing, but chunkup itself doesn't depend on SRB or any other form of storage.
Usage: /usr/local/etc/chunkup -c chunkhandlerprog -n nameprefix [-t tmpdir] [-b blocksize]
-c specifies the chunk handling program to handle each chunk, and is required
-n specifies the nameprefix to use on all files created, and is required
-t specifies the temporary directory to write files to, and is optional. Defaults to $TMPDIR or /tmp
-b specifies the length of files to create, and is optional. Defaults to 1 gigabyte
-d specifies the number of digits to use in filenames. Defaults to 5
You can see example use of chunkup in http://stromberg.dnsalias.org/~strombrg/Backup.remote.html#SRB
HTH

WMI query to select disk containing system volume

I need to get some information (model and serial) of the disk that contains the system volume (usually C:). I'm using this query:
SELECT * FROM Win32_DiskDrive WHERE Index=0
My question is, is the disk with Index=0 always the disk containing the system volume?
Edit: I added an additional query to get the index of the disk containing the boot partition:
SELECT * FROM Win32_DiskPartition WHERE BootPartition=True
Then the original query changes to
SELECT * FROM Win32_DiskDrive WHERE Index={diskIndex}
I figured I'd be pretty safe this way. Suggestions for better solutions are always welcome :)
As stated, add an extra query to get the index of the disk containing the boot partition:
{diskIndex} = SELECT * FROM Win32_DiskPartition WHERE BootPartition=True
SELECT * FROM Win32_DiskDrive WHERE Index={diskIndex}
Unfortunately WMI doesn't seem to support JOINs, which would have made the query a little more efficient.
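For completeness, here is a rough sketch of those two steps chained together from Python, assuming the third-party wmi package is installed; the caveats about BootPartition in the answer below still apply.

import wmi

c = wmi.WMI()

# Step 1: find the index of the disk holding the partition flagged as bootable.
boot_partitions = c.query(
    "SELECT DiskIndex FROM Win32_DiskPartition WHERE BootPartition = TRUE")
disk_index = boot_partitions[0].DiskIndex

# Step 2: fetch the drive with that index and read its model and serial.
drive = c.query(
    "SELECT * FROM Win32_DiskDrive WHERE Index = %d" % disk_index)[0]
print(drive.Model, drive.SerialNumber)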
My question is, is the disk with Index=0 always the disk containing the system volume?
In my case the answer is No. My system disk has index 1.
Also your assumption that the system disk is always bootable is incorrect.
$ wmic os get "SystemDrive"
SystemDrive
C:
$ wmic logicaldisk where 'DeviceID="C:"' assoc /resultclass:Win32_DiskPartition
...\\DZEN\ROOT\CIMV2:Win32_DiskPartition.DeviceID="Disk #1, Partition #0"...
wmic diskdrive where 'Index=1' get "Caption"
Caption
OCZ-VERTEX4 // Yes, this is my system disk.
Also, your assumption about BootPartition usage is incorrect for cases where the boot manager is on another disk, as in my case:
wmic partition where 'DeviceID like "Disk_#1%"' get DeviceID,BootPartition
BootPartition DeviceID
FALSE Disk #1, Partition #0
wmic partition where 'BootPartition="TRUE"' get DeviceID,BootPartition
BootPartition DeviceID
TRUE Disk #4, Partition #0
TRUE Disk #3, Partition #0
As you can see, neither the system disk nor any of the bootable disks has Index=0 in my case. In fact, Index=0 belongs to a disk that is neither the system disk nor bootable.

How to pass arguments to streaming job on Amazon EMR

I want to produce the output of my map function, filtering the data by dates.
In local tests, I simply call the application passing the dates as parameters as:
cat access_log | ./mapper.py 20/12/2014 31/12/2014 | ./reducer.py
The parameters are then read in the map function:
#!/usr/bin/python
import sys

date1 = sys.argv[1]
date2 = sys.argv[2]
The question is:
How do I pass the date parameters to the map calling on Amazon EMR?
I am a beginner with MapReduce, so I will appreciate any help.
First of all, when you run a local test (and you should, as often as possible), the correct format (in order to reproduce how MapReduce works) is:
cat access_log | ./mapper.py 20/12/2014 31/12/2014 | sort | ./reducer.py | sort
That's the way the Hadoop framework works.
If you are working with a big file, you should do it in steps to verify the results of each stage, meaning:
cat access_log | ./mapper.py 20/12/2014 31/12/2014 > map_result.txt
cat map_result.txt | sort > map_result_sorted.txt
cat map_result_sorted.txt | ./reducer.py > reduce_result.txt
cat reduce_result.txt | sort > map_reduce_result.txt
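For reference, a minimal mapper.py along those lines might look like the sketch below; the log format, field position, and date format are assumptions, so adjust them to your actual access_log:

#!/usr/bin/python
import sys
from datetime import datetime

def parse(date_string):
    return datetime.strptime(date_string, "%d/%m/%Y")

# The date range arrives exactly as written in the step definition.
date1 = parse(sys.argv[1])
date2 = parse(sys.argv[2])

for line in sys.stdin:
    fields = line.split()
    try:
        # Assumption: the fourth field holds the request date as dd/mm/yyyy.
        when = parse(fields[3])
    except (IndexError, ValueError):
        continue
    if date1 <= when <= date2:
        # Emit key<TAB>value pairs for the reducer.
        print("%s\t1" % fields[0])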
In regard to your main question: it's the same thing.
If you are going to use the Amazon web console to create your cluster, then in the Add Step window you just write the following:
name: learning amazon emr
Mapper: (here they say: please give us the S3 path to your mapper; we will ignore that and just write our script name and parameters, no backslash...) mapper.py 20/12/2014 31/12/2014
Reducer: (the same as in the mapper) reducer.py (you can add here params too)
Input location: ...
Output location: ... (just remember to use a new output every time, or your task will fail)
Arguments: -files s3://cod/mapper.py,s3://cod/reducer.py (use your file path here, even if you add only one file use the -files argument)
That's it
If you are going further down the arguments route, I suggest you look at this guy's blog post on how to pass arguments in order to use only a single map/reduce file.
Hope it helped

Huge amount of data analysis

Say we have about 1e10 lines of log file every day; each one contains an ID number (an integer up to 15 digits long), a login time, and a logout time. An ID may log in and log out several times.
Question 1:
How do we count the total number of distinct IDs that have logged in? (We should not count any ID twice or more.)
I tried to use a hash table here, but I found that the memory required might be too large.
Question 2:
Calculate the time when the population of online users is largest.
I think we could split the day into 86400 seconds, then for each line of the log file add 1 to each second in the online interval. Or maybe I could sort the log file by login time?
You can do that in a *nix shell.
cut -f1 logname.log | sort | uniq | wc -l
cut -f2 logname.log | sort | uniq -c | sort -r
For question 2 to make sense, you probably have to log two things: user logs in and user logs out - two different events, each with the user ID. If this list is sorted by the time at which the event (either log in or log out) happened, you just scan it with a counter called currentusers: add 1 for each log in and subtract 1 for each log out. The maximum that number (current users) reaches is the value you're interested in; you will probably also want to track the time at which it occurred.
For question 1, forget C++ and use *nix tools. Assuming the log file is space delimited, then the number of unique logins in a given log is computed by:
$ awk '{print $1}' foo.log | sort | uniq | wc -l
GNU sort will happily sort files larger than memory. Here's what each piece is doing:
awk is extracting the first space-delimited column (the ID number).
sort is sorting those ID numbers, because uniq needs sorted input.
uniq is keeping only the unique numbers.
wc prints the number of lines, which will be the number of unique numbers.
Use a segment tree to store intervals of consecutive IDs.
Scan the logs for all the login events.
To insert an ID, first search for a segment containing it: if one exists, the ID is a duplicate. If it doesn't, search for the segments immediately before or after the ID. If they exist, remove them, merge them with the new ID as needed, and insert the resulting segment. If they don't exist, insert the ID as a segment of one element.
Once all IDs have been inserted, count them by summing the cardinalities of all the segments in the tree.
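A simplified sketch of that idea in Python, using a plain sorted list of [start, end] ranges and bisect rather than a true segment tree; the principle is the same, in that duplicates are detected and consecutive IDs collapse into a single interval:

import bisect

def insert_id(intervals, x):
    # intervals: sorted, disjoint, non-adjacent [start, end] ranges of consecutive IDs.
    i = bisect.bisect_left(intervals, [x, x])
    # Already covered by the interval starting at x, or by the one just before it?
    if i < len(intervals) and intervals[i][0] == x:
        return
    if i > 0 and intervals[i - 1][1] >= x:
        return
    # Merge with neighbouring intervals when x is adjacent to them.
    left = i > 0 and intervals[i - 1][1] == x - 1
    right = i < len(intervals) and intervals[i][0] == x + 1
    if left and right:
        intervals[i - 1][1] = intervals[i][1]
        del intervals[i]
    elif left:
        intervals[i - 1][1] = x
    elif right:
        intervals[i][0] = x
    else:
        intervals.insert(i, [x, x])

def count_ids(intervals):
    return sum(end - start + 1 for start, end in intervals)

intervals = []
for user_id in (5, 6, 7, 7, 9, 8, 1):
    insert_id(intervals, user_id)
print(intervals, count_ids(intervals))   # [[1, 1], [5, 9]] 6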
Assuming that:
a given ID may be logged in only once at any given time,
events are stored in chronological order (that's how logs normally are),
Scan the log and keep a counter c of the number of currently logged in users, as well as the max number m found, and the associated time t. For each log in, increment c, and for each log out decrement it. At each step update m and t if m is lower than c.
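A sketch of that scan, assuming each log record has already been reduced to (timestamp, +1) for a login and (timestamp, -1) for a logout, and the list is in chronological order:

def busiest_moment(events):
    # events: (timestamp, delta) pairs, delta +1 for login and -1 for logout,
    # sorted by timestamp.
    c = 0         # currently logged-in users
    m = 0         # maximum seen so far
    t = None      # time at which the maximum was reached
    for timestamp, delta in events:
        c += delta
        if c > m:
            m, t = c, timestamp
    return m, t

print(busiest_moment([(1, +1), (2, +1), (3, -1), (4, +1), (5, -1), (6, -1)]))   # (2, 2)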
For 1, you can try working with fragments of the data, one at a time, each small enough to fit into memory.
i.e. instead of
countUnique([1, 2, ... 1000000])
try
countUnique([1, 2, ... 1000]) +
countUnique([1001, 1002, ... 2000]) +
countUnique([2001, 2002, ...]) + ... + countUnique([999000, 999001, ... 1000000])
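A sketch of that partitioning in Python. The key point is to split by ID value (range or hash) so that every occurrence of a given ID lands in the same fragment and the per-fragment counts can simply be added up; this hash-based variant, along with its file names and field position, is only an assumption for illustration:

def count_unique(log_path, n_buckets=256):
    # First pass: scatter IDs into n_buckets temporary files by hash, so all
    # occurrences of an ID end up in the same bucket file.
    buckets = [open("bucket_%d.tmp" % i, "w") for i in range(n_buckets)]
    with open(log_path) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue
            user_id = fields[0]                # assumption: the ID is the first field
            buckets[hash(user_id) % n_buckets].write(user_id + "\n")
    for b in buckets:
        b.close()
    # Second pass: each bucket is now small enough for an in-memory set.
    total = 0
    for i in range(n_buckets):
        with open("bucket_%d.tmp" % i) as b:
            total += len(set(b))
    return total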
2 is a bit more tricky. Partitioning the work into manageable intervals (a second, as you suggested) is a good idea. For each second, find the number of people logged in during that second by using the following check:
def loggedIn(loginTime, logoutTime, currentTimeInterval):
    return loginTime <= currentTimeInterval <= logoutTime
Apply loggedIn to all 86400 seconds, then take the maximum of the resulting list of 86400 user counts to find the time when the population of online users is largest.