How to take the output of Sys.command as string in OCaml?

In OCaml, I have this piece of code:
let s =Sys.command ("minisat test.txt | grep 'SATIS' ");;
I want to capture the output of minisat test.txt | grep "SATIS", which is SATISFIABLE or UNSATISFIABLE, in the string s.
I am getting the following output:
val s : int = 0
So, how can I capture the output of this command as a string?
Also, is it possible to even import time?
This is the output I get when I try minisat test.txt in terminal
WARNING: for repeatability, setting FPU to use double precision
============================[ Problem Statistics ]=============================
| |
| Number of variables: 5 |
| Number of clauses: 3 |
| Parse time: 0.00 s |
| Eliminated clauses: 0.00 Mb |
| Simplification time: 0.00 s |
| |
============================[ Search Statistics ]==============================
| Conflicts | ORIGINAL | LEARNT | Progress |
| | Vars Clauses Literals | Limit Clauses Lit/Cl | |
restarts : 1
conflicts : 0 (-nan /sec)
decisions : 1 (0.00 % random) (inf /sec)
propagations : 0 (-nan /sec)
conflict literals : 0 (-nan % deleted)
Memory used : 8.00 MB
CPU time : 0 s

If you use just Sys, you can't.
However, you can create a temporary file (see the Filename module's documentation here) and tell the command to write its output to it:
let string_of_command () =
  let tmp_file = Filename.temp_file "" ".txt" in
  (* Redirect the shell pipeline's output into the temporary file. *)
  let _ = Sys.command @@ "minisat test.txt | grep 'SATIS' > " ^ tmp_file in
  let chan = open_in tmp_file in
  (* input_line raises End_of_file if grep matched nothing. *)
  let s = input_line chan in
  close_in chan;
  s
Note that this function is only a draft: you still have to handle potential errors properly. Anyway, you can adapt it to your needs.
You can avoid the temporary file trick by using the Unix library or more advanced libraries.

You have to use Unix.open_process_in or Unix.create_process if you want to capture the output.
Or, better, use a higher-level wrapper such as 'shell' (from ocamlnet).
But I wouldn't pipe it to grep (not portable). Parse the output with your favorite regex library inside OCaml.


nested if loop in splunk

I would like to write in splunk a nested if loop:
What I want to achieve
if buyer_from_France:
do eval percentage_fruits
if percentage_fruits> 10:
do summation
if summation>20:
if total_price>$50:
do(trigger bonus coupon)
My current code (that works):
| eventstats sum(buyers_fruits) AS total_buyers_fruits by location
| stats sum(fruits) as buyers_fruits by location buyers
| eval percentage_fruits=fruits_bought/fruits_sold
| table fruits_bought fruits_sold buyers
| where percentage_fruits > 10
| sort - percentage_fruits
How do I complete the syntax/expression for the 2nd (summation) and consequently, 3rd (total price), 4th if-loop (trigger)?
SPL doesn't do "loops". A close enough analogy is that each line in SPL is similar to a single command in bash (hence the pipe separator between commands). In other words, SPL processes events purely linearly, one command after another. Use a multi-condition eval with if(), like this:
index=ndx sourcetype=srctp
| eval myfield=if(match(fieldA,"someval") AND NOT match(fieldC,"notthis"),"all true","else val")
Or like this:
| eval myfield=if(match(fieldA,"someval"),if(match(fieldB,"otherval"),"matched A&B","matched A only"),if(NOT match(fieldC,"notthis"),"not A & not C","else val"))
If you can explain your use case and end goal better, we can probably provide better direction.

Python 2.7 - insert text into a file before closing the file

I am writing some text into a file:
import codecs
outfile = codecs.open("c:/temp/myfile.sps", "w+", "utf-8-sig")
#procedures for creating the text_to_write
outfile.write(text_to_write)
Now, what I want to do is to insert into the file an additional text, always at a certain line (say line 10), but this additional text is final only after all the procedures for creating the text_to_write. So the code for inserting the additional text, at line 10, should be the last code:
Is this possible without closing the file, reopening, and then saving again ?
(the reopen-insert-close approach is detailed here, but I would like to avoid it). I am looking for something like this:
import codecs
outfile = codecs.open("c:/temp/myfile.sps", "w+", "utf-8-sig")
#procedures for creating the text_to_write
outfile.write(text_to_write)
#code for inserting additional text at line 10
Since you don't know the exact position (in bytes) of the insertion point, you need to read the lines of the file content, insert the additional text after line 10, and write the file a second time.
Note: a way to open a file that works in both Python 2 and 3 is to use the io module (import io) instead of the codecs module.
For instance, you have the following text to write and additional text:
text_to_write = u"""\
| 1 | This
| 2 |
| 3 | text
| 4 |
| 5 | contains
| 6 |
| 7 | at
| 8 |
| 9 | least
| 10 |
| 11 | ten
| 12 |
| 13 | lines."""
additional_text = u"""\
| ++ | TEXT
"""
You can open the file for reading and writing. The file is created if it does not
exist, otherwise it is truncated. The stream is positioned at
the beginning of the file.
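with io.open("file.txt", mode="w+", encoding="utf-8-sig") as f:
    f.write(text_to_write)
    f.seek(0)  # rewind to read back what was just written
    lines = f.readlines()
    # Positional True (keepends) also works in Python 2.7.
    lines[10:10] = additional_text.splitlines(True)
    f.seek(0)
    f.writelines(lines)
This solution is not very efficient because you read back the content you just wrote.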
You can also process everything in memory and then write the file.
The result is:
| 1 | This
| 2 |
| 3 | text
| 4 |
| 5 | contains
| 6 |
| 7 | at
| 8 |
| 9 | least
| 10 |
| ++ | TEXT
| 11 | ten
| 12 |
| 13 | lines.
Another solution using a list in memory:
lines = text_to_write.splitlines(True)
lines[10:10] = additional_text.splitlines(True)
with io.open("file2.txt", mode="w+", encoding="utf-8-sig") as f:
    f.writelines(lines)
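The in-memory approach can be wrapped in a small helper. This is only a sketch: the function name insert_after_line, the sample body text, and the temporary output path are made up for illustration, and it assumes the whole text fits in memory before the file is written once.

```python
import io
import os
import tempfile


def insert_after_line(text, addition, line_no):
    """Return text with addition inserted after the 1-based line line_no."""
    lines = text.splitlines(True)  # True keeps the line endings
    lines[line_no:line_no] = addition.splitlines(True)
    return u"".join(lines)


# Hypothetical 13-line body standing in for text_to_write.
body = u"".join(u"line %d\n" % i for i in range(1, 14))
result = insert_after_line(body, u"| ++ | TEXT\n", 10)

# The file is opened and written only once, after the text is final.
path = os.path.join(tempfile.mkdtemp(), "file2.txt")
with io.open(path, mode="w", encoding="utf-8-sig") as f:
    f.write(result)
```

Building the text first and writing last avoids the read-back step entirely, at the cost of holding the full content in memory.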

Keep words starting with character/letter in Pandas | Python

I'm not sure how to do this in a dataframe context
I have the table below here with text information
"Get some new #turbo #stacks today!" |
"Is it one or three? #phone" |
"Mayhaps it be three afterall..." |
"So many new issues with phone... #iphone" |
And I want to edit it down to where only the words with a '#' symbol are kept, like in the result below.
"#turbo #stacks" |
"#phone" |
"" |
"#iphone" |
In some cases, I'd also like to drop the rows that end up empty, either by checking for NaN or with some other kind of condition, to get this result:
"#turbo #stacks" |
"#phone" |
"#iphone" |
Python 2.7 and pandas for this.
You could try using regex and extractall (rows without any match are dropped from the result):
df.TEXT.str.extractall(r'(#\w+)').groupby(level=0)[0].apply(' '.join)
0 #turbo #stacks
1 #phone
3 #iphone
Name: 0, dtype: object
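If the rows without hashtags should be kept as empty strings (the first desired result), one possible variant is str.findall followed by str.join. This is a sketch that assumes the column is called TEXT, as in the answer above, and that a reasonably recent pandas is available; the DataFrame is rebuilt from the question's sample rows.

```python
import pandas as pd

df = pd.DataFrame({"TEXT": [
    "Get some new #turbo #stacks today!",
    "Is it one or three? #phone",
    "Mayhaps it be three afterall...",
    "So many new issues with phone... #iphone",
]})

# findall returns a list of hashtags per row (an empty list when none match);
# str.join then glues each list into one string, giving "" for empty lists.
tags = df["TEXT"].str.findall(r"#\w+").str.join(" ")

# Filtering out the empty strings afterwards gives the shorter result.
non_empty = tags[tags != ""]
```

Here tags keeps all four rows, while non_empty keeps only the rows at index 0, 1 and 3.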

Reading S3 files in nested directory through Spark EMR

I figured out how to read files into my pyspark shell (and script) from an S3 directory, e.g. by using:
rdd = sc.wholeTextFiles('s3n://bucketname/dir/*')
But, while that's great in letting me read all the files in ONE directory, I want to read every single file from all of the directories.
I don't want to flatten them or load everything at once, because I will have memory issues.
Instead, I need it to automatically go load all the files from each sub-directory in a batched manner. Is that possible?
Here's my directory structure:
S3_bucket_name -> year (2016 or 2017) -> month (max 12 folders) -> day (max 31 folders) -> sub-day folders (max 30; basically just partitioned the collecting each day).
Something like this, except it'll go for all 12 months and up to 31 days...
| |
| |---Month(11)
| | |
| | |---Day(01)
| | | |
| | | |---Sub-folder(01)
| | | |
| | | |---Sub-folder(02)
| | | |
| | |---Day(02)
| | | |
| | | |---Sub-folder(01)
| | | |
| | | |---Sub-folder(02)
| | | |
| |---Month(12)
| |
| |---Month(1)
| | |
| | |---Day(01)
| | | |
| | | |---Sub-folder(01)
| | | |
| | | |---Sub-folder(02)
| | | |
| | |---Day(02)
| | | |
| | | |---Sub-folder(01)
| | | |
| | | |---Sub-folder(02)
| | | |
| |---Month(2)
Each arrow above represents a fork. e.g. I've been collecting data for 2 years, so there are 2 years in the "year" fork. Then for each year, up to 12 months max, and then for each month, up to 31 possible day folders. And in each day, there will be up to 30 folders just because I split it up that way...
I hope that makes sense...
I was looking at another post (read files recursively from sub directories with spark from s3 or local filesystem) where I believe they suggested using wildcards, so something like:
rdd = sc.wholeTextFiles('s3n://bucketname/*/data/*/*')
But the problem with that is it tries to find a common folder among the various subdirectories - in this case there are no guarantees and I would just need everything.
However, on that line of reasoning, I thought what if I did..:
rdd = sc.wholeTextFiles('s3n://bucketname/*/*/*/*/*')
But the issue is that now I get OutOfMemory errors, probably because it's loading everything at once and freaking out.
Ideally, what I would be able to do is this:
Go to the sub-directory level of the day and read those in, so e.g.
First read in 2016/12/01, then 2016/12/02, up until 2016/12/31, and then 2017/01/01, then 2017/01/02, ... 2017/01/31 and so on.
That way, instead of using five wildcards (*) as I did above, I would somehow have it know to look through each sub-directory at the level of "day".
I thought of using a python dictionary to specify the file path to each of the days, but that seems like a rather cumbersome approach. What I mean by that is as follows:
file_dict = {
basically for all the folders, and then iterating through them and loading them in using something like this:
sc.wholeTextFiles('s3n://bucketname/' + file_dict[i])
But I don't want to manually type out all those paths. I hope this made sense...
Another way of asking the question is, how do I read the files from a nested sub-directory structure in a batched way? How can I enumerate all the possible folder names in my s3 bucket in python? Maybe that would help...
The structure of the data in each of my files is as follows:
{json object 1},
{json object 2},
{json object 3},
{json object n},
For it to be "true json", it either just needed to be like the above without a trailing comma at the end, or something like this (note the square brackets and the lack of a final trailing comma):
[
{json object 1},
{json object 2},
{json object 3},
{json object n}
]
The reason I did it entirely in PySpark as a script I submit is because I forced myself to handle this formatting quirk manually. If I use Hive/Athena, I am not sure how to deal with it.
Why don't you use Hive, or even better, Athena? Both deploy tables on top of file systems to give you access to all the data. Then you can pull this into Spark.
Alternatively, I believe you can also use HiveQL in Spark to set up a tempTable on top of your file system location, and it'll register it all as a Hive table that you can execute SQL against. It's been a while since I've done that, but it is definitely doable.
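If you would rather stay in PySpark, the day-level paths do not have to be typed out by hand; they can be generated from a date range. The sketch below makes several assumptions: the helper names day_globs and load_in_batches are invented for illustration, the folders are assumed to be zero-padded as year/month/day (adjust fmt if the bucket uses unpadded months), and sc is an existing SparkContext.

```python
from datetime import date, timedelta


def day_globs(bucket, start, end, fmt="%Y/%m/%d"):
    """Yield one glob per day, covering every sub-day folder and its files."""
    current = start
    while current <= end:
        yield "s3n://{}/{}/*/*".format(bucket, current.strftime(fmt))
        current += timedelta(days=1)


def load_in_batches(sc, globs):
    """Yield one RDD per day so that only one day's files are handled at a time."""
    for pattern in globs:
        yield sc.wholeTextFiles(pattern)


# Pure path generation; no Spark needed to inspect the result.
paths = list(day_globs("bucketname", date(2016, 12, 30), date(2017, 1, 2)))
```

Each RDD from load_in_batches can be processed and written out before the next one is requested. Days that have no folder in the bucket would need to be skipped or handled, since a glob that matches nothing is likely to raise an error when Spark evaluates it.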

use perl to get information using regexs

I have the following output from a command that I run in Perl. However, I am only interested in capturing the components and their respective statuses. (I.e. the "component" and "status" columns)
I've been thinking about how to approach this using regular expressions. I would like to assume that I will not really know the names of the components, as in the future there may be additional components in the column. I don't care about the two middle columns (process-type or pid).
Any suggestions would be helpful.
my $consoleStatus = `opmnctl status 2>&1`;
component | process-type | pid | status
serverpro | logloaderd | N/A | Down
www-daemon | www-daemon | 10000 | Alive
OXQA | oqa | 99894 | Alive
SDFW | OC4X_SECURITY | 27683 | Alive
FTP_Servers | HTTP_Server | 21252 | Alive
OID | OID | 27207 | Alive
DSA | DSA | N/A | Down
Assuming the layout of your output doesn't change, the component name doesn't contain spaces, and the only possible statuses are 'Alive' and 'Down', you could use a regex to match each line.
The code below reads the input from STDIN and prints each component and its status:
while (<STDIN>) {
    # Capture the first column and the final Alive/Down column.
    if ( $_ =~ /^(\S+)\s+\|.+\|\s+(Alive|Down)$/ ) {
        print "$1 -> $2\n";
    }
}
You might be able to use opmnctl options to simplify what the Perl has to process, maybe:
opmnctl status -noheaders -fsep '|' -fmt %cmp%sta
I'd suggest using split, and split on the pipe | characters that delimit the fields.
Here's a short snippet that may give you some ideas. If you can use some opmnctl options, you'd be able to simplify this.
use strict;
use warnings;
use Data::Dumper;
my %component_status;
LINE: for ( split( /\n/, $consoleStatus ) ) {
    # Skip the furniture in the opmnctl output
    next LINE if m/^component\W/ || m/^-/;
    # field 0 is the component, field 3 the status.
    my ( $component, $status ) = (split( / +\| */, $_ ))[0,3];
    $component_status{$component} = $status;
}
warn Dumper( \%component_status );
$VAR1 = {
    'DSA' => 'Down',
    'FTP_Servers' => 'Alive',
    'SDFW' => 'Alive',
    'serverpro' => 'Down',
    'OID' => 'Alive',
    'OXQA' => 'Alive',
    'www-daemon' => 'Alive'
};