How can I deal with Liblinear's output in c++? - c++

I'm trying to get liblinear to work in c++, but the library call to train(problem*, parameter*) is sending output to the terminal. Sometimes it says the optimization finished, other times it seems to be outputting internal state (why?). What does this output mean, and is it possible to suppress it or divert it to a log somewhere? I'm using boost::log in the rest of the program, and I'd like to control what the program displays. I'm running this on Ubuntu 12.10.
example output:
iter 1 act -6.742e-01 pre 1.191e-02 delta 3.443e-02 f 5.940e-02 |g| 1.730e-01 CG 1
cg reaches trust region boundary
iter 1 act -3.040e-02 pre 5.211e-03 delta 8.607e-03 f 5.940e-02 |g| 1.730e-01 CG 1
cg reaches trust region boundary
iter 1 act 5.453e-04 pre 1.442e-03 delta 6.791e-03 f 5.940e-02 |g| 1.730e-01 CG 1
cg reaches trust region boundary
iter 2 act 6.299e-04 pre 5.985e-04 delta 8.812e-03 f 5.886e-02 |g| 2.525e-01 CG 2
cg reaches trust region boundary
iter 3 act 2.610e-04 pre 2.449e-04 delta 1.583e-02 f 5.823e-02 |g| 4.313e-02 CG 2
iter 4 act 1.510e-04 pre 1.585e-04 delta 1.583e-02 f 5.796e-02 |g| 2.927e-02 CG 4
or
..*
optimization finished, #iter = 25
Objective value = -0.332340
nSV = 173
For the train call, my parameters are:
solver_type = L2R_L2LOSS_SVR
eps = 0.001
C = 0.02
nr_weight = 0
weight_label = nullptr
weight = nullptr
p = 0.005
My input data has roughly 10,000 to 100,000 data points, each with 62 features.
Also, the output model has 124 weights. I'm assuming that's 62 weights for the set represented by the positive labels and 62 for the negative labels? How do I know which order they are in? model->label is NULL for my solver_type.

Related

Write results between different text while adapting spaces in Fortran

In a small program, I am trying to write output in which numerical values appear between pieces of text.
For the moment, I do:
! Print results
write(*,*)
write(*,*) ' Time step = ',dt
write(*,*)
write(*,1001) epsilon,step
write(*,*)
write(*,*) ' Problem size = ',size_x*size_y
write(*,*)
write(*,1002) elapsed_time
write(*,*)
write(*,*) ' Computed solution in seq.dat file '
write(*,*)
! Formats available to display the computed values on the grid
1001 format(' Convergence = ',f11.9,' after ',i9,' steps ')
1002 format(' Wall Clock = ',f15.6)
which produces the following when run:
Time step = 0.000003755783907217
Convergence = 0.100000000 after 8882 steps
Problem size = 24576
Wall Clock = 5.213814
Computed solution in Seq.dat
My issue is with the line "Wall Clock = 5.213814": I would like only one space after "Wall Clock =" and before the value "5.213814". I think the extra spaces I currently get come from the "f15.6" in 1002 format(' Wall Clock = ',f15.6).
Here's what I want to get (with another value for steps) :
Time step = 0.000003755783907217
Convergence = 0.100000000 after 20910988821 steps
Problem size = 24576
Wall Clock = 5.213814
Computed solution in Seq.dat
I have set "f15.6" because "Wall Clock" can take large values; the same goes for the epsilon and step variables.
I don't know, in the general case, how to get exactly one space between the text and the values, the way printf in C lets me put different values and words on the same line.
I know there's a simple solution, but I can't find it.
UPDATE 1 :
I tried the solution indicated in the first answer.
Here's what I have done :
write(*,1001) epsilon,step
write(*,1002) elapsed_time
1001 format(' Convergence = ',f0.9,' after ',i9,' steps ')
1002 format(' Wall Clock = ',f0.6)
and I get :
Convergence = .100000000 after 8882 steps
Problem size = 24576
Wall Clock = 2.492813
As you can see, the "Convergence" value is .100000000 instead of 0.100000000 (the leading zero has disappeared).
And what about integer values: can I write "i0" to get as few digits as possible?
Thanks
Modern Fortran compilers understand a field width of 0 to mean "as few characters as possible":
program write_format
use iso_fortran_env, only: real64
implicit none
print 1001, 5.213814
print 1001, 12345678.901234_real64
1001 format("Wall Clock = ", f0.6)
end program write_format
Output:
Wall Clock = 5.213814
Wall Clock = 12345678.901234
Cheers
It's usually frowned upon to update a question after it has been answered in order to ask additional questions, but since they're quite similar, I think it's okay.
Firstly, yes, format I0 means as few digits as necessary, and probably is what you want.
The second part is trickier, it seems to boil down to 'at least that many digits, but more if needed' -- and I don't think there's a format specifier for that (but I might be wrong).
I'd probably cheat and use something like this:
if (epsilon < 10.) then
write(*, 1002) epsilon
else
write(*, 1003) epsilon
end if
1002 format("Convergence = ", f11.9)
1003 format("Convergence = ", f0.9)
But then again, I also found this answer quite intuitive: How to pad FORTRAN floating point output with leading zeros?
Adapted for you, it would mean splitting the floating point number into an integer and the rest, and putting it back together again:
write(*, 1002) int(epsilon), epsilon-int(epsilon)
1002 format("Convergence = ", I0, F0.9)
This is a bit cumbersome, but one way to get the minimum width and preserve the leading zero is to use an internal write like this:
character*30 val
write(val,'(f11.9)')0.1d0
write(*,'(3a,i0,a)')'converge = ',trim(adjustl(val)),' after ',32432,' steps'
converge = 0.100000000 after 32432 steps

Conditional Regex for multiple matches in a line

I've got a regex that is responsible for matching the pattern A:B in lines where you might have multiple matches (i.e. "A:B A: B A : B A:B", etc.) The problem lies in the complexity of what A represents.
I'm using the regex:
\b[\w|\(|\)+]+\s*:(?:(?![\w+]+\s*:).)*
to match items in:
Data_1: Tutor Elementary: 10 a F Test: 7.87 sips
Turning 1 Data (A Run), Data: 0.0 10.0 10.0 17.3 0.0
Turning 2 Data (A Run), Data2: 0.0 6.8 0.0 6.8 6.8
Data_1: Tutor Pool: Data2: A B C
Turning 2 (A Run), ABSOLUTE: 368 337 428 0 2 147
Data_4 : 4AZE Localization : 33.14 lat -86 long
Time: 0.75 Data Scenario: 3121.2
The question is this, if you examine this setup (I use https://regex101.com/), lines 2,3,5 don't return exactly what I'm looking for. Where the match is the first in the line, I want it to grab everything from the beginning of the line to the first ':'. Is this type of conditional regex possible? I've tried every possible way I could imagine, but I haven't been successful yet.
Thanks in advance!
A little complex, but try this:
^(.*?:.*?)(\b\w+\b\s*:.*?)\b\w+\b:.*$|^(.*?:.*?)\b\w+\b\s*:(.*?)$|^(.*)$
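One way to read the requirement in the question (the first match should run from the start of the line to the first ':') is to put a line-anchored alternative in front of the key part of the question's pattern. The sketch below uses Python's re module purely for illustration. The variant pattern and the split_pairs helper are my own assumptions, not the regex from the answer above, and they have only been checked against sample lines from the question. The variant also drops the '|' from the character class, because inside [...] a '|' is a literal character rather than alternation.

```python
import re

# Hypothetical variant of the question's pattern.
# First alternative: at the start of the line, take everything up to the first ':'.
# Second alternative: the question's "single word" key, used for later matches.
PATTERN = re.compile(r"(?:^[^:]+|\b[\w()+]+)\s*:(?:(?![\w+]+\s*:).)*")

SAMPLE_LINES = [
    "Data_1: Tutor Elementary: 10 a F Test: 7.87 sips",
    "Turning 1 Data (A Run), Data: 0.0 10.0 10.0 17.3 0.0",
    "Turning 2 (A Run), ABSOLUTE: 368 337 428 0 2 147",
]


def split_pairs(line):
    """Return every 'key: value' chunk found in one line, trimmed of outer spaces."""
    return [match.strip() for match in PATTERN.findall(line)]


for line in SAMPLE_LINES:
    print(split_pairs(line))
```

With this variant, the second and third sample lines each come back as a single chunk that starts at the beginning of the line, while the first line is still split into three chunks.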

How can we use clustering results in weka ?

I am using Weka for my internship, but I have little knowledge of data mining. Does anyone know how I can apply the following results to my data sets to get all the data grouped by cluster? The method I use now is to compute the distance between my attributes and the mean value of each cluster and then assign each instance to the nearest one, but this method is too rough for me.
=== Run information ===
Scheme:weka.clusterers.EM -I 100 -N -1 -M 1.0E-6 -S 100
Relation: wcet_cluster6 - Copie-weka.filters.unsupervised.attribute.Remove-R1-3,5-weka.filters.unsupervised.attribute.Remove-R5-12
Instances: 467
Attributes: 4
max
alt
stmt
bb
Test mode:evaluate on training data
=== Model and evaluation on training set ===
EM
Number of clusters selected by cross validation: 6
Cluster
Attribute 0 1 2 3 4 5
(0.28) (0.11) (0.25) (0.16) (0.04) (0.17)
==================================================================
max
mean 9.0148 10.9112 11.2826 10.4329 11.2039 10.0546
std. dev. 1.8418 2.7775 3.0263 2.5743 2.2014 2.4614
alt
mean 0.0003 19.6467 0.4867 2.4565 44.191 8.0635
std. dev. 0.0175 5.7685 0.5034 1.3647 10.4761 3.3021
stmt
mean 0.7295 77.0348 3.2439 12.3971 140.9367 33.9686
std. dev. 1.0174 21.5897 2.3642 5.1584 34.8366 11.5868
bb
mean 0.4362 53.9947 1.4895 7.2547 114.7113 22.2687
std. dev. 0.5153 13.1614 0.9276 3.5122 28.0919 7.6968
Time taken to build model (full training data) : 4.24 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 163 ( 35%)
1 50 ( 11%)
2 85 ( 18%)
3 73 ( 16%)
4 18 ( 4%)
5 78 ( 17%)
Log likelihood: -9.09081
Thanks for your help!!
I think no-one can really answer this. Some tips off the top of my head.
You have used the EM clustering algorithm (see the animated GIF on its Wikipedia page). From the synopsis in Weka's documentation:
"EM assigns a probability distribution to each instance which
indicates the probability of it belonging to each of the clusters. "
Is this complex output really what you want?
It also selects a number of clusters for you (unless you constrain that number).
In Weka 3.7 you can use the unsupervised attribute filter "ClusterMembership" in the Preprocess dialog to replace your dataset with the result of the cluster assignments. You need to select one reference attribute, though; by default it selects the last one. This creates hard-to-interpret output.
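If the goal is simply to assign each instance to a cluster, the printed model already contains what is needed: a prior weight per cluster plus a mean and standard deviation per attribute. The sketch below is plain Python, not Weka's API. It assumes that EM treats each numeric attribute as an independent Gaussian within a cluster, and the names CLUSTERS, membership and hard_assign are made up for this illustration. The numbers are copied from the output in the question.

```python
import math

# Prior weight, then (mean, std dev) for max, alt, stmt, bb, copied from the EM output.
CLUSTERS = [
    (0.28, [(9.0148, 1.8418), (0.0003, 0.0175), (0.7295, 1.0174), (0.4362, 0.5153)]),
    (0.11, [(10.9112, 2.7775), (19.6467, 5.7685), (77.0348, 21.5897), (53.9947, 13.1614)]),
    (0.25, [(11.2826, 3.0263), (0.4867, 0.5034), (3.2439, 2.3642), (1.4895, 0.9276)]),
    (0.16, [(10.4329, 2.5743), (2.4565, 1.3647), (12.3971, 5.1584), (7.2547, 3.5122)]),
    (0.04, [(11.2039, 2.2014), (44.191, 10.4761), (140.9367, 34.8366), (114.7113, 28.0919)]),
    (0.17, [(10.0546, 2.4614), (8.0635, 3.3021), (33.9686, 11.5868), (22.2687, 7.6968)]),
]


def log_normal_pdf(x, mean, std):
    """Log of the Gaussian density; logs avoid underflow for very unlikely values."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))


def membership(instance):
    """Probability of each cluster for one instance given as (max, alt, stmt, bb)."""
    logs = [
        math.log(prior) + sum(log_normal_pdf(x, m, s) for x, (m, s) in zip(instance, params))
        for prior, params in CLUSTERS
    ]
    top = max(logs)  # subtract the largest log before exponentiating
    weights = [math.exp(value - top) for value in logs]
    total = sum(weights)
    return [w / total for w in weights]


def hard_assign(instance):
    """Index of the most probable cluster."""
    probs = membership(instance)
    return max(range(len(probs)), key=probs.__getitem__)


print(hard_assign((9.0, 0.0, 0.7, 0.4)))
print(hard_assign((11.2, 44.2, 140.9, 114.7)))
```

Compared with the nearest-mean approach from the question, this also takes each cluster's spread and prior weight into account, which matters here because the standard deviations differ a lot between clusters.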

Calculating the distance between characters

Problem: I have a large number of scanned documents that are linked to the wrong records in a database. Each image has the correct ID on it somewhere that says where it belongs in the db.
I.E. A DB row could be:
| user_id | img_id | img_loc |
| 1 | 1 | /img.jpg|
img.jpg would have the user_id (1) on the image somewhere.
Method/Solution: Loop through the database. Pull the image text in to a variable with OCR and check if user_id is found anywhere in the variable. If not, flag the record/image in a log, if so do nothing and move on.
My example is simple, in the real world I have a guarantee that user_id wouldn't accidentally show up on the wrong form (it is of a specific format that has its own significance)
Right now it is working. However, it is incredibly strict. If you've worked with OCR you understand how fickle it can be. Sometimes a 7 = 1 or a 9 = 7, etc. The result is a large number of false positives, especially among images with low-quality scans.
I've addressed some of the image quality issues with some processing on my side (increasing the image size, adjusting the black/white threshold) and had satisfying results. I'd like to add the ability for the program to recognize, for example, that "81723103" is not very far from "81923103".
The only way I know how to do that is to check strings whose length is greater than or equal to the length of what I'm looking for, calculate the distance between each pair of characters, compute the average, and set a limit on what counts as a good average.
Some examples:
Ex 1
81723103 - Looking for this
81923103 - Found this
--------
00200000 - distances between characters
0 + 0 + 2 + 0 + 0 + 0 + 0 + 0 = 2
2/8 = .25 (pretty good match. 0 = perfect)
Ex 2
81723103 - Looking
81158988 - Found
--------
00635885 - distances
0 + 0 + 6 + 3 + 5 + 8 + 8 + 5 = 35
35/8 = 4.375 (Not a very good match. 9 = worst)
This way I can tell it "Flag the bottom 30% only" and dump anything with an average distance > 6.
I figure I'm reinventing the wheel and wanted to share this for feedback. I see a huge increase in run time and a performance hit doing all these string operations over what I'm currently doing.
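The scoring described above can be sketched in a few lines. This is a minimal illustration, assuming the IDs consist only of ASCII digits and that the OCR text is available as one string; the function names digit_distance and best_window are made up for the example.

```python
DIGITS = "0123456789"


def digit_distance(expected, candidate):
    """Average absolute per-digit difference: 0 is a perfect match, 9 is the worst."""
    if len(expected) != len(candidate):
        raise ValueError("strings must have equal length")
    total = sum(abs(int(a) - int(b)) for a, b in zip(expected, candidate))
    return total / len(expected)


def best_window(expected, ocr_text):
    """Slide over the OCR text and return (score, substring) for the closest
    all-digit window of the same length as expected, or None if there is none."""
    n = len(expected)
    best = None
    for start in range(len(ocr_text) - n + 1):
        window = ocr_text[start:start + n]
        if any(ch not in DIGITS for ch in window):
            continue  # skip windows containing non-digit characters
        score = digit_distance(expected, window)
        if best is None or score < best[0]:
            best = (score, window)
    return best


print(digit_distance("81723103", "81923103"))
print(digit_distance("81723103", "81158988"))
print(best_window("81723103", "ID: 81923103 page 2"))
```

One design caveat: numeric difference treats a 7 read as a 1 as a large error (distance 6), even though the question lists that as a typical OCR confusion. Counting the positions that differ (Hamming distance) may fit that kind of error better.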

Fortran95 -- Reading from a formatted text file

I need to read some values from a table. These are the first five rows, to give you some idea of what it should look like:
1 + 3 98 96 1
2 + 337 2799 2463 1
3 + 2801 3733 933 1
4 + 3734 5020 1287 1
5 + 5234 5530 297 1
My interest is in the first four columns of each row. I need to read these into arrays. I used the following code:
program ----
implicit none
integer, parameter :: totbases = 4639675, totgenes = 4395
integer :: codtot, ks
integer, dimension(totgenes) :: ngene, lend, rend
character :: genome*4639675, sign*4
open(1,file='e_coli_g_info')
open(2,file='e_coli_g_str')
do ks = 1, totgenes
read(1,100) ngene(ks),sign(ks:ks),lend(ks), rend(ks)
end do
100 format(1x,i4,8x,a1, 2(5x,i7), 22x)
do ks = 1, 100
write(*,*) ngene(ks), sign(ks:ks),lend(ks), rend(ks)
end do
end program
The loop at the end of the program is to print the first hundred entries to test that they are being read correctly. The problem is that I am getting this garbage (the fourth column is the problem):
1 + 3 757934891
2 + 337 724249387
3 + 2801 757803819
4 + 3734 757803819
5 + 5234 757935405
Clearly, the fourth column is way off. In fact, I cannot find these values anywhere in the file that I am reading from. I am using the gfortran compiler for Ubuntu 12.04. I would greatly appreciate if somebody would point me in the right direction. I'm sure it's likely that I'm missing something very obvious because I'm new at Fortran.
Fortran formats are (traditionally, there's some newer stuff that I won't go into here) fixed format, that is, they are best suited for file formats with fixed columns. I.e. column N always starts at character position M, no ifs or buts. If your file format is more "free format"-like, that is, columns are separated by whitespace, it's often easier and more robust to read data using list formatting. That is, try to do your read loop as
do ks = 1, totgenes
read(1, *) ngene(ks), sign(ks:ks), lend(ks), rend(ks)
end do
Also, as general advice, when opening your own files, start from unit 10 and go upwards from there. Fortran implementations typically use some of the low-numbered units for standard input, output, and error (a common choice is units 0, 5, and 6). You probably don't want to redirect those.
PS 2: I haven't tried your code, but it seems that you have a bounds overflow in the sign variable. It's declared with length 4, but then you assign to index ks, which goes all the way up to totgenes. As you're using gfortran on Ubuntu 12.04 (that is, gfortran 4.6), compile with the options "-O1 -Wall -g -fcheck=all" during development.
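The garbage in the fourth column is consistent with that overflow. Assuming 32-bit little-endian integers (typical for gfortran on x86), a quick check in Python, used here only as a calculator, shows that each of the printed values is made of four ASCII '+' or '-' bytes. That suggests sign characters written past the end of the 4-character sign variable ended up in memory belonging to rend. The helper name as_ascii is made up for this illustration.

```python
# Fourth-column values copied from the output in the question.
GARBAGE = [757934891, 724249387, 757803819, 757803819, 757935405]


def as_ascii(value):
    """Reinterpret a 32-bit integer as four ASCII characters (little-endian assumed)."""
    return value.to_bytes(4, "little").decode("ascii")


for value in GARBAGE:
    print(value, hex(value), as_ascii(value))
```

Every value decodes to a mix of '+' and '-' only, for example 757935405 is 0x2d2d2d2d, that is, "----".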