sed to remove certain spaces in the middle

I have a text file like this:
    6.2341   -0.4024   -2.0936 Cl 0 0 0 0 0 0 0 0 0 0 0 0
    0.1148   -3.7525    1.0392 S 0 0 0 0 0 0 0 0 0 0 0 0
   -2.5441   -0.8745    1.3714 F 0 0 0 0 0 0 0 0 0 0 0 0
The format is: columns 1 to 10, 11 to 20, 21 to 30 are x,y,z coordinates in (10.4) format, i.e. length=10, 4 digits after the decimal point; column 31 is always a space; columns 32 to 32 are the atom type; the remaining columns are not important.
However, for some unknown reason, the atom type field is right-shifted by two columns, like this:
    6.2341   -0.4024   -2.0936   Cl 0 0 0 0 0 0 0 0 0 0 0 0
    0.1148   -3.7525    1.0392   S 0 0 0 0 0 0 0 0 0 0 0 0
   -2.5441   -0.8745    1.3714   F 0 0 0 0 0 0 0 0 0 0 0 0
How can I use sed and a regular expression to match these lines and delete the two extra spaces?

sed -r 's/(.{30})  /\1/' will do the trick.
Group the first 30 characters, match the two additional spaces, and replace the whole match with the grouped characters.
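For comparison, the same substitution can be sketched in Python. This is only an illustration under the column layout described in the question; the function name `unshift_atom_field` and the sample line are made up for the example.

```python
import re

# Keep the first 30 characters (three 10-wide coordinate fields)
# and drop the two extra spaces that follow them.
SHIFTED = re.compile(r"^(.{30})  ")

def unshift_atom_field(line):
    """Remove the two surplus spaces after column 30, if they are present."""
    return SHIFTED.sub(r"\1", line, count=1)

# Build a sample line with the (10.4) coordinate format from the question.
coords = "%10.4f%10.4f%10.4f" % (6.2341, -0.4024, -2.0936)
shifted_line = coords + "   Cl 0 0 0"   # atom type shifted right by two columns
fixed_line = unshift_atom_field(shifted_line)
print(fixed_line)
```

A line that is already in the correct format has only one space after column 30, so the pattern does not match and the line is returned unchanged.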

If you don't need to use sed or regular expressions, you can just use cut to remove the 2 offending characters:
$ cut --complement -c31,32 file
    6.2341   -0.4024   -2.0936 Cl 0 0 0 0 0 0 0 0 0 0 0 0
    0.1148   -3.7525    1.0392 S 0 0 0 0 0 0 0 0 0 0 0 0
   -2.5441   -0.8745    1.3714 F 0 0 0 0 0 0 0 0 0 0 0 0
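The cut approach amounts to deleting character positions 31 and 32 from every line. A rough Python equivalent using slicing is sketched below; `drop_columns_31_32` is a hypothetical helper name.

```python
def drop_columns_31_32(line):
    """Delete the characters in columns 31 and 32 (1-based), like cut --complement -c31,32."""
    return line[:30] + line[32:]

coords = "%10.4f%10.4f%10.4f" % (0.1148, -3.7525, 1.0392)
shifted_line = coords + "   S 0 0 0"
print(drop_columns_31_32(shifted_line))
```

Unlike the regular expression, this removes the two characters unconditionally, so it should only be applied to lines that are known to be shifted.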

Related

Notepad++ regex - search only in certain column all numbers between two numbers?

I now have 250 million lines of text from a database.
I want to highlight only certain values, and only where they appear in the third column.
I use \b1011(3[1-9]\d[1-9]|[4]\d\d\d|5[0-8][0-3][0-6])\b to highlight all values between 10113101 and 10115836.
Can the numbers in the other columns (column 5, for example) be excluded?
Edit: by a column I mean the text between the spaces
1 2 3 4 5 ..... columns
307607 1317011864 10113101 -25 13135611 2700 0 0 0 12 0 0 0 walk029h.rwx
2264 910115836 10114632 -15 20111192 900 0 0 0 11 0 0 0 walk029.rwx
326169 1010523891 10115836 -1 20911192 0 0 0 0 11 0 0 0 walk12h.rwx
38718 826265392 10113628 0 10114603 2700 0 0 0 11 0 0 0 street2.rwx
241512 1317011864 636346 0 10113987 900 0 0 0 12 0 0 0 walk029h.rwx
38718 826266129 10113448 0 10114310 900 0 0 0 10 0 0 0 tree5m.rwx
38718 826266243 10113898 0 10114810 900 0 0 0 10 0 0 0 tree9m.rwx

This pattern will capture the numbers you want in the third column only. Refer to capture group 1 for their values.
^(?:\S+\s){2}\b(1011(?:3[1-9]\d{2}|4\d{3}|5[0-8][0-3][0-6]))\b.*
All I did was modify yours to add the prefix and remove some redundancy.
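Outside Notepad++, the same "third column only" idea can be sketched in Python. The sketch below is an assumption-laden illustration, not the answerer's method: it keeps the `^(?:\S+\s+){2}` prefix to skip two columns, but checks the range numerically, because a digit-class pattern such as `5[0-8][0-3][0-6]` rejects some values inside the range (10115040, for instance). The names `THIRD_COLUMN` and `third_column_in_range` are hypothetical.

```python
import re

LOW, HIGH = 10113101, 10115836

# Skip two "non-space run + whitespace" columns, then capture the third column.
THIRD_COLUMN = re.compile(r"^(?:\S+\s+){2}(\d+)\b")

def third_column_in_range(line):
    """Return the third-column value if it lies in [LOW, HIGH], else None."""
    match = THIRD_COLUMN.match(line)
    if match is None:
        return None
    value = int(match.group(1))
    return value if LOW <= value <= HIGH else None

lines = [
    "307607 1317011864 10113101 -25 13135611 2700 0 0 0 12 0 0 0 walk029h.rwx",
    "241512 1317011864 636346 0 10113987 900 0 0 0 12 0 0 0 walk029h.rwx",
]
hits = [third_column_in_range(line) for line in lines]
print(hits)
```

The second sample line has an in-range number in column 5 but not in column 3, so it is not reported.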

False Acceptance Rate and False Rejection Rate calculation using an n*n confusion matrix

FAR and FRR are used to express the results of biometric devices. Below is the confusion matrix that Weka produced from biometric data. I couldn't find any resources explaining how to calculate FAR and FRR from an n*n confusion matrix. Any help explaining the procedure would be greatly appreciated. Thanks in advance!
Weka also gives these values: TP Rate, FP Rate, Precision, Recall, F-Measure and ROC Area. Please suggest whether the required values can be calculated from these.
=== Confusion Matrix ===
a b c d e f g h i j k l m n o <-- classified as
1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 | a = user1
0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 | b = user2
0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 | c = user3
0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 | d = user4
0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 | e = user5
0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 | f = user6
0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 | g = user7
0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 | h = user9
1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 | i = user10
0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 | j = user11
0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 | k = user14
0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 | l = user15
0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 | m = user16
0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 | n = user17
0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 | o = user19

The accepted answer here by user "chl" references the biometrics literature: https://stats.stackexchange.com/questions/3489/calculating-false-acceptance-rate-for-a-gaussian-distribution-of-scores .
He says,
[the ROC curve] is a plot of (TAR=1-FRR, the false rejection rate) against false acceptance rate (FAR).
However, the ROC curve is commonly plotted as TP Rate as a function of False Positive Rate (FP Rate).
So it seems you can use TP Rate and FP Rate: FAR corresponds to FP Rate, and FRR to 1 - TP Rate.
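As a rough illustration, the per-class rates can be computed directly from the confusion matrix in the question. The sketch assumes, following the reasoning above, that FAR is taken as the per-class FP Rate and FRR as 1 - TP Rate (one-vs-rest); the function name `far_frr` is made up for the example.

```python
# Confusion matrix from the question: rows = true user, columns = classified as.
CONFUSION = [
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],  # a = user1
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],  # b = user2
    [0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # c = user3
    [0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # d = user4
    [0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # e = user5
    [0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # f = user6
    [0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # g = user7
    [0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0],  # h = user9
    [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],  # i = user10
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0],  # j = user11
    [0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],  # k = user14
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0],  # l = user15
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0],  # m = user16
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0],  # n = user17
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2],  # o = user19
]

def far_frr(matrix):
    """Return per-class (FAR, FRR) lists, treating each class one-vs-rest."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    far, frr = [], []
    for k in range(n):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                       # genuine samples rejected
        fp = sum(matrix[r][k] for r in range(n)) - tp  # other users accepted as k
        tn = total - tp - fn - fp
        far.append(fp / (fp + tn))  # FP Rate
        frr.append(fn / (fn + tp))  # 1 - TP Rate
    return far, frr

far, frr = far_frr(CONFUSION)
print(far[0], frr[0])
```

For user1, one of its two samples is rejected (FRR = 1/2) and one of the 28 samples from other users is accepted as user1 (FAR = 1/28). Averaging the per-class values gives a single overall figure, if one is needed.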

Replace space with semicolon when more than one with regex

I have been trying for about 2 hours, and I'm not sure whether what I want to do is even possible.
I have a large file with some data that looks like
43034452 LONGSHIRTPAIETTE 17.30
27.90
0110
COLOR : : : : :
: : :
-11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
43034453 LONG SHIRT PAI ETTE 16.40
25.90
0110
COLOR : : : : :
: : :
-3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
43034454 BASIC 4.99
8.90
0110
COLOR : : : : :
: : :
-5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
(The file has 36k rows.)
What I want to do is to get this whole thing clean.
In the end, the rows should look like
43034452;LONGSHIRTPAIETTE;17.30;27.90;0110
43034453;LONG SHIRT PAI ETTE;16.40;25.90;0110
43034454;BASIC;4.99;8.90;0110
So there is a lot of data that I don't need. I'm using Notepad++ to do my regex.
My regex string looks like ([0-9]*)\s{6,}([A-Z]*)\s*([0-9\.]*)\s*([0-9\.]*)\s*([0-9]*) at the moment.
This brings me the first number followed by 6 spaces. (It has to be like this because some rows start with FF and FF are not letters. It's some kind of sign that I can't identify but if I let Notepad++ show all signs I see FF.)
So as a result I get
\1: 43034452
\2: LONGSHIRTPAIETTE
\3: 17.30
\4: 27.90
\5: 0110
as expected, but on the next row it stops at the space. If I add \s to the pattern, then it also selects all spaces after the word part. And I obviously can't say "only one space", can I?
So my question is, can I use regex to get a selection like the one I want?
If so, what am I doing wrong?

Try this:
([0-9]+)\s{6,}((?:[A-Z]+\ )+)\s*([0-9\.]+)\s+([0-9\.]+)\s+([0-9]+)
Note a few things:
I tightened the * quantifiers to + where appropriate, so that each column must contain at least one character and the gaps must contain actual whitespace.
I used a non-capturing group to repeat one or more instances of a word followed by a space.

Use the below regex
([0-9]*)\s{6,}([A-Z]+(?:\s+[A-Z]+)*)\s*([0-9\.]*)\s*([0-9\.]*)\s*([0-9]*).*?(?=\n\S|$)
and then replace the match with \1;\2;\3;\4;\5
Don't forget to enable the DOTALL modifier s.

Your approach is correct; just replace * with + (one or more) in your regex.
/([0-9]+)\s{6,}([A-Z ]+)\s+([0-9\.]+)\s+([0-9\.]+)\s+([0-9]+)/g
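The patterns suggested above can be tried out with a short Python sketch. The exact spacing in the original file is not recoverable from the question, so the sample text below is an assumption (six or more spaces after the article number, continuation lines indented); `RECORD` and `to_csv_rows` are hypothetical names.

```python
import re

RECORD = re.compile(
    r"(\d+)\s{6,}"              # article number, then the wide gap
    r"([A-Z]+(?: [A-Z]+)*)\s+"  # name: words separated by single spaces
    r"([\d.]+)\s+([\d.]+)\s+"   # two prices, possibly on separate lines
    r"(\d+)"                    # trailing code such as 0110
)

def to_csv_rows(text):
    """Return one semicolon-separated row per matched record."""
    return [";".join(match.groups()) for match in RECORD.finditer(text)]

# Assumed layout; only the field values come from the question.
sample = (
    "43034452      LONGSHIRTPAIETTE      17.30\n"
    "      27.90\n"
    "      0110\n"
    "COLOR : : : : :\n"
    ": : :\n"
    "-11 0 0 0 0\n"
    "43034453      LONG SHIRT PAI ETTE      16.40\n"
    "      25.90\n"
    "      0110\n"
)
rows = to_csv_rows(sample)
print("\n".join(rows))
```

Writing the name as words joined by single spaces keeps multi-word names intact without swallowing the run of spaces that follows them, which is the "only one space" behaviour the question asks about.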

Matlab: how to work with sparse keys to access sparse data?

I am trying to access the sparse vector mlf using keys such as BEpos and BEneg, with one key per row. The problem is that most commands are not meant to deal with such large input: bin2dec requires clean binary strings without spaces, but the regexp hack fails with too many rows, and so on.
How to work with sparse keys to access sparse data?
Example
K>> mlf=sparse([],[],[],2^31,1);
BEpos=Cg(pos,:)
BEpos =
(1,1) 1
(2,3) 1
(2,4) 1
K>> mlf(bin2dec(num2str(BEpos)))=1
Error using bin2dec (line 36)
Binary string must be 52 bits or less.
K>> num2str(BEpos)
ans =
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
K>> bin2dec(num2str('1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0'))
Error using bin2dec (line 36)
Binary string must be 52 bits or less.
K>> regexprep(num2str(BEpos),'[^\w'']','')
Error using regexprep
The 'STRING' input must be a one-dimensional array
of char or cell arrays of strings.
Manually works
K>> mlf(bin2dec('1000000000000000000000000000000'))
ans =
All zero sparse: 1-by-1

Consider a different approach using manual binary to decimal conversions:
pows = pow2(size(BEpos,2)-1 : -1 : 0);
inds = uint32(BEpos*pows.')
I haven't benchmarked this, but it might work faster than bin2dec and cell arrays.
How it works
This is pretty simple: the powers of 2 are calculated and stored in pows (assuming the MSB is in the leftmost position). Then they are multiplied by the bits in the matching positions and summed to produce the corresponding decimal values.
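The same conversion can be sketched in plain Python to check the arithmetic. This is an illustration only, assuming 31-bit rows with the most significant bit on the left as in the question; `rows_to_indices` is a made-up name.

```python
def rows_to_indices(bit_rows):
    """Convert each row of 0/1 bits (MSB first) to its decimal value."""
    width = len(bit_rows[0])
    # Powers of two from 2**(width-1) down to 2**0, matching pows in the answer.
    pows = [2 ** p for p in range(width - 1, -1, -1)]
    return [sum(bit * power for bit, power in zip(row, pows)) for row in bit_rows]

# The two rows of BEpos from the question: (1,1), and (2,3) with (2,4).
row1 = [1] + [0] * 30
row2 = [0, 0, 1, 1] + [0] * 27
indices = rows_to_indices([row1, row2])
print(indices)
```

The first row maps to 2^30 and the second to 2^28 + 2^27, both of which fit in a 32-bit unsigned integer.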

Try to index with this:
inds = uint32( bin2dec(cellstr(num2str(BEpos,'%d'))) );

Form textarea behavior for carriage return / new line

I'm working on a project with Symfony2, Doctrine2 and the Sonata Admin bundle.
When I edit a field of an entity in a textarea, it is saved in a Postgres database with \r carriage returns. This is a problem for me because I use this field with Java applets, and the format must be correct.
Here is an example that satisfies me:
-ISIS- 02210509232D
42 45 0 0 0 0 0 0 0 0 0
1.6288 -9.1522 -0.6175 O 0 0 0 0 0 0 0 0 0
-0.0179 1.3546 0.0193 O 0 0 0 0 0 0 0 0 0
0.4111 -6.3640 1.1717 C 0 0 0 0 0 0 0 0 0
1.3655 -6.8310 0.0373 C 0 0 0 0 0 0 0 0 0
And here is what I get when editing the field of another one:
Mrv0541 01081315253D \r
\r
9 8 0 0 0 0 999 V2000\r
-0.0166 1.3706 0.0096 N 0 0 0 0 0 0 0 0 0 0 0 0\r
-1.1791 -0.7076 0.0095 N 0 0 0 0 0 0 0 0 0 0 0 0\r
1.1396 -0.6403 -0.0123 N 0 0 0 0 0 0 0 0 0 0 0 0\r
How can I be sure that the field I edit in a textarea will be in the correct format? (I don't want these \r.)
EDIT:
I checked with JavaScript on the form page and there is no carriage return at all in the text. I disabled trimming for the field in the form.
Where can I check whether it is Symfony2 or Doctrine2 that causes this entity field to be saved with carriage returns in place of newline characters?
