format text using regex

I have the following text
1
0
0
0
0
0
ASET LANCAR
Neraca
1
1
0
0
0
0
KAS DAN SETARA KAS
Neraca
1
1
1
0
0
0
Kas
 
Buku Besar
Using regex, how can I turn that text into something like this:
100000,ASET LANCAR,Neraca
110000,KAS DAN SETARA KAS,Neraca
111000,Kas,Buku Besar
In other words, I want to turn the original string into comma-separated values (CSV). Honestly, I have no idea what the regex would look like.

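You will need at least two steps to achieve this.
First, replace (?<=\d)\R(?=\d)|(\s){2,} with \1. The first alternative removes a line break that sits between two digits (group 1 is empty in that case), and the second collapses a run of two or more whitespace characters to the last whitespace character in the run. You will get the following text: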
100000
ASET LANCAR
Neraca
110000
KAS DAN SETARA KAS
Neraca
111000
Kas
Buku Besar
Once you have this text, replace (?<=\w)\R(?=[a-zA-Z]) with a comma (,). This joins a line to the next one only when the next line starts with a letter, and gives you the desired text:
100000,ASET LANCAR,Neraca
110000,KAS DAN SETARA KAS,Neraca
111000,Kas,Buku Besar
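The two replacements can also be tried outside an editor. The following is a minimal Python sketch, not part of the original answer. It assumes Unix line endings, because Python's re module does not support \R, so \n is used instead; the function name to_csv and the variable raw are made up for illustration. It also relies on Python 3.5+ substituting an empty string for an unmatched group in \1.

```python
import re

# Sample input from the question; the line containing a single space is kept.
raw = (
    "1\n0\n0\n0\n0\n0\nASET LANCAR\nNeraca\n"
    "1\n1\n0\n0\n0\n0\nKAS DAN SETARA KAS\nNeraca\n"
    "1\n1\n1\n0\n0\n0\nKas\n \nBuku Besar"
)


def to_csv(text):
    # Step 1: drop newlines between digits (group 1 is unmatched, so \1 is
    # empty) and collapse whitespace runs to their last whitespace character.
    step1 = re.sub(r"(?<=\d)\n(?=\d)|(\s){2,}", r"\1", text)
    # Step 2: turn a newline into a comma when the next line starts with a letter.
    return re.sub(r"(?<=\w)\n(?=[a-zA-Z])", ",", step1)


print(to_csv(raw))
```

With the sample above this prints the three CSV lines shown in the answer.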

Related

TCL: How the regex for every line should look like?

In TCL, in output I have something like this:
ABBAA 1 BAABA 1 DNS3 0 0 200 300 400 500 0 0
ABBAA 1 BAABA 1 DNS1 0 0 200 300 400 500 0 0
ABBAA 1 BAABA 1 DNS7 0 0 200 300 400 500 0 0
ABBAB 1 BAABB 1 DNS5 0 0 200 300 400 500 0 0
ABBAB 1 BAABB 1 DNS3 0 0 200 300 400 500 0 0
I would like to sort this table-like dataset by the fourth column, ascending (so the first one will be the row with DNS1UP1, then DNS2UP2, etc.). I figured that regexp would be the easiest method, by looking for a string with "DNS.." in it. But my method doesn't work exactly how I thought, because it matches only one line or no line at all.
My method:
regexp "ABB.*DNS1.*?\N"
ABB - match beginning of new line
.* - every character between ABB and DNS..
DNS1 - match the main looking for word
.* - every character between DNS... and new line symbol
?\n - non-greedy occurrence of new line
Where am I wrong?

If you have a list of lines in such a regular format, you can just lsort them… with the right options. In particular, -dictionary is good for mixed text/numbers and -index 4 lets you choose the column to sort by.
set sortedLines [lsort -index 4 -dictionary $unsortedLines]
The only possible reasonable use of regexp in this would have been in preparing the data for the sort, but that string which you provided is already sortable (assuming you've done a split $data "\n" on it to actually convert it into a list of lines and are not just using a big ol' string).
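For readers who do not use Tcl, the same idea (split into lines, then sort on one column) can be sketched in Python. This is only a rough analogue: natural_key is a hypothetical helper that approximates what -dictionary does for names like DNS3, and it is not a full reimplementation of Tcl's dictionary ordering.

```python
import re

data = """ABBAA 1 BAABA 1 DNS3 0 0 200 300 400 500 0 0
ABBAA 1 BAABA 1 DNS1 0 0 200 300 400 500 0 0
ABBAA 1 BAABA 1 DNS7 0 0 200 300 400 500 0 0
ABBAB 1 BAABB 1 DNS5 0 0 200 300 400 500 0 0
ABBAB 1 BAABB 1 DNS3 0 0 200 300 400 500 0 0"""


def natural_key(text):
    # Split into digit and non-digit runs so that "DNS10" sorts after "DNS9".
    # Assumes all keys share the same text/number layout, as in the sample.
    return [int(tok) if tok.isdigit() else tok.lower()
            for tok in re.findall(r"\d+|\D+", text)]


def sort_by_column(text, index=4):
    # Index 4 is the fifth whitespace-separated field, matching -index 4.
    lines = [ln for ln in text.split("\n") if ln.strip()]
    return sorted(lines, key=lambda ln: natural_key(ln.split()[index]))


sorted_lines = sort_by_column(data)
print("\n".join(sorted_lines))
```

Python's sort is stable, so the two DNS3 rows keep their original relative order.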

Notepad++ search combination in lines

I am looking for a specific combination in a txt file that contains multiple lines (Notepad ++). The structure of a line I am looking for is as follows:
xxxxxx N N -1 -1 -1 N (end line)
So I first have an identifier of 6 or more characters, followed by 6 numbers (N) spaced by a tab. N can be values 1, 0 or -1.
I am looking for those lines that contain '-1' in position 3, 4 and 5. The other positions can take any of the 3 values.
I have searched online and applied searches such as:
\t-?\t-?\t-1\t-1\t-1\t-?
\t?.\t?.\t-1\t-1\t-1\t?.
t?.\t?.\t-1\t-1\t-1\t?.\n
\t-1\t-1\t-1\t?.\n
Yet, the last N in the line is not taken into account, so that if its value is 0 for example, that line will not be selected.
What is the way to write this search? I understand Notepad ++ is written in C++.

Can you try this pattern?
^([a-zA-Z0-9]{6,})\s*(-1|0|1)\s*(-1|0|1)\s*((-1\s*?){3})\s*(-1|0|1)\s?
https://regex101.com/r/yM5xD3/2
Explanation:
^: Start of the line.
([a-zA-Z0-9]{6,}): An alphanumeric character, six or more times.
\s*: Space/tab/newline, zero or more times.
(-1|0|1): One of those numbers.
\s*: ...
(-1|0|1): One of those numbers.
((-1\s*?){3}): -1 followed by zero or more space/tab/newline characters, three times in a row. (The '?' makes \s* lazy, so it consumes as little whitespace as possible.)
\s*: ...
(-1|0|1): ...
And the last \s?: Zero or one space/tab/newline character.

You can try the following regex:
^[a-zA-Z0-9]+\t(-1|0|1)\t(-1|0|1)\t[\-][1]\t[\-][1]\t[\-][1]\t(-1|0|1)$
I tried on the following sample and it worked for me.
xxxxxx 1 1 -1 -1 -1 1
xxxxxx 0 1 -1 -1 -1 0
test12 -1 1 -1 1 -1 0
xxxxxx 1 1 -1 -1 -1 0
test13 0 1 -1 -1 1 -1
Hope it helps.
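To check a tab-anchored pattern like this one outside Notepad++, here is a small Python sketch. It assumes the real file separates fields with single tabs (the sample rows above are shown with spaces, so the sketch rebuilds them with tabs), and it uses {6,} for the identifier as described in the question. The names LINE_RE, rows and matched are illustrative only.

```python
import re

# ^ and $ anchor each line because of re.MULTILINE.
LINE_RE = re.compile(
    r"^[a-zA-Z0-9]{6,}\t(?:-1|0|1)\t(?:-1|0|1)\t-1\t-1\t-1\t(?:-1|0|1)$",
    re.MULTILINE,
)

rows = [
    "xxxxxx 1 1 -1 -1 -1 1",
    "xxxxxx 0 1 -1 -1 -1 0",
    "test12 -1 1 -1 1 -1 0",
    "xxxxxx 1 1 -1 -1 -1 0",
    "test13 0 1 -1 -1 1 -1",
]
# Rebuild the sample with tab separators, as assumed for the real file.
sample = "\n".join("\t".join(row.split()) for row in rows)

matched = [m.group(0) for m in LINE_RE.finditer(sample)]
for line in matched:
    print(line)
```

Only the rows with -1 in positions 3, 4 and 5 are selected, whatever the value of the last field.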

How to replace specific characters of a string with tab in R

I have a data frame with a string in each row, and I need to replace the n'th character with a tab. Moreover, there is a varying number of spaces before the m'th character that I need to convert to a tab as well.
For instance having following row:
"00001 000 0 John Smith"
I need to replace the 6th character (a space) with a tab and replace the spaces between John and Smith with a tab as well. For all the rows, the last word (Smith) starts at the 75th character. So, basically, I need to replace all spaces before the 78th character with a tab.
I need the above row as follows:
"00001<Tab>000 0 John<Tab>Smith"
Thanks for the help.

You could use gsub here.
x <- c('00001 000 0 John Smith',
'00002 000 1 Josh Black',
'00003 000 2 Jane Smith',
'00004 000 3 Jeff Smith')
x <- gsub("(?<=[0-9]{5}) |(?<!\\d) +(?=(?i:[a-z]))", "\t", x, perl=T)
Output
[1] "00001\t000 0 John\tSmith" "00002\t000 1 Josh\tBlack"
[3] "00003\t000 2 Jane\tSmith" "00004\t000 3 Jeff\tSmith"
To actually see the \t in output use cat(x)
00001 000 0 John Smith
00002 000 1 Josh Black
00003 000 2 Jane Smith
00004 000 3 Jeff Smith

Here's one solution if it always starts at 75. First some sample data
#sample data
a <- "00001 000 0 John Smith"
b <- "00001 000 0 John Smith"
Since you know the positions, I'll use substr to extract the parts, then trim the trailing spaces from the middle part, and finally paste the parts together with tabs.
#extract parts
part1<-substr(c(a,b), 1, 5)
part2<-gsub("\\s*$","",substr(c(a,b), 7, 74))
part3<-substr(c(a,b), 75, 10000L)
#add in tabs
paste(part1, part2, part3, sep="\t")
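The same position-based approach can be sketched in Python with slicing. This is an illustration under the assumption stated in the question that the last word always starts at the 75th character; the sample line is padded to that width here because the spacing in the original post was lost. The function name tabify is made up.

```python
def tabify(line, last_start=75):
    # R's substr is 1-based and inclusive; Python slices are 0-based and
    # half-open, so characters 1-5 become line[0:5], and so on.
    part1 = line[:5]
    part2 = line[6:last_start - 1].rstrip()  # drop the padding before the last word
    part3 = line[last_start - 1:]
    return "\t".join([part1, part2, part3])


# Pad the sample so that "Smith" starts at the 75th character.
sample = "00001 000 0 John".ljust(74) + "Smith"
result = tabify(sample)
print(repr(result))
```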

RegEx: Reject sub-portion of complicated expression

In the sample text below, I want to match groups of text (newlines and all) starting with a line defined by \nI.*' and including the subsequent lines starting with \nA, only if none of the intermediate lines contains "BOM=". I.e. in the example, I would want to match the first "device" and its following attributes, but not the second device, as shown in my comments (after #s).
I 657 device:THAT 2 1290 400 0 1 ' # Start matching here because no lines have "BOM="
A 1335 425 12 0 5 0 some text
A 1335 455 12 0 5 0 some text
A 1300 440 12 0 9 3 some text
A 1370 375 12 0 3 0 some text # Finish matching here
C 655 1 3 0
A 1370 450 12 0 3 3 #=2
C 740 2 4 0
A 1305 450 12 0 9 3 #=1
C 740 2 4 0
A 1305 450 12 0 9 3 #=1
I 318 device:THIS 2 300 1840 0 1 ' # Do not match again here because there's a line with "BOM="
A 320 1880 12 0 7 3 some text
A 320 1880 12 0 9 3 some text
A 380 1880 12 0 1 1 BOM=1,2
A 345 1865 12 0 5 0 some text
A 380 1830 12 0 3 0 some text
C 666 1 3 0
In the sample text, "some text" is various descriptors for electrical devices, e.g. "RATING=63MW", "REFDES=R123". It may contain whitespace but not newlines.
The furthest I've gotten yet is the expression
((\n|^)I((?!misc).)*?'\n)((A.*\n)*(A.*BOM=.*\n)(A.*\n)*)
which matches the opposite of what I want, i.e it finds the text blocks that DO contain BOM=. I thought I could switch this by changing (A.*BOM=.*\n) to (?!(A.*BOM=.*\n)) but this did not work.
I'm hoping to use this in Notepad++ when I'm done.

You can perhaps try this regex:
^I(?:(?!misc).)*'\n(?!(?:A.*\n)*?A.*BOM=)(?:A.*\n)*
regex101 demo
I added a third block where the BOM= is instead on a line starting with C; that device is still matched, because the BOM= is not on one of the consecutive lines beginning with A.
In Notepad++, ^ matches at the start of every line by default, so it's usually not necessary to have (^|\n), but you can put it back if you need it.
I also kept (?:(?!misc).)* in because you had it in your expression, although it has no effect on your sample data.
(?!(?:A.*\n)*?A.*BOM=) is what's making the match fail when there's a BOM= in the lines. It's a negative lookahead which will prevent a match only if A.*BOM= matches after any number of lines of (?:A.*\n)*? (i.e. lines beginning with A).
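The behaviour of the negative lookahead can be checked with a short Python sketch. The sample below is a trimmed version of the question's data with the # comments removed, and the "some text" placeholders are replaced by made-up descriptors in the style the question mentions (RATING=..., REFDES=...). re.MULTILINE is needed in Python so that ^ matches at every line start.

```python
import re

BLOCK_RE = re.compile(
    r"^I(?:(?!misc).)*'\n(?!(?:A.*\n)*?A.*BOM=)(?:A.*\n)*",
    re.MULTILINE,
)

sample = (
    "I 657 device:THAT 2 1290 400 0 1 '\n"
    "A 1335 425 12 0 5 0 RATING=63MW\n"
    "A 1335 455 12 0 5 0 REFDES=R123\n"
    "C 655 1 3 0\n"
    "A 1370 450 12 0 3 3 #=2\n"
    "I 318 device:THIS 2 300 1840 0 1 '\n"
    "A 320 1880 12 0 7 3 RATING=1W\n"
    "A 380 1880 12 0 1 1 BOM=1,2\n"
    "A 345 1865 12 0 5 0 REFDES=R7\n"
    "C 666 1 3 0\n"
)

# Only blocks whose consecutive A lines contain no BOM= are returned.
blocks = BLOCK_RE.findall(sample)
for block in blocks:
    print(block, end="")
```

The first device and its two A lines are matched; the second device is skipped because one of its A lines contains BOM=.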

I'm trying to do a search/replace using regex for mass replacing on Notepad++

I need to add a parameter for each code and name. I tried using (.+) or (.*) for each number, but it didn't work. Each space marks a different number, and not every gap has the same width. Example, from this:
Abanda CDP 192 129 58 0 0 0 2 3 3
2.998 0.013 33.091627 -85.527029 2582661
To this:
Abanda CDP |code1=192 |code2=129 |code3=58 |code4=0 |code5=0 |code6=0 |code7=2 |code8=3 |code9=3
|code9=2.998 |code10=0.013 |code11=33.091627 |code12=-85.527029 |code13=2582661

Try ([0-9.-]+). The reason .+ doesn't work is because . matches whitespace as well. The reason you can't just use \S+ (non-spaces) is because you only want to match the numbers.
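As a rough illustration of what [0-9.-]+ picks up, here is a Python sketch that labels each numeric token with a running counter. This is an assumption about the intended result, not the Notepad++ replacement itself: it numbers the values continuously from code1 to code14, whereas the example in the question uses code9 twice. It also assumes that the name part contains no digits, dots or hyphens, since those would be matched too. The function name add_codes is made up.

```python
import re
from itertools import count


def add_codes(text):
    counter = count(1)
    # Each run of digits, dots and minus signs becomes |codeN=<value>.
    return re.sub(
        r"[0-9.-]+",
        lambda m: f"|code{next(counter)}={m.group(0)}",
        text,
    )


source = (
    "Abanda CDP 192 129 58 0 0 0 2 3 3\n"
    "2.998 0.013 33.091627 -85.527029 2582661"
)
labelled = add_codes(source)
print(labelled)
```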
