Regex working in notepad++ but not in python - regex

I am trying the regex (WVDC)((?:.*\r\n){1}) in Notepad++ and it works, but the same pattern does not match in Python.
The text is:
Above 85°C the rated (DC/AC) voltage must be derated at per 1.5%/2.5%°C
WVDC: 400 Volts DC
SVDC: 600 Volts DC
python code
re.search(r'(WVDC)((?:.*\r\n){1})',txt)

The following script is working for me in Python:
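import re

# "text" instead of "input" so the built-in input() is not shadowed
text = """Above 85°C the rated (DC/AC) voltage must be derated at per 1.5%/2.5%°C
WVDC: 400 Volts DC
SVDC: 600 Volts DC"""
# \r? makes the carriage return optional, so both \r\n and \n line endings match
result = re.findall(r'(WVDC).*\r?\n', text)
print(result)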
['WVDC']
Note that the only substantial change I made to the regex pattern was to make the carriage return \r optional. So it seems that multiline strings in Python, perhaps what your source uses, carry only newlines, but not carriage returns. In any case, using \r?\n to match newlines is generally a good idea, because it can cover both Unix and Windows line endings at the same time.

You haven't shown a reproducible example, but opening files in Python in text mode will convert \r\n to \n. Notepad++ maintains the exact line endings.
Removing \r (or making it optional) from the regex should fix the problem in Python. You could also open the file in binary mode, but processing text in text mode is recommended.
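To make the text-mode behaviour concrete, here is a minimal sketch. The helper name read_both_ways and the sample bytes are made up for illustration. It writes Windows-style line endings to a temporary file, then reads them back once in default text mode and once with newline="" (which turns the translation off):

```python
import os
import re
import tempfile

PATTERN_CRLF = r'(WVDC)((?:.*\r\n){1})'   # the pattern from the question
PATTERN_ANY = r'(WVDC)((?:.*\r?\n){1})'   # same pattern with \r made optional


def read_both_ways(raw_bytes):
    """Write raw_bytes to a temp file; return (translated, untranslated) text."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(raw_bytes)
        # Default text mode: \r\n is converted to \n while reading.
        with open(path, encoding="utf-8") as f:
            translated = f.read()
        # newline="" keeps the line endings exactly as stored on disk.
        with open(path, encoding="utf-8", newline="") as f:
            untranslated = f.read()
    finally:
        os.remove(path)
    return translated, untranslated


sample = "WVDC: 400 Volts DC\r\nSVDC: 600 Volts DC\r\n".encode("utf-8")
translated, untranslated = read_both_ways(sample)

print(repr(translated))
print(re.search(PATTERN_CRLF, translated))          # no \r left, so no match
print(repr(re.search(PATTERN_CRLF, untranslated).group(2)))
print(repr(re.search(PATTERN_ANY, translated).group(2)))
```

If the original pattern has to stay unchanged, opening the file with newline="" is an alternative to binary mode that still gives you str objects.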

Related

Regex to remove unnecessary period in Chinese translation

I use a translator tool to translate English into Simplified Chinese.
Now there is an issue with the period.
In English at the finish point of a sentence, we use full stop "."
In Simplified Chinese, it is "。"which looks like a small circle.
The translation tool mistakenly adds this "small circle" / full stop to every major subtitle.
Is there a way to use Regex or other methods to scan the translated content, and replace any "small circle" / Chinese full stop symbol when the line has only 20 characters or less?
Some test data like below
<h1>这是一个测试。<h1>
这是一个测试,这是一个测试而已,希望去掉不需要的。
测试。
这是一个测试,这是一个测试而已,希望去掉不需要的第二行。
It shall turn into:
<h1>这是一个测试<h1>
这是一个测试,这是一个测试而已,希望去掉不需要的。
测试
这是一个测试,这是一个测试而已,希望去掉不需要的第二行。
Difference:
Line 1 has only about 10 characters, so its Chinese full stop should be removed.
Line 3 is a subheading with only 4 characters, so its full stop should be removed too.
By the way, I was told 1 Chinese word is two English characters.
Is this possible?
I'm using your second approach ("maybe this one is more accurate: if there is no comma in this line, it should not have a full stop") to determine whether a full stop 。 should be removed.
Regex
/^(?=.*。)(?!.*,)([^。]*)。/mg
^ start of a line
(?=.*。) match a line that contains 。
(?!.*,) match a line that doesn't contain ,
([^。]*)。 anything that not a full stop before a full stop, put it in group 1
Substitution
$1
Check the test cases here
Be aware that this only removes the first full stop on each matching line.
If you want to remove all the full stops, you can try (?:\G|^)(?=.*。)(?!.*,)(.*?)。 but this only works in regex engines that support \G, such as PCRE.
Also, if you want to combine the two approaches (the line has no comma, and it is at most 20 characters long), you can try ^(?=.{1,20}$)(?=.*。)(?!.*,)([^。]*)。
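For anyone applying the first pattern from a script rather than an online tester, here is a minimal Python sketch run against the test data from the question. One assumption is added: the character class [,,] also accepts the full-width comma (U+FF0C) that real Chinese text normally uses, whereas the pattern above only checks for the comma as it appears in the sample. Python's re module has no \G, so only the "first full stop per line" variant is shown.

```python
import re

# ^ and the lookaheads are applied per line because of re.M.
# [,,] is an assumption: ASCII comma or full-width comma (U+FF0C).
NO_COMMA_LINE = re.compile(r'^(?=.*。)(?!.*[,,])([^。]*)。', re.M)

text = (
    "<h1>这是一个测试。<h1>\n"
    "这是一个测试,这是一个测试而已,希望去掉不需要的。\n"
    "测试。\n"
    "这是一个测试,这是一个测试而已,希望去掉不需要的第二行。"
)

# Python uses \1 (not $1) for the first group in the replacement string.
cleaned = NO_COMMA_LINE.sub(r'\1', text)
print(cleaned)
```

Lines containing a comma are left alone; only the heading line and the short "测试。" line lose their full stop.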

Removing new lines from a text file in Notepad++

I need to replace all the strings that look like this:
<\name>
for a TAB
name can be anything from 3 to 15 characters long
I've managed to do it by doing search <.*> replace with \t
Now I need to replace every new line with a single TAB, i.e. remove the new lines. For some reason Ultraedit doesn't recognise the new line in the search box. I've tried \r and \n, but neither works.
This is an example of the file, after the search and replace:
1
101
54651
150756
282
506
398
2759
59.62
35737
65
I want to get all that in a single line separated by tabs.
Any ideas?
As you're using Notepad++ I'll assume you're on Windows.
This means the text files you're using were likely created on a DOS type system (including Windows...) and therefore terminate lines with \r\n rather than a single \n like you might find on a UNIX system.
Try searching for that instead.
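If the editor search keeps fighting you, the same replacement can be sketched outside the editor in Python. The function name and sample data below are illustrative only. The pattern accepts \r\n, a lone \n, or a lone \r, so it does not matter which system produced the file:

```python
import re


def join_lines_with_tabs(text):
    """Replace each line break (CRLF, LF, or a lone CR) with a single tab."""
    # Strip trailing line breaks first so the result does not end with a tab.
    return re.sub(r'\r\n|\r|\n', '\t', text.rstrip('\r\n'))


sample = "1\r\n101\r\n54651\r\n150756\r\n"
print(repr(join_lines_with_tabs(sample)))
```

When reading the real file, open it with newline="" (or in binary mode) if you want to see the original \r\n sequences rather than Python's translated \n.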

Sublime Text macro to find and replace file path characters on current line

I use Sublime 2 for developing R and PHP code, although I imagine this shortcut would be useful for other languages.
If I copy the path of a file from Windows Explorer / XYPlorer (or other source) it has backslashes for directories. When entering a path into the source code, it needs forward slashes.
Sublime has some reasonably powerful macro commands, but I cannot think of a combination that would be able to:
take the string of text on the current line
replace all instances of '\' and replace them with '/'
Here is the workflow that I envisage:
Locate my filename in Explorer and copy its path
In Sublime, write a line of code and paste in the path
Hit a keyboard shortcut, say Ctrl+Shift+\, and all back slashes are converted to forward slashes
The result:
myPath = "E:\WORK\Code\myFile.csv";
Becomes:
myPath = "E:/WORK/Code/myFile.csv";
Without running the risk of backslashes elsewhere in the file being changed (e.g. \n characters), and without having to use multiple key presses or mouse clicks.
I imagine this would be possible with Regex. Two things I am no expert in are Sublime macros or regex, so I wonder if anyone else knows the magical commands that would achieve this?
I tried this for about 15 minutes. A few things:
Sublime text 2 doesn't allow for find/replace with macros
Sublime text 3 doesn't allow for 'find in selection'
So, I think you are kind of beat right now other than writing a plugin, which would be fairly straightforward.
This works for Sublime Text 3:
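Type r before the string to tell Python to treat it as a raw string.
This way each backslash is kept as a literal backslash instead of starting an escape sequence (the default meaning of \ in Python).
Example
myPath = r"E:\WORK\Code\myFile.csv"
Python does not convert the \ to /; the string still contains backslashes, which Windows accepts as path separators. Note that this applies to Python only, not to R or PHP.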
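A quick check of what the r prefix actually does, and two ways to get forward slashes when the code really needs them. This is a small sketch using only the standard library; the variable names are illustrative.

```python
from pathlib import PureWindowsPath

raw_path = r"E:\WORK\Code\myFile.csv"
escaped_path = "E:\\WORK\\Code\\myFile.csv"

# The raw string and the manually escaped string are the same text:
# both still contain backslashes.
print(raw_path == escaped_path)

# Option 1: plain string replacement.
forward = raw_path.replace("\\", "/")
print(forward)

# Option 2: let pathlib parse the Windows path (works on any OS).
print(PureWindowsPath(raw_path).as_posix())
```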

Line terminators

What are difference between:
\r\n - Carriage return followed by line feed.
\n - Line feed.
\r - Carriage Return.
They are the line terminator used by different systems:
\r\n = Windows
\n = UNIX and Mac OS X
\r = Old Mac
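If you want to write one out, you can let a text-mode stream abstract it: std::endl writes \n, which the runtime translates to the platform's terminator, and then flushes the stream: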
std::cout << "Hello World" << std::endl;
In general, an \r character moves the cursor to the beginning of the line, while an \n moves the cursor down one line. However, different platforms interpret this in different ways, leading to annoying compatibility issues, especially between Windows and UNIX. This is because Windows requires an \r\n to move down one line and move the cursor to the start of the line, whereas on UNIX a single \n suffices.
Also, obligatory Jeff Atwood link: http://www.codinghorror.com/blog/2010/01/the-great-newline-schism.html
Historical info
The terminology comes from typewriters. Back in the day, when people used typewriters to write, when you got to the end of a line you'd press a key on the typewriter that would mechanically return the carriage to the left side of the page and feed the page up a line. The terminology was adopted by computers and represented as the ascii control codes 0xa, for the linefeed, and 0xd for the carriage return. Different operating systems use them differently, which leads to problems when editing a text file written on a Unix machine on a Windows machine and vice-versa.
Pragmatic info
On Unix based machines in text files a newline is represented by the linefeed character, 0xa. On Windows both a linefeed and carriage return are used. For example when you write some code on Linux that has the following in it where the file was opened in text mode:
fprintf(f, "\n");
the underlying runtime will insert only a linefeed character 0xa to the file. On Windows it will translate the \n and insert 0xd0xa. Same code but different results depending on the operating system used. However this changes if the file is opened in binary mode on Windows. In this case the insertion is done literally and only a linefeed character is inserted. This means that the sequence you asked about \r\n will have a different representation on Windows if output to a binary or text stream. In the case of it being a text stream you'll see the following in the file: 0xd0xd0xa. If the file was in binary mode then you'll see: 0xd0xa.
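The byte sequences described above can be reproduced on any operating system with a small Python sketch. This swaps C's fprintf for Python's open(), whose newline argument lets you pick the translation explicitly: newline="\r\n" stands in for a Windows text-mode stream and newline="" for a binary-style stream with no translation. The helper name bytes_written is made up for the example.

```python
import os
import tempfile


def bytes_written(text, newline):
    """Write text using the given newline translation; return the raw bytes on disk."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline=newline) as f:
            f.write(text)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


# "\n" through a Windows-style text stream becomes 0x0d 0x0a.
print(bytes_written("\n", "\r\n").hex(" "))
# "\r\n" through the same stream becomes 0x0d 0x0d 0x0a.
print(bytes_written("\r\n", "\r\n").hex(" "))
# With translation switched off, "\r\n" is written literally.
print(bytes_written("\r\n", "").hex(" "))
```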
Because of the differences in how operating systems represent newlines in text files text editors have had to evolve to deal with them, although some, like Notepad, still don't know what to do. So in general if you're working on Windows and you're given a text file that was originally written on a Unix machine it's not a good idea to edit it in Notepad because it will insert the Windows style carriage return linefeed (0xd0xa) into the file when you really want a 0xa. This can cause problems for programs running on old Unix machines that read text files as input.
See http://en.wikipedia.org/wiki/Newline
Different operating systems have different conventions: Windows uses \r\n, classic Mac OS (before OS X) used \r, and UNIX (including Mac OS X) uses \n.
\r
This sends the cursor to the first column of the display.
\n
This moves the cursor to the next line of the display, but the cursor stays in the same column as on the previous line.
\r\n
This combines the two above: the cursor moves to the next line and also to the first column of the display.
Some runtime libraries emit both a carriage return and a new line when you write only \n (for example, text-mode output on Windows).
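When a file may mix all three conventions, one way to cope is to normalise before processing. A short Python sketch with a made-up sample string:

```python
import re

mixed = "windows\r\nunix\nold mac\rend"

# str.splitlines() treats \r\n, \n and a lone \r each as one line break.
lines = mixed.splitlines()
print(lines)

# Alternatively, rewrite every terminator as \n and keep a single string.
# \r\n must be tried before \r so it is not turned into two line breaks.
normalized = re.sub(r'\r\n|\r', '\n', mixed)
print(repr(normalized))
```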

Does the Wrap function in ColdFusion insert CR/LFs?

I have the need to do some word wrapping with a few considerations:
Source file is MS WORD
Copy and paste the text into a textarea in a cfform.
Use #wrap(theTextVar,80)# to output the text wrapped at 80 characters per line.
The text is uploaded to a legacy system which needs ANSI or ASCII characters.
Everything seems to work okay. I just wanted to see if anyone else has had luck doing this, and whether a CR / LF is inserted after each line in the output text (step 3).
From the docs on wrap():
Uses the operating-system specific
line break: newline for UNIX, carriage
return and newline on Windows.
So if you are doing this on a Windows box, then the answer is yes.
Tried this?
<cffile action="write" file="i_will_show_the_secret_if_you_open_me_in_text_editor.html" output="#wrap(theTextVar,80)#" />