Regex to remove unnecessary period in Chinese translation - regex

I use a translation tool to translate English into Simplified Chinese.
There is an issue with the full stop.
In English, a sentence ends with a full stop ".".
In Simplified Chinese, it is "。", which looks like a small circle.
The translation tool mistakenly adds this "small circle" / full stop to every major subtitle.
Is there a way to use regex or another method to scan the translated content and remove any "small circle" / Chinese full stop when the line has 20 characters or fewer?
Some test data is below:
<h1>这是一个测试。<h1>
这是一个测试,这是一个测试而已,希望去掉不需要的。
测试。
这是一个测试,这是一个测试而已,希望去掉不需要的第二行。
It should turn into:
<h1>这是一个测试<h1>
这是一个测试,这是一个测试而已,希望去掉不需要的。
测试
这是一个测试,这是一个测试而已,希望去掉不需要的第二行。
Differences:
Line 1 has only 10 characters, so its Chinese full stop should be removed.
Line 4 is a subheading; it has only 4 characters, so its full stop should be removed too.
By the way, I was told that 1 Chinese character counts as two English characters.
Is this possible?

I'm using approach 2 ("if there is no comma in this line, it should not have a full stop") to determine whether a full stop 。 should be removed.
Regex
/^(?=.*。)(?!.*,)([^。]*)。/mg
^ start of a line
(?=.*。) match a line that contains 。
(?!.*,) match a line that doesn't contain ,
([^。]*)。 anything that is not a full stop, followed by a full stop; the part before the full stop goes into group 1
Substitution
$1
Check the test cases here
But do note that this only removes the first full stop in a line.
If you want to remove all the full stops, you can try (?:\G|^)(?=.*。)(?!.*,)(.*?)。 but this only works in regex engines that support \G, such as PCRE.
Also, if you want to combine the two approaches (the line has no comma and is at most 20 characters long), you can try ^(?=.{1,20}$)(?=.*。)(?!.*,)([^。]*)。
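If you would rather run this from a script than from a regex tester, here is a minimal Python sketch of the same idea. The function name strip_heading_stops and the max_len parameter are made up for this example. The comma class also accepts the full-width comma ",", which is my assumption, since the test data above only uses ",". Python's re module has no \G, so like the first pattern this removes only the first full stop in a line.

```python
import re


def strip_heading_stops(text, max_len=None):
    """Remove the first Chinese full stop from lines that contain no comma.

    If max_len is given, only lines of at most max_len characters are touched
    (the combined approach). Names here are illustrative, not from the answer.
    """
    # Optional length lookahead, e.g. (?=.{1,20}$)
    length_check = "" if max_len is None else r"(?=.{1,%d}$)" % max_len
    pattern = re.compile(
        r"^" + length_check + r"(?=.*。)(?!.*[,,])([^。]*)。",
        re.M,  # ^ and $ match at every line start and end
    )
    # Keep group 1 (the text before the full stop), drop the full stop itself.
    return pattern.sub(r"\1", text)


sample = "\n".join([
    "<h1>这是一个测试。<h1>",
    "这是一个测试,这是一个测试而已,希望去掉不需要的。",
    "测试。",
])
print(strip_heading_stops(sample))
```

Calling strip_heading_stops(sample, max_len=20) applies the combined rule instead.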

Related

gvim syntax highlight for different types of lines

I've written several syntax highlighting files for simple custom formats in the past (sometimes even changing the format a bit so that I could write the syntax file with the skills I have).
But this time I'm confused and would appreciate some help.
The file is a plain text file where every line contains three distinct elements separated by spaces. Each element is either a "symbol" (a name made of alphanumeric characters and hyphens) or a "string" (a series of any characters, spaces included, but not pipes).
Strings can appear only at the start or end of a line; the middle element is always a symbol. A string is delimited by a pipe: the pipe follows the string if it is the first element and precedes it if it is the last element.
So a line can be all symbols, a string first and the rest symbols, a string last and the rest symbols, or strings first and last with a symbol in between.
Examples:
All symbols
this-is-a-symbol another-one and-another
First string
This is a string potentially containing any char| symbol symbol
Last string
symbol symbol |A string at the end of the line
First and last as strings
This is a string| now-we-have-a-symbol |And here another string
These four examples are the only valid line formats.
All symbols need to be colored differently: one specific color for the first element, one for the second, and one for the third.
Strings, however, get a single color of their own regardless of position.
If the pipe characters can be "dimmed" with a color similar (but not identical) to the background, that would be a big plus. But I think I can manage this myself.
Any line in the file that does not follow one of the formats shown should be highlighted as an error (for example, with a red background).
Can anyone help?
PS: Stack Overflow applies a sort of syntax highlighting to my examples, which can be misleading.
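For reference, the four line shapes described above can be written as a single regular expression. The sketch below is in Python rather than Vim script, purely to make the format precise. The names SYMBOL, FIRST, LAST and is_valid_line are invented for this example, and it assumes elements are separated by exactly one space and that symbols and strings are non-empty.

```python
import re

# A symbol: alphanumeric characters and hyphens.
SYMBOL = r"[A-Za-z0-9-]+"
# First element: a symbol, or a string followed by a pipe.
FIRST = rf"(?:{SYMBOL}|[^|]+\|)"
# Last element: a symbol, or a string preceded by a pipe.
LAST = rf"(?:{SYMBOL}|\|[^|]+)"
# The middle element is always a symbol.
LINE = re.compile(rf"{FIRST} {SYMBOL} {LAST}")


def is_valid_line(line):
    """Return True if the whole line matches one of the four allowed shapes."""
    return LINE.fullmatch(line) is not None


for example in [
    "this-is-a-symbol another-one and-another",
    "This is a string| now-we-have-a-symbol |And here another string",
    "only two",
]:
    print(is_valid_line(example), example)
```

A line for which is_valid_line returns False is the kind of line that should get the error highlighting.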
I have found a simpler approach than the regular expressions I initially thought were necessary. In the end I only need to match the first element and the last one; how did I not think of that? So this is my solution, and it seems to work well for my needs. The only thing it doesn't do is highlight badly formatted lines. Good enough for now. Thanks for your patience and attention.
" Vim syntax file
" Language: ff .txt
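if exists("b:current_syntax")
  finish
endif
setlocal iskeyword+=:
" First and last elements when they are symbols
syn match Asymbol /^[a-zA-Z0-9\-]* /
syn match Csymbol / [a-zA-Z0-9\-]*$/
" First and last elements when they are pipe-delimited strings
syn match Astring /^.*| /
syn match Cstring / |.*$/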
highlight link Asymbol Constant
highlight link Csymbol Statement
highlight link Astring Include
highlight link Cstring Comment
let b:current_syntax = "ff"

searching for a particular string on a particular line position efficiently in VIM

As all of you know, Vim is an awesome tool for doing things like this. I'm searching for errors in some reporting text. Say, for example, I'm looking for the string "0000" in the text; I would enter the command ":s/0000". By default, it highlights every instance of that sequence, and I mean all of them. The good part is that I know, for example, that the string begins at the 11th position in each line.
What I would like to know is whether there is a command with which I can globally search for the string "0" ONLY where it occurs at the 11th position in each line.
I appreciate your time. Thank you.
Vim has some special regular expression atoms for that. For a screen column, there's \%v. So, to search for 0000 at exactly column 11, you'd use
/\%11v0\{4}
There are also variants for less-than (\%<v) and greater-than, as well as similar atoms for byte counts and line numbers. See :help /\%l and following paragraphs.
This is an easy one. Search for the following regular expression:
^ : (start of line)
.\{10} : exactly 10 arbitrary characters
0000 : the string supposed to start at column 11
so what you enter is:
/^.\{10}0000

EditPad: Need a regex that handles multiple possible data formats

First, I'm using EditPadPro for my regex cleaning, so any answers given should work within that environment.
I get a large spreadsheet full of data that I have to clean every day. I've managed to get it down to a couple of different regexes that I run, and this works... but I'm curious to see if it's possible to reduce down to a single regex.
Here is some sample data:
3-CPC_114851_70095_70095_CAN-bre
3-CPC_114851_70095_70095_CAN
b11-ao1-113775-bre
b7-ao-114441
b7-ao-114441-bre
b7-ao1-114441
b7-ao1-114441-bre
http://go.nlvid.com/results1/?http://bo
go.nlv/results1/?click
b4-sm-1359
b6-sm-1356-bre
1359_195_1453814569-bre
1356_104_1456856729
b15-rad-8905
b15-rad-8905-bre
Here is how the above data needs to end up:
114851-bre
114851
113775-bre
114441
114441-bre
114441
114441-bre
http://go.nlvid.com/results1/
go.nlv/results1/
sm-1359
sm-1356-bre
sm-1359-bre
sm-1356
rad-8905
rad-8905-bre
So, there are numerous rules, such as:
In cases of more than 2 underscores, the result needs to contain only the value immediately after the first underscore, and everything from the dash onwards.
In cases where the string contains "-ao-", "-ao1-", everything prior to the final numeric string should be removed.
If a question mark is present, everything from the mark onwards should be removed.
If the string contains "-sm-" or "-rad-", everything prior to those alpha strings should be removed.
If the string contains 2 underscores, everything after the first numeric string up to a dash (if present) should be removed, and the string "sm-" should be prepended.
Additionally there is other data that must be left untouched, including but not limited to:
113535|24905|24905
as well as many variations on this pattern of xxxxxx|yyyyy|zzzzz (and not always those string lengths)
This may be asking way too much of regex, I'm not sure as I'm not great with it. But I've seen some pretty impressive things done with it, so I thought I'd put this out to the community and see what you come back with.
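To pin down what the rules above mean before trying to squeeze them into one regex, here is a rough Python sketch that applies them one at a time. It is not an EditPad solution; the function name clean and the order in which the rules are tried are my own assumptions, chosen so that the sample data comes out as listed.

```python
import re


def clean(value):
    """Apply the cleaning rules to one value; unmatched values pass through."""
    # A question mark: drop it and everything after it.
    if "?" in value:
        return value.split("?", 1)[0]

    underscores = value.count("_")
    if underscores > 2:
        # Keep the value right after the first underscore plus any "-suffix".
        m = re.match(r"[^_]*_([^_]+)_[^-]*(-.*)?$", value)
        if m:
            return m.group(1) + (m.group(2) or "")
    if underscores == 2:
        # Keep the leading number plus any "-suffix", and prepend "sm-".
        m = re.match(r"(\d+)_[^-]*(-.*)?$", value)
        if m:
            return "sm-" + m.group(1) + (m.group(2) or "")

    # "-ao-" or "-ao1-": keep only what follows it.
    m = re.search(r"-ao1?-(.+)$", value)
    if m:
        return m.group(1)

    # "-sm-" or "-rad-": keep from "sm"/"rad" onwards.
    m = re.search(r"-((?:sm|rad)-.+)$", value)
    if m:
        return m.group(1)

    # Anything else (e.g. 113535|24905|24905) is left untouched.
    return value


for raw in ["3-CPC_114851_70095_70095_CAN-bre", "b7-ao1-114441", "113535|24905|24905"]:
    print(raw, "->", clean(raw))
```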
Jonathan, I can wrap all of those into one regex, except the last one (where you prepend sm- to a string that does not contain sm). It is not possible in this context, because we cannot capture "sm" to reuse in the replacement, and because there is no "conditional replacement" syntax in EPP.
That being said, you can achieve what you want in EPP with two regexes and one macro to chain the two.
Here is how.
The solution below is tested in EPP.
Regex 1
1. Press Ctrl + Shift + F to enter Search / Replace mode.
2. Enter the following Search and Replace in the appropriate boxes.
3. At the top right of the Search bar, click the Favorite Searches pull-down, select "Add", give it a name, e.g. Regex 1.
Search:
(?mx)^
(?=(?:[^_\r\n]*?_){3})[^_\r\n]+?_([^_\r\n]+)[^-\r\n]+(-[^\r\n]+)?
|
[^\r\n]*?-ao1?-\D*([^\r\n]+)
|
([^\r\n?]*)(?=\?)[^\r\n]+
|
[^\r\n]*?-((?:sm|rad)-[^\r\n]+)
Replace:
\1\2\3\4\5
Regex 2
Same 1-2-3 steps as above.
Search
^(?!(?:[^_\r\n]*?_){3})(?=(?:[^_\r\n]*?_){2})(\d+)(?:[^-\r\n]+(-[^\r\n]+)?)
Replace
sm-\1\2
Chaining Regex 1 and Regex 2
Top menu: Macros, Record Macro, give it a name.
Click the Favorite searches pulldown, select Regex 1
Hit Replace All.
Click the Favorite searches pulldown, select Regex 2
Hit Replace All.
Macros, Stop recording.
Whenever you want to do your sequence of replacements, pull it by name under the Macros menu.
Testing This
I have tested my "Jonathan macro" on your input. Here is the result:
114851-bre
114851
113775-bre
114441
114441-bre
114441
114441-bre
http://go.nlvid.com/results1/
go.nlv/results1/
sm-1359
sm-1356-bre
sm-1359-bre
sm-1356
rad-8905
rad-8905-bre
Try this:
Toggle the Search Panel : SHIFT+CTRL+F
SEARCH: .*?((?:sm-|rad-)?(?:(?:\d+|[\w\.]+\/.*?))(?:-\w+)?$)
REPLACE: $1
Check REGEX and WORDS
Click Replace All or Hit CTRL+ALT+F3

Regular Expression - Matching and extracting complicated conditions

I'm trying to write a regular expression that will match these conditions:
Maximum of 8000 characters (any characters, including "\r\n")
Maximum of 10 lines (separated by \r\n).
I then want to extract only the first 4 lines from the matched text.
I can't find a good way to do it... :/
Thanks!!
Regular expressions are not what you need. They are used to match a certain pattern, not a certain length. If you are holding the data in a string, myString.length <= 8000 is all you need for the character count (using the correct syntax for your language, of course). For the number of lines, you will have to count the number of \r\n sequences in your string (can be done iteratively). To get the first four lines, just find the 4th \r\n and get everything before that with a substring method.
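A minimal sketch of that non-regex approach, written in Python since no language was specified. The function name first_four_lines and its parameters are made up for this example. It assumes lines are separated by \r\n only, and it counts a trailing \r\n as starting one more (empty) line.

```python
def first_four_lines(text, max_chars=8000, max_lines=10, keep=4):
    """Return the first `keep` lines of text, or None if a limit is exceeded."""
    # Character limit: a plain length check, no pattern needed.
    if len(text) > max_chars:
        return None
    # Line limit: split on the \r\n separator and count the pieces.
    lines = text.split("\r\n")
    if len(lines) > max_lines:
        return None
    # Extraction: everything before the 4th \r\n (or the whole text if shorter).
    return "\r\n".join(lines[:keep])


print(repr(first_four_lines("one\r\ntwo\r\nthree\r\nfour\r\nfive")))
```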
Description
This expression does the following:
validates that the input string is between zero and 8,000 characters long
validates that there are at most 10 lines of newline-delimited text
then captures the first 4 newline-delimited lines of text
\A(?=.{0,8000}\Z)(?=(?:^.*?(?:\r|\n|\Z)){0,10}\Z)(?:^.*?[\r\n\Z]+){0,4}
This requires the options m (multiline) and s (dot matches all characters).
Expanded
\A anchors to the beginning of the string; this anchor allows the use of the s option, which lets . match newline characters (carriage return and line feed)
(?=.{0,8000}\Z) looks ahead and validates that there are between zero and 8,000 characters
(?=(?:^.*?(?:\r|\n|\Z)){0,10}\Z) looks ahead and validates that there are no more than 10 newline-delimited lines
(?:^.*?[\r\n\Z]+){0,4} matches the first 4 lines of text
PHP Code Example:
You didn't specify a language so I'm including this PHP example to show how it works and the sample output.
Input Text
This input test is 8 lines of new line delimited strings. There are only 1779 characters here.
Far far away, behind the word mountains, far from the countries Vokalia and Consonantia, there live the blind texts. Separated they live in Bookmarksgrove right at the coast of the Semantics, a large language ocean. A small
river named Duden flows by their place and supplies it with the necessary regelialia. It is a paradisematic country, in which roasted parts of sentences fly into your mouth. Even the all-powerful Pointing has no control about
the blind texts it is an almost unorthographic life One day however a small line of blind text by the name of Lorem Ipsum decided to leave for the far World of Grammar. The Big Oxmox advised her not to do so, because there were
thousands of bad Commas, wild Question Marks and devious Semikoli, but the Little Blind Text didn’t listen. She packed her seven versalia, put her initial into the belt and made herself on the way. When she reached the first hills of
the Italic Mountains, she had a last view back on the skyline of her hometown Bookmarksgrove, the headline of Alphabet Village and the subline of her own road, the Line Lane. Pityful a rethoric question ran over her cheek, then
she continued her way. On her way she met a copy. The copy warned the Little Blind Text, that where it came from it would have been rewritten a thousand times and everything that was left from its origin would be the word "and"
and the Little Blind Text should turn around and return to its own, safe country. But nothing the copy said could convince her and so it didn’t take long until a few insidious Copy Writers ambushed her, made her drunk with Longe
and Parole and dragged her into their agency, where they abused her for their projects again and again. And if she hasn’t been rewritten, then they are still using her.
Code
<?php
$sourcestring="your source string";
preg_match('/\A(?=.{0,8000}\Z)(?=(?:^.*?(?:\r|\n|\Z)){0,10}\Z)(?:^.*?[\r\n\Z]+){0,4}/ims',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>
Matches
$matches Array:
(
[0] => Far far away, behind the word mountains, far from the countries Vokalia and Consonantia, there live the blind texts. Separated they live in Bookmarksgrove right at the coast of the Semantics, a large language ocean. A small
river named Duden flows by their place and supplies it with the necessary regelialia. It is a paradisematic country, in which roasted parts of sentences fly into your mouth. Even the all-powerful Pointing has no control about
the blind texts it is an almost unorthographic life One day however a small line of blind text by the name of Lorem Ipsum decided to leave for the far World of Grammar. The Big Oxmox advised her not to do so, because there were
thousands of bad Commas, wild Question Marks and devious Semikoli, but the Little Blind Text didn’t listen. She packed her seven versalia, put her initial into the belt and made herself on the way. When she reached the first hills of
)

Separating out a list with regex?

I have a CSV file which has been generated by a system. The problem is with one of the fields which used to be a list of items. An example of the original list is below....
The serial number of the desk is 45TYTU
This is the second item in the list
The colour of the apple is green
The ID code is 489RUI
This is the fourth item in the list.
And unfortunately the system spits out the code below.....
The serial number of the desk is 45TYTUThis is the second item in the listThe colour of the apple is greenThe ID code is 489RUIThis is the fourth item in the list.
As you can see, it ignores the line breaks and just bunches everything up. I am unable to modify the system that generates this output so what I am trying to do is come up with some sort of regex find and replace expression that will separate them out.
My original thought was to try to detect when an upper case letter is in the middle of a lower case word, but, as with one of the items in the example, a line that ends in a serial number throws this off.
Anyone any suggestions? Is regex the way to go?
--- EDIT ---
I think I need to simplify things for myself and, for now, ignore the fact that lines ending in a serial number will break things. I just need an expression that inserts a line break when it detects an upper case letter directly after a lower case one.
--- EDIT 2 ---
Using the example given by fardjad, everything works for the sample data given. The string was...
(.(?=[A-Z][a-z]))
Now, as I test with more data, I can see an issue appearing: certain lines begin with numbers, so it treats these as serial numbers. You can see an example of this at http://regexr.com?2vfi5
There are only about 10 known numbers it uses at the start of the lines such as 240v, 120v etc...
Is there a way to exclude these?
That won't be a robust solution, but this is what you asked for. It matches the character before an uppercase letter followed by a lowercase one. You can simply use a regex replace and append a newline character:
(.(?=[A-Z][a-z]))
see this demo.
You could search for this
(?<=\p{Ll})(?=\p{Lu})
and replace with a linebreak. The regex matches the empty space between a lowercase letter \p{Ll} and an uppercase letter \p{Lu}.
This assumes you're using a Unicode-aware regex engine (.NET, PCRE, Perl for example). If not, you might also get away with
(?<=[a-z])(?=[A-Z])
but this of course only detects lower-/uppercase changes in ASCII words.
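To compare the two ASCII patterns suggested in this thread, here is a small Python sketch that runs both on the sample from the question. The function names are invented for the example, and the Unicode \p{Ll} / \p{Lu} variant is left out because Python's built-in re module does not support \p classes.

```python
import re


def split_before_capitalised(text):
    """Break after any character that precedes an upper-case + lower-case pair."""
    return re.sub(r"(.)(?=[A-Z][a-z])", r"\1\n", text)


def split_on_case_change(text):
    """Break at the empty position between a lower-case and an upper-case letter."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", "\n", text)


sample = (
    "The serial number of the desk is 45TYTU"
    "This is the second item in the list"
    "The colour of the apple is green"
    "The ID code is 489RUI"
    "This is the fourth item in the list."
)
print(split_before_capitalised(sample))
print("---")
print(split_on_case_change(sample))
```

On this sample the first pattern recovers all five items, while the lookaround version misses the two breaks that follow a serial number, because there the capital letter follows another capital rather than a lower-case letter. The first pattern would also break before any capitalised word in the middle of an item, so neither is robust.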