Notepad++ - Find duplicate lines by first word - regex

I've been searching on the internet to solve my problem and read through several Stack Overflow topics but I can't get it working.
So I have multiple files with translations. Each line contains one translation with a translation key in front of it.
The key is split from the translated phrase with a :. There can be multiple underscores in the key but no space. Each key has to be unique even if the translation phrase is different.
This is a small example how these files look:
CONFIRM: Conferma
FOR: per
YOU_WILL_RECEIVE: Riceverai
ENCHANTED: Incantato
ITEMS_AVAILABLE: Prodotti disponibili
CONTINUE: Continua
MONEY_PAID: Money Paid
GUI_OVERVIEW_OPENSHOPS_ACTION: Clicca per aprire i negozi
GUI_OVERVIEW_OPENSETTINGS_ACTION: Clicca per aprire le impostazioni
GUI_SHOPSETTINGS_BUY_LEFTACTION: **Tasto Sinistro** per **cambiare** il prezzo d'acquisto
GUI_SHOPSETTINGS_BUY_QACTION: **Premi Q** per disabilitare l'acquisto
ENCHANTED: Incantato premituro
GUI_SHOPSETTINGS_BUY_OTHERACTION: **Clicca** per abilitare l'**acquisto**
In this example the ENCHANTED key is duplicate even these keys have different translation phrases. I just want to see that this key is duplicated.
My plan is to match all these lines with a regex pattern with the help of notepad++ but if it's easier for you, it would also be okay if I have to use a script. Something like Batch or even a little NodeJS application.

In notepad++ you can use this regex to find the first occurrence of any key which is duplicated:
^(\w+):(?=.*\R\1:)
It looks for a sequence of word characters between the start of a line and a :, captured in group 1 and then asserts a positive lookahead for the same string starting a line again (\R matches a newline/crlf character). Note you need to have the . matches newline checkbox selected.

Related

Splitting name/value pairs with regex to ignore special characters based on surrounding characters

I have this regex that's worked well so far that splits 'name=value' pairs separated by a given character.
(?s)([^\s=]+)=(.*?)(?=\s+[^\s=]+=|\Z)
I know the separator, but the problem is in the example below (tab separated):
usrName=Wilma sev=4 cat=Detection CommandLine="C:\powershell.exe" -Enc 0ATQBpAG0AAcABDAHIAZQBkAHMAIgA= IOCValue= ProcessEndTime=2023-01-18 15:51:05
https://regex101.com/r/1wgVxs/5
Some values can have no value in the case of 'IOCValue' which works as expected, however some values like the CommandLine are giving me up to -Enc as one match and the remainder to the next pair as another.
What I'm hoping to get out from the above is:
usrName=Wilma
sev=4
cat=Detection
CommandLine="C:\powershell.exe" -Enc 0ATQBpAG0AAcABDAHIAZQBkAHMAIgA=
IOCValue=
ProcessEndTime=2023-01-18 15:51:05
But I'm getting:
usrName=Wilma
sev=4
cat=Detection
CommandLine="C:\powershell.exe" -Enc
0ATQBpAG0AAcABDAHIAZQBkAHMAIgA=
IOCValue=
ProcessEndTime=2023-01-18 15:51:05
Given I know the separator is a tab I think what I need is to only look for name=value pairs when they are at the start of the line or proceeded by the separator (tab). Is this possible?
Note, I can expect a space separator too, but I have a less performant and messy non-regex version I can send these too, so presume tab.
You may use this simplified regex:
(?s)([^\s=]+)=(.*?)(?=\t|\Z)
Updated RegEx Demo
Here, lookahead (?=\t|\Z) will make sure that value part is followed by either a tab character or end position.

RegEx in Notepad++ to find lines with less or more than n pipes

I have a large pipe-delimited text file that should have one 3-column record per line. Many of the records are split up by line breaks within a column.
I need to do a find/replace to get three, and only three, pipes per line/record.
Here's an example (I added the line breaks (\r\n) to demonstrate where they are and what needs to be replaced):
12-1234|The quick brown fox jumped over the lazy dog.|Every line should look similar to this one|\r\n
56-7890A|This record is split\r\n
\r\n
on to multiple lines|More text|\r\n
09-1234AS|\r\n
||\r\n
\r\n
56-1234|Some text|Some more text\r\n
|\r\n
76-5432ABC|A record will always start with two digits, a dash and four digits|There may or may not be up to three letters after the four digits|\r\n
The caveat is that I need to retain those mid-record line breaks for the target system. They need to be replaced with \.br\. So the final result of the above should look like this:
12-1234|The quick brown fox jumped over the lazy dog.|Every line should look similar to this one|\r\n
56-7890A|This record is split\.br\\.br\on multiple lines|More text|\r\n
09-1234AS|\.br\||\.br\\r\n
56-1234|Some text|Some more text\.br\|\r\n
76-5432ABC|A record will always start with two digits, a dash and four digits|There may or may not be up to three letters after the four digits|\r\n
As you can see the mid-record line breaks have all been replaced with \.br\ and the end-of-line line breaks have been retained to keep each three-column/pipe record on its own line. Note the last record's text, explaining how each line/record begins. I included that in case that would help in building a regex to properly identify the beginning of a record.
I'm not sure if this can be done in one find/replace step or if it needs to be (or just should be) split up into a couple of steps.
I had the thought to first search for |\r\n, since all records end with a pipe and a CRLF, and replace those with dummy text !##$. Then search for the remaining line breaks with \r\n, which will be mid-column line breaks and replace those with \.br\, then replace the dummy text with the original line breaks that I want to keep |\r\n.
That worked for all but records that looked like the third record in the first example, which has several line breaks after a pipe within the record. In such a large file as I am working with it wasn't until much later that I found that the above process I was using didn't properly catch those instances.
You can use
(?:\G(?!^(?<!.))|^\d{2}-\d+[A-Z]*\|[^|]*?(?:\|[^|]*?)?)\K\R+
Replace with \\.br\\. See the regex demo. Details:
(?:\G(?!^(?<!.))|^\d{2}-\d+[A-Z]*\|[^|]*?(?:\|[^|]*?)?) - either the end of the previous match (\G(?!^(?<!.))) or (|) start of a line, two digits, 0, one or more digits, zero or more letters, a |, then any zero or more chars other than |, as few as possible, and then an optional sequence of | and any zero or more chars other than |, as few as possible (see ^\d{2}-\d+[A-Z]*\|[^|]*?(?:\|[^|]*?)?)
\K - omit the text matched
\R+ - one or more line breaks.
See the Notepad++ demo:
If you need to remove empty lines after this, use Edit > Line Operations > Remove Empty Lines.

Special characters in EBS Search Strings?

I am working on the EBS configuration side of the SAP ERP system where I am trying to define Search Strings for the MT940 format (as per SAP SPRO activity "Define Search String for Electronic Bank Statement", for instance see this blog post).
I am trying to create a search pattern that is able to identify special characters in the MT940 format, for example ?/!/>, etc.
My search pattern: \C*######\C*
The text that I use to identify the mapping:
:86:306?00CCY RECD?20/BI/**?651234?**/BO/DE652004ED
In this case, I defined:
\C* as to look for special characters - this will be skipped based on the mapping.
# to look for a sequence of 6 numbers.
My results from the test:
1 651234
2 652004
3 651234
4 652004
The result I look to achieve based on the search pattern defined: 651234
I do understand that the reason for having the repetition is because of the * symbol. However, if I skip adding that symbol, the search pattern will end up in error.
My problem is that I cannot seem to understand how can I translate special characters to be identified by the SAP Search Strings? Furthermore, how can I identify if it is a letter?
Below is the Search String definition from the SAP documentation of SPRO activity "Define Search String for Electronic Bank Statement":
String for searches in text. A search string consists of normal characters (that is, letters and digits) and other characters:
| Or
( ) Grouping
+ Repeats the previous character once or several times
* "Zero" or repeats the previous character several times
? Any individual character you want
# Any of the digits 0 to 9
^ Start of a line
$ End of a line
\ Escape symbol
Examples:
The search string "ab" fits each position in a character string in which the letter "b" follows the letter "a".
The search string "(A+|B)+C" "AC", "BC", "AAAAAC" or "ABAAC".
"(A+|B+)C fits "AC", "BC" and "AAAAAC", but not "ABAAC".
"\*C" fits "*C"; the effect of the escape symbol is that "*" is not interpreted as a special character.
This is the first time I raise a question, therefore, I want to apologize if the format is not correct or the text was too long.
Many thanks for your time and help!

How to match the following?

The data I want to parse has columns with the following format:
Character Big Medium Meaning ImageCode Small Constitutens Lesson Frame Strokes JH JTPL Heisig Story koohiiStory1 koohiiStory2 On-Reading Kun-Reading Examples:
All of those are separated by tabs \t (even though it may not look like it on the browser). Also notice at the end of each line there is a colon :. The problem is that the columns koohiiStory2 and examples may or may not exist and there may also be cases in which the data is corrupt and there is a tab inside Heisig Story but those are the minority.
What I'm trying to match is the values for On-Reading, Kun-Reading and Examples. All of these are distinct from the rest because they don't use standard english characters (romaji) but they use japanese characters instead with the exception of perhaps a few commas or dots. It is also guaranteed that either Kun-Reading or Examples will end with a colon : and that On-Reading and Kun-Reading will exist and that all three of the columns will be consecutive.
Here is some sample data.
How can I parse that to return this?
Alright, I'll give it a shot.
Since the content you expect is mostly non-ascii characters within a dot + space or tab* and :
(?<=\.(\s|\t)) // Positive lookbehind for a 'dot' + 'space or tab'
[^\w]+ // Any non words
(?=\:) // Positive lookahead for a ':'
Working sample on regex101

Regex query: how can I search PDFs for a phrase where words in that phrase appear on more than one line?

I am trying to set up an index page for the weekly magazine I work on. It is to show readers the names of
companies mentioned in that weeks' issue, plus the page numbers they are appear on.
I want to search all the PDF files for the week, where one PDF = one magazine page (originally made in
Adobe InDesign CS3 and Adobe InCopy CS3).
I have set up a list of companies I want to search for and, using PowerGREP and using delimited regular
expressions, I am able to find most page numbers where a company is mentioned. However, where a
company name contains two or more words, the search I am running will not pick up instances where the
name appears over more than one line.
For example, when looking for "CB Richard Ellis" and "Cushman & Wakefield", I got no result when the
text appeared like this:
DTZ beat BNP PRE, CB [line break here]
Richard Ellis and Cushman & [line break here]
Wakefield to secure the contract. [line end here]
Could someone advise me on how to write a regular expression that will ignore white space between
words and ignore line endings OR one that will look for the words including all types of white space (ie uneven
spaces between words; spaces at the end of lines or line endings; and tabs (I am guessing that this info is
imbedded somehow in PDF files).
Here is a sample of the set of terms I have asked PowerGREP to search for:
\bCB Richard Ellis\b
\bCB Richard Ellis Hotels\b
\bCentaur Services\b
\bChapman Herbert\b
\bCharities Property Fund\b
\bChetwoods Architects\b
\bChurch Commissioners\b
\bClive Emson\b
\bClothworkers’ Company\b
\bColliers CRE\b
\bCombined English Stores Group\b
\bCommercial Estates Group\b
\bConnells\b
\bCooke & Powell\b
\bCordea Savills\b
\bCrown Estate\b
\bCushman & Wakefield\b
\bCWM Retail Property Advisors\b
[Note that there is a delimited hard return between each \b at the end of each phrase and beginnong of the next phrase.]
By the way, I am a production journalist and not usually involved in finding IT-type solutions and am
finding it difficult to get to grips with the technical language on the PowerGREP site.
Thanks for assistance
Alison
You have hard-coded spaces in your names. Replace them with \s+ and you should be OK.
E.g.:
CB\s+Richard\s+Ellis
What's happening is, when you have a forced line break it doesn't have that space (" ") character anymore. Instead it has \n or \r\n. Using \s+ means that you are looking for any whitespace character, including carriage-returns and linefeeds, in quantity of one or more.
The regex for matching spaces is \s, so it would be
\bCB\s+Richard\s+Ellis\b
(\s+ = match at least one whitespace). Line breaks are \n (newline) and \r (return), depending on your OS. So form a group using [] including all [\r\n\s] would result in:
\bCB[\r\n\s]+Richard[\r\n\s]+Ellis\b