Unable to find string using Regex in Word 2010 - regex

I am trying to find a string in Word. I can see 3 occurrences of the string in the document; however, the remaining 600+ are not visible.
I'm trying to search using (this is the regex in the external tool I used initially):
(ABC-\d+)
Using a tool to search in Word I searched for
(ABC.*)
and all of the results ended up being some form of the following:
ABCNormal -13
I don't have a clue how to find out what that even means in this context.
I tried searching IN Word for the following REGEX and it doesn't find any except the 3 that don't have the "normal " thing.
ABC?#[0-9]#
That should mean: look for ABC, then some number of characters, then some number of digits.
I have tried turning on hidden text etc. in the display options, the paragraph icon on the ribbon, anything I can think of.
Any ideas how to figure out how to SEE what this is, and either fix it, or work around it?
In the external tool, [(ABC)[^0-9]+(\d+)] finally worked, but I still don't understand how to remove the "Normal " text that is in the string but is NOT visible.
For example the string I visibly see
ABC-13
the text Regex is seeing is
ABCNormal -13
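As a rough illustration of the workaround from the question, the pattern that finally worked can also be used to rebuild the visible form of the string. This is only a sketch: it operates on text already extracted from the document (not on the Word file itself), the function name is made up for the example, and it assumes the visible separator is always a single hyphen.

```python
import re

# Pattern from the question: "ABC", then any run of non-digits, then the number.
PATTERN = re.compile(r"(ABC)[^0-9]+(\d+)")


def normalize(text: str) -> str:
    """Rewrite each match as ABC-<number>, dropping whatever sat in between.

    Assumption: the separator that should remain is always a single hyphen.
    """
    return PATTERN.sub(r"\1-\2", text)


if __name__ == "__main__":
    print(normalize("ABCNormal -13"))
    print(normalize("ABC-13"))
```

This does not explain where the hidden "Normal " text comes from; it only shows how to work around it once the text has been pulled out of Word.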

Related

What is the name for the type of "loose" search/filter that allows for other characters in the middle?

I'm not sure which community this belongs in, feel free to suggest a better one if this doesn't fit here.
In Visual Studio Code, when searching for a file, you can CMD/Ctrl + P to bring up the Quick Open search box for finding a file by name. The search doesn't have to be the exact name and it filters as long as the search query contains the characters in that order, while being "loose" enough to ignore any characters between those.
Example:
Searching "cat" would show the following:
bigcat.txt
cat.txt
candlelight.txt
In the above, every string contains the characters of "cat" in that order, even if there are other characters between them. The regex would probably be something like /.*c.*a.*t.*/.
Is there a name for this type of search/filter?
Fuzzy Filter/Search
After looking through VS Code's GitHub issues list, I found an issue that mentioned it.
I also found a node module that does this exact same thing.
There is also a Wikipedia entry on Approximate String Matching, which is similar to the above.
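The "characters in order, anything in between" behaviour described in the question amounts to a subsequence test. Below is a minimal sketch of that idea; the function names are invented for the example, and real fuzzy finders such as the one in VS Code also score and rank the results, which this does not attempt.

```python
def is_subsequence(query: str, candidate: str) -> bool:
    """Return True if the characters of query appear in candidate in order."""
    remaining = iter(candidate.lower())
    # "ch in remaining" advances the iterator, so each query character
    # must be found after the position of the previous one.
    return all(ch in remaining for ch in query.lower())


def fuzzy_filter(query: str, names: list[str]) -> list[str]:
    """Keep only the names that contain query as a subsequence."""
    return [name for name in names if is_subsequence(query, name)]


if __name__ == "__main__":
    files = ["bigcat.txt", "cat.txt", "candlelight.txt", "dog.txt", "tac.txt"]
    print(fuzzy_filter("cat", files))
```

With the file names from the question, the first three entries pass the filter, while "dog.txt" and "tac.txt" do not.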

How to combine regex and find in a shell script properly

I am trying to write a shell script which can take numbers from a text document and use these numbers to search for all pictures that include the numbers in their name.
I am working with find and I got it to kind of work. If the name of the picture is exactly the same as the name in the text document, or if the name of the picture ends with whatever number is written in the text document, it works. But if the number is in the middle of the name of the picture, it doesn't find it. So I have been trying to add regex to my find command, but I haven't been successful.
input="/Users/unix/Desktop/pictures.txt"
input_2="/Users/unix/Desktop/2019/05/23"
while IFS= read -r -u3 line
do
find "$input_2" -iregex ".*${line}*.jpg"
done 3< "$input"
For example, if the picture name is Right.jpg and my pictures.txt contains Right, it will find the file. If the picture is called leftRight.jpg, it will also find the file. But if it's something like leftRightleft.jpg, it won't find the picture, so I am a bit confused about how to use regex properly here.
Your regex is simply incorrect. If you break it down, it makes intuitive sense why:
.*${line}*.jpg
means:
.* -- any character repeated 0 or more times
${line}* -- the contents of ${line}, with the last character repeated 0 or more times
. -- any single character
jpg -- the literal characters jpg
So with your example, if you have Right in your file, you'd match actual files like these, which you probably don't want to match:
leftRigh.jpg
leftRighXjpg
leftRighttttttttt.jpg
leftRighttttttttttttttttttttttttttttttttjpg
What you probably want is:
.*${line}.*\.jpg
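To see the difference between the two patterns without touching the file system, here is a small sketch that checks both against the file names mentioned above. It uses Python's re module rather than find, so it only approximates find's regex dialect; re.fullmatch with IGNORECASE stands in for -iregex, which must match the whole path. The re.escape call is an addition not present in the answer: it guards against lines in pictures.txt that contain regex metacharacters.

```python
import re


def matches(pattern: str, filename: str) -> bool:
    """Approximate find -iregex: whole-string, case-insensitive match."""
    return re.fullmatch(pattern, filename, re.IGNORECASE) is not None


line = "Right"

# Pattern from the question: the "*" applies only to the final "t",
# and the unescaped "." matches any character.
original = f".*{line}*.jpg"

# Corrected pattern from the answer, with the line escaped (an added precaution).
fixed = rf".*{re.escape(line)}.*\.jpg"

if __name__ == "__main__":
    names = ["Right.jpg", "leftRight.jpg", "leftRightleft.jpg",
             "leftRigh.jpg", "leftRighXjpg"]
    for name in names:
        print(name, matches(original, name), matches(fixed, name))
```

The original pattern accepts leftRigh.jpg and leftRighXjpg but rejects leftRightleft.jpg; the corrected one does the opposite.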

How to find and replace box character in text file?

I have a large text file that I'm going to be working with programmatically, but I have run into problems with a special character strewn throughout the file. The file is far too large to scan by eye for specific characters. I've been able to get rid of most of the other unwanted special characters using regex patterns. But there is a box character, similar to "□". When I tried to copy the character from the actual text file and paste it here, I got "�", so the example of the box is from the Windows Character Map, which gives the code 'U+25A1'. I'm not sure how to interpret that or whether it's something I could use in a regex search.
Would anyone know how I could search for the box symbol similar to "□" in a UTF-8 encoded file?
EDIT:
Here is an example from the text file:
"� Prune palms when flower spathes show, or delay pruning until after the palm has finished flowering, to prevent infestation of palm flower caterpillars. Leave the top five rows."
The only problem is that, as mentioned in the original post, the square gets converted into a diamond question mark.
It's unclear where and how you are searching, although you could use the hex equivalent:
\x{25A1}
Example:
https://regex101.com/r/b84oBs/1
The black diamond with a question mark is not a character, per se. It is what a browser spits out at you when you give it unrecognizable bytes.
Find out where that data is coming from.
Determine its encoding. (Usually UTF-8, but might be something else.)
Be sure the browser is configured to display that encoding. Putting <meta charset=UTF-8> in the header of the page is likely to suffice.
I found a workaround using Notepad++ and this website. It's still not clear what encoding system the square is originally from, but when I post it into the query field in the website above or into the Notepad++ Conversion Table (Plugins > Converter > Conversion Table) it gives the hex-character code for the "Replacement Character" which is the diamond with the question mark.
Using this code in a regex, \x{FFFD}, within the Notepad++ search gave me all the squares, although it recognizes them as the Replacement Character.
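The same search can be sketched in Python, which writes these code points as \u25A1 and \uFFFD rather than \x{25A1} and \x{FFFD}. The example also shows one way a U+FFFD can appear: decoding bytes that are not valid UTF-8 with errors="replace". The byte 0x95 used here is a made-up illustration, not something known about the original file, and the function name is hypothetical.

```python
import re

# WHITE SQUARE (U+25A1) and REPLACEMENT CHARACTER (U+FFFD).
BOX_OR_REPLACEMENT = re.compile("[\u25A1\uFFFD]")


def find_suspects(text: str) -> list[tuple[int, str]]:
    """Return (offset, code point) for every box or replacement character."""
    return [(m.start(), f"U+{ord(m.group()):04X}")
            for m in BOX_OR_REPLACEMENT.finditer(text)]


# Hypothetical raw bytes: 0x95 on its own is not valid UTF-8, so decoding
# with errors="replace" turns it into U+FFFD.
raw = b"\x95 Prune palms when flower spathes show"
decoded = raw.decode("utf-8", errors="replace")

if __name__ == "__main__":
    print(find_suspects(decoded))
    print(BOX_OR_REPLACEMENT.sub("", decoded).strip())
```

If the file really was written in another encoding, decoding it with that encoding instead of UTF-8 would recover the original character rather than a replacement.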

EditPad: Need a regex that handles multiple possible data formats

First, I'm using EditPadPro for my regex cleaning, so any answers given should work within that environment.
I get a large spreadsheet full of data that I have to clean every day. I've managed to get it down to a couple of different regexes that I run, and this works... but I'm curious to see if it's possible to reduce down to a single regex.
Here is some sample data:
3-CPC_114851_70095_70095_CAN-bre
3-CPC_114851_70095_70095_CAN
b11-ao1-113775-bre
b7-ao-114441
b7-ao-114441-bre
b7-ao1-114441
b7-ao1-114441-bre
http://go.nlvid.com/results1/?http://bo
go.nlv/results1/?click
b4-sm-1359
b6-sm-1356-bre
1359_195_1453814569-bre
1356_104_1456856729
b15-rad-8905
b15-rad-8905-bre
Here is how the above data needs to end up:
114851-bre
114851
113775-bre
114441
114441-bre
114441
114441-bre
http://go.nlvid.com/results1/
go.nlv/results1/
sm-1359
sm-1356-bre
sm-1359-bre
sm-1356
rad-8905
rad-8905-bre
So, there are numerous rules, such as:
In cases of more than 2 underscores, the result needs to contain only the value immediately after the first underscore, and everything from the dash onwards.
In cases where the string contains "-ao-", "-ao1-", everything prior to the final numeric string should be removed.
If a question mark is present, everything from the mark onwards should be removed.
If the string contains "-sm-" or "-rad-", everything prior to those alpha strings should be removed.
If the string contains 2 underscores, everything after the first numeric string up to a dash (if present) should be removed, and the string "sm-" should be prepended.
Additionally there is other data that must be left untouched, including but not limited to:
113535|24905|24905
as well as many variations on this pattern of xxxxxx|yyyyy|zzzzz (and not always those string lengths)
This may be asking way too much of regex, I'm not sure as I'm not great with it. But I've seen some pretty impressive things done with it, so I thought I'd put this out to the community and see what you come back with.
Jonathan, I can wrap all of those into one regex, except the last one (where you prepend sm- to a string that does not contain sm). It is not possible in this context, because we cannot capture "sm" to reuse in the replacement, and because there is no "conditional replacement" syntax in EPP.
That being said, you can achieve what you want in EPP with two regexes and one macro to chain the two.
Here is how.
The solution below is tested in EPP.
Regex 1
Press Ctrl + Shift + F to enter Search / Replace mode
Enter the following Search and Replace in the appropriate boxes
At the top right of the Search bar, click the Favorite Searches pull-down, select "Add", give it a name, e.g. Regex 1
Search:
(?mx)^
(?=(?:[^_\r\n]*?_){3})[^_\r\n]+?_([^_\r\n]+)[^-\r\n]+(-[^\r\n]+)?
|
[^\r\n]*?-ao1?-\D*([^\r\n]+)
|
([^\r\n?]*)(?=\?)[^\r\n]+
|
[^\r\n]*?-((?:sm|rad)-[^\r\n]+)
Replace:
\1\2\3\4\5
Regex 2
Same 1-2-3 steps as above.
Search
^(?!(?:[^_\r\n]*?_){3})(?=(?:[^_\r\n]*?_){2})(\d+)(?:[^-\r\n]+(-[^\r\n]+)?)
Replace
sm-\1\2
Chaining Regex 1 and Regex 2
Top menu: Macros, Record Macro, give it a name.
Click the Favorite searches pulldown, select Regex 1
Hit Replace All.
Click the Favorite searches pulldown, select Regex 2
Hit Replace All.
Macros, Stop recording.
Whenever you want to do your sequence of replacements, pull it by name under the Macros menu.
Testing This
I have tested my "Jonathan macro" on your input. Here is the result:
114851-bre
114851
113775-bre
114441
114441-bre
114441
114441-bre
http://go.nlvid.com/results1/
go.nlv/results1/
sm-1359
sm-1356-bre
sm-1359-bre
sm-1356
rad-8905
rad-8905-bre
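For anyone who wants to try the two-pass approach outside EditPad Pro, here is a rough Python port of Regex 1 and Regex 2 applied in sequence. It is a sketch only: Python's re module is not EditPad's regex engine, the (?mx) prefix is replaced by explicit flags, and the behaviour is assumed (not guaranteed) to match for data beyond the sample in the question.

```python
import re

# Regex 1 from the answer; MULTILINE and VERBOSE replace the inline (?mx).
REGEX_1 = re.compile(r"""
^(?=(?:[^_\r\n]*?_){3})[^_\r\n]+?_([^_\r\n]+)[^-\r\n]+(-[^\r\n]+)?
|
[^\r\n]*?-ao1?-\D*([^\r\n]+)
|
([^\r\n?]*)(?=\?)[^\r\n]+
|
[^\r\n]*?-((?:sm|rad)-[^\r\n]+)
""", re.MULTILINE | re.VERBOSE)

# Regex 2 from the answer: exactly two underscores, prepend "sm-".
REGEX_2 = re.compile(
    r"^(?!(?:[^_\r\n]*?_){3})(?=(?:[^_\r\n]*?_){2})(\d+)"
    r"(?:[^-\r\n]+(-[^\r\n]+)?)",
    re.MULTILINE,
)


def clean(text: str) -> str:
    """Apply Regex 1, then Regex 2, as the recorded macro does."""
    # Groups that did not take part in a match are replaced by "".
    text = REGEX_1.sub(r"\1\2\3\4\5", text)
    return REGEX_2.sub(r"sm-\1\2", text)


SAMPLE = """\
3-CPC_114851_70095_70095_CAN-bre
b7-ao1-114441-bre
http://go.nlvid.com/results1/?http://bo
b6-sm-1356-bre
1359_195_1453814569-bre
b15-rad-8905
113535|24905|24905"""

if __name__ == "__main__":
    print(clean(SAMPLE))
```

The pipe-delimited line is included to check that data meant to stay untouched passes through both substitutions unchanged.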
Try this:
Toggle the Search Panel : SHIFT+CTRL+F
SEARCH: .*?((?:sm-|rad-)?(?:(?:\d+|[\w\.]+\/.*?))(?:-\w+)?$)
REPLACE: $1
Check REGEX and WORDS
Click Replace All or Hit CTRL+ALT+F3

Find Lines with N occurrences of a char

I have a txt file that I’m trying to import as flat file into SQL2008 that looks like this:
"123456","some text"
"543210","some more text"
"111223","other text"
etc…
The file has more than 300.000 rows and the text is large (usually 200-500 chars), so scanning the file by hand is very time consuming and prone to error. Other similar (and even more complex) files were successfully imported.
The problem with this one is that “some lines” contain quotes in the text… (this came from an export from an old SuperBase DB that didn’t let you specify a text qualifier; there’s nothing I can do with the file other than clean it and try to import it).
So the “offending” lines look like this:
"123456","this text "contains" a quote"
"543210","And the "above" text is bad"
etc…
You can see the problem here.
Now, 300.000 lines is not too many: if I could run a search in a text editor that supports regex, I’d manually remove the quotes from each offending line. The problem is not the number of offending lines, but the impossibility of finding them with a simple search. I’m sure there are fewer than 500, but spread those across a 300.000-line txt file and you know what I mean.
Based upon that, what would be the best regex I could use to identify these lines?
My first thought is: Tell me which lines contain more than 4 quotes (").
But I couldn’t come up with anything (I’m not good at Regex beyond the basics).
This pattern, ^("[^"]+){4,}, will match "lines containing more than 4 quotes".
You can experiment with replacing 4 with 5 or more, depending on your data.
I think that you can be more direct with a Regex than you're planning to be. Depending on your dialect of Regex, something like this should do it:
^"\d+",".*".*"
You could also use a regex to remove the outside quotes and use a better delimiter instead. For example, search for ^"([0-9]+)","(.*)"$ and replace it with \1+++++DELIM+++++\2.
Of course, this doesn't directly answer your question, but it might solve the problem.
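If a script is an option, the "more than 4 quotes" idea from the question can also be sketched without a regex at all, by counting quotes per line. The example below does that and cross-checks the result against the ^"\d+",".*".*" pattern suggested above. It assumes the file uses straight double quotes and two fields per line; the function name and sample rows are illustrative only.

```python
import re

# Pattern from the answer above: a numeric first field, then a second field
# that contains at least one quote besides its opening quote and a later one.
EXTRA_QUOTE = re.compile(r'^"\d+",".*".*"')


def offending_lines(lines, max_quotes=4):
    """Return (line number, line) for lines with more than max_quotes quotes."""
    return [(number, line)
            for number, line in enumerate(lines, start=1)
            if line.count('"') > max_quotes]


sample = [
    '"123456","some text"',
    '"543210","And the "above" text is bad"',
    '"111223","other text"',
]

if __name__ == "__main__":
    for number, line in offending_lines(sample):
        print(number, line, bool(EXTRA_QUOTE.match(line)))
```

For a real file, the lines could be read with open(path, encoding=...) and iterated directly, which reports line numbers that can then be looked up in the editor.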