Extracting parts of words from a file

Extracting parts of words from a file - list

I have a list, "list A," containing tens of thousands of entries (4 example entries shown below). I'd like to create from "list A" another list, "list B." I need each entry of "list B" to contain only the 1st 4 (out of 5) characters of the 1st "word" following the ">" character that is at the start of each entry in "list A."
So, for example, I'd like a "list B" that looks like this:
2JUG
3JU9
1JU8
3JUE
I am new to script writing, and would appreciate any help you can offer. The closest I got to solving my problem was printing the 1st column, but that gave me all 5 characters of my 1st "word," plus the long string of letters in the next line. I am brand new to script writing, so if possible, please try to give me the "for dummies" version of your explanations. Thank you!
example entries from "List A" below
2JUGA 78 NMR NA NA NA no TubC protein [ANGIOCOCCUS DISCIFORMIS] || 2JUGB
GPLGSSAGALLAHAASLGVRLWVEGERLRFQAPPGVMTPELQSRLGGARH
ELIALLRQLQPSSQGGSLLAPVARNGRL
3JU9A 237 XRAY 2.10 0.207 0.253 no Concanavalin-Br [CANAVALIA BRASILIENSIS] || 1AZDA 1AZDB 1AZDC 1AZDD 4H55A
ADTIVAVELDTYPNTDIGDPSYPHIGIDIKSVRSKKTAKWNMQNGKVGTA
HIIYNSVGKRLSAVVSYPNGDSATVSYDVDLDNVLPEWVRVGLSASTGLY
KETNTILSWSFTSKLKSNSTHETNALHFMFNQFSKDQKDLILQGDATTGT
EGNLRLTRVSSNGSPQGSSVGRALFYAPVHIWESSAVVASFEATFTFLIK
SPDSHPADGIAFFISNIDSSIPSGSTGRLLGLFPDAN
1JU8A 37 NMR NA NA NA no Leginsulin [NA]
ADCNGACSPFEVPPCRSRDCRCVPIGLFVGFCIHPTG
3JUEA 368 XRAY 2.30 0.203 0.219 no ARFGAP with coiled-coil, ANK repeat and PH domain-containing protein 1 [HOMO SAPIENS] || 3JUEB
GPLGSGSGHLAIGSAATLGSGGMARGREPGGVGHVVAQVQSVDGNAQCCD
CREPAPEWASINLGVTLCIQCSGIHRSLGVHFSKVRSLTLDSWEPELVKL
MCELGNVIINQIYEARVEAMAVKKPGPSCSRQEKEAWIHAKYVEKKFLTK
LPEIRGRRGGRGRPRGQPPVPPKPSIRPRPGSLRSKPEPPSEDLGSLHPG
ALLFRASGHPPSLPTMADALAHGADVNWVNGGQDNATPLIQATAANSLLA
CEFLLQNGANVNQADSAGRGPLHHATILGHTGLACLFLKRGADLGARDSE
GRDPLTIAMETANADIVTLLRLAKMREAEAAQGQAGDETYLDIFRDFSLM
ASDDPEKLSRRSHDLHTL

Here is the solution - pretty straightforward, but will do the job assuming the input provided:
Sub NMRData()
With Selection.Find
.Text = "^p>"
.Replacement.Text = "###>"
.Forward = True
.Wrap = wdFindContinue
.Format = False
.MatchCase = False
.MatchWholeWord = False
.MatchWildcards = False
.MatchSoundsLike = False
.MatchAllWordForms = False
End With
Selection.Find.Execute Replace:=wdReplaceAll
With Selection.Find
.Text = "^p"
.Replacement.Text = ""
.Forward = True
.Wrap = wdFindContinue
.Format = False
.MatchCase = False
.MatchWholeWord = False
.MatchWildcards = False
.MatchSoundsLike = False
.MatchAllWordForms = False
End With
Selection.Find.Execute Replace:=wdReplaceAll
With Selection.Find
.Text = "###>"
.Replacement.Text = "^p>"
.Forward = True
.Wrap = wdFindContinue
.Format = False
.MatchCase = False
.MatchWholeWord = False
.MatchWildcards = False
.MatchSoundsLike = False
.MatchAllWordForms = False
End With
Selection.Find.Execute Replace:=wdReplaceAll
For k = ThisDocument.Paragraphs.Count To 1 Step -1
Set oPara = ThisDocument.Paragraphs(k)
oPara.Range.Text = Left(oPara.Range.Text, 5) & vbNewLine
Next k
ThisDocument.SaveAs FileName:="listB.docx", FileFormat:=wdFormatXMLDocument
End Sub
Desired output will be saved in the same folder as a new DOCX file.
To run the code, press ALT+F11, and then F5 - via VBA interface, or press ALT+F8 to select and run a Macro by name.
Sample DOCM with ready-to-go code: https://www.dropbox.com/s/6zt4nfn7rt8eqc7/NMRDataListA.docm
P.S. this is my very 1st Word-VBA experience)

Related

Word VBA use Regex for formatting change

I have a word document that has SQL code in it. I have great code that finds specific terms I didn't realize that I had so many different terms in my document. I stopped at over 140 terms. I tried to add regex but I did it wrong. Would it be possible to have it find the start word SELECT and the end word FROM and change the background color to wdBrightGreen for everything between those terms?
I did not put all the fields in I stopped at three. Is there a way to put the ArrFind to search the whole document instead of specific terms?
Dim ArrFnd As Variant
Dim i As Long
Dim x As Long
Dim y As Long
Dim regexObject As RegExp
Dim strField1(1 To 200) As String
Dim strField2(1 To 20) As String
strField1(1) = "SELECT DISTINCT policy_id"
strField1(2) = "SELECT DISTINCT policy_id"
strField1(3) = "SELECT DISTINCT P.policy_id"
For x = 1 To 150
ArrFnd = Array(strField1(x))
With ActiveDocument
For i = 0 To UBound(ArrFnd)
With .Range
With .Find
.ClearFormatting
.Replacement.ClearFormatting
.Format = False
.Forward = True
.Wrap = wdFindStop
.MatchCase = True
.MatchWholeWord = True
.Text = ArrFnd(i)
With regexObject
.Pattern = "[SELECT-FROM]"
End With
.Replacement.Text = ""
End With
Do While .Find.Execute
.Font.Shading.BackgroundPatternColorIndex = wdBrightGreen
.Collapse wdCollapseEnd
Loop
End With
Next i
End With
Next x

One wonders why you persist with the approach taken in your code, in light of: MS Word VBA Can't select specific words to change background color
Regardless, for what you are now describing, all you need is:
Sub HiliteWords()
Application.ScreenUpdating = False
With ActiveDocument.Range
With .Find
.ClearFormatting
.Replacement.ClearFormatting
.Format = False
.Forward = True
.Wrap = wdFindStop
.MatchWildcards = True
.Text = "SELECT[!^13^l]#FROM"
.Replacement.Text = ""
End With
Do While .Find.Execute
.Start = .Words.First.End
.End = .Words.Last.Start - 1
.Font.Shading.BackgroundPatternColorIndex = wdBrightGreen
.Collapse wdCollapseEnd
Loop
End With
Application.ScreenUpdating = True
End Sub
If you also don't want DISTINCT shaded, insert:
If .Words.First.Text = "DISTINCT " Then .Start = .Words.First.End
before:
.End = .Words.Last.Start - 1

VBA Word: Regex replace capture group does not highlight

I cannot figure out why the code is not working. I want to find within a cell (XXX) regex match and highlight each occurrence. Tried a lot of things.
rangeFinding.SetRange Start:=rngStart.Start, End:=rngEnd.End
rangeFinding.Select
With Selection
.Tables(1).cell(2, 2).Range.Text = y
End With
Dim qq As Range
Set qq = GetCellTextRange(Selection.Tables(1).cell(2, 2))
Options.DefaultHighlightColorIndex = wdYellow
With qq.Find
.ClearFormatting
.Text = "(\(*\))"
.MatchWildcards = True
.Replacement.ClearFormatting
.Replacement.Text = "\1"
.Replacement.Highlight = True
.Replacement.Font.Color = vbRed
.Execute Replace:=wdReplaceAll
End With
qq.HighlightColorIndex = wdYellow
End If
Procedure that is called:
Function GetCellTextRange(cl As cell) As Range ' ' GetCellTextRange
returns the content range of a table cell '
Dim rng As Range
Set rng = cl.Range
rng.MoveEnd wdCharacter, -1
Set GetCellTextRange = rng
End Function
Changing font color and/or bold does work, however highlighting not. Even when I replace \1 with YYY it works. But highlighting does not get applied, either when I use .Replacement.Highlight or alternative qq.HighlightColorIndex = wdYellow.
BTW: does the Find (and its Execute) function adapt qq range? That is what I noticed, but the docs do not tell it clearly.

I ran into the same problem with wildcard search/replace in VBA code not highlighting text. I was able to work around it by explicitly setting the default highlight colour before running the wildcard search/replace, and then setting it back to whatever it was afterward, e.g.:
currentHighlightColour = Options.DefaultHighlightColorIndex
Options.DefaultHighlightColorIndex = wdYellow
With ActiveDocument.Range.Find
.ClearFormatting
.replacement.ClearFormatting
.Text = "(\{\{placeholder: *\}\})"
.Format = True
.replacement.Text = "\1"
.Forward = True
.matchWildcards = True
.matchCase = False
.replacement.Highlight = True
.Wrap = wdFindContinue
.Execute Replace:=wdReplaceAll
End With
Options.DefaultHighlightColorIndex = currentHighlightColour
Seems like a bug in Word, but at least there's a workaround.

Find words with more than one capital letter in word/VBA

I have a piece of VBA code that uses Find to find all the acronyms in a document. It does this by searching for all words consisting of capital letters that are 2 or more characters long using...
<[A-Z]{2,}>
The problem with this is it doesn't pick up all the acronyms, such as CoP, W3C, DVDs and CD-ROM. It picks up hyphenated acronyms in two parts which are not ideal but allowable as the list is checked by a user. I can also pick up words that end with an "s" or other characters by not searching to the end of the word using...
<[A-Z]{2,}
But this doesn't count any non-upper case character as part of the word it finds.
Is there an expression that would allow me to search for words with two or more capital letters in any location and find the whole word?

I don't think it's possible to 'search for words with two or more capital letters in any location and find the whole word' except in combination with macro code. Since you're using a macro, anyway, here's an approach that worked for me using the following sample text
CoP, this That and AnoTher thing W3C, DVDs and CD-ROM
and this wildcard combination (note that the list separator in my Windows configuration is ;, for other regions a , may be required).
<[A-Z][0-9A-Z\-a-z]{1;10}>
The following function checks whether the second or any later letter in the "found" range is capitalized and returns a boolean to the calling procedure. It loops through the characters in the given Range, checking the ASCII value. As soon as one is found, the loop exits.
Function ContainsMoreThanOneUpperCase(rng As Word.Range) As Boolean
Dim nrChars As Long, i As Long
Dim char As String
Dim HasUpperCase
HasUpperCase = False
nrChars = rng.Characters.Count
For i = 2 To nrChars
char = rng.Characters(i).text
If Asc(char) >= 65 And Asc(char) <= 90 Then
'It's an uppercase letter
HasUpperCase = True
Exit For
End If
Next
ContainsMoreThanOneUpperCase = HasUpperCase
End Function
An example for using it:
Sub FindAcronyms()
Dim rngFind As Word.Range
Dim bFound As Boolean
Set rngFind = ActiveDocument.content
With rngFind.Find
.text = "<[A-Z][0-9A-Z\-a-z]{1;10}>"
.MatchWildcards = True
.Forward = True
.wrap = wdFindStop
bFound = .Execute
Do While bFound
If bFound And ContainsMoreThanOneUpperCase(rngFind) Then
Debug.Print rngFind.text
rngFind.HighlightColorIndex = wdBrightGreen
End If
rngFind.Collapse wdCollapseEnd
bFound = .Execute
Loop
End With
End Sub

You can't do this in one pass of Find/Replace. You also have to make some allowances for what the Word application considers a Word and then where the acronym is located in the sentence or paragraph.
The following code should provide an idea for how you might do it with a combination of Wildcard searching and then additional VBA string manipulation.
It is setup to deal with words that start with capital letters, you will need to carry it further and add code and wildcard search criteria for words that start with lowercase letters if you expect to have any of those.
Sub FindAcronynms()
Dim rng As word.Range
Set rng = ActiveDocument.Content
With rng.Find
.ClearFormatting
.Format = False
.Forward = True
.MatchWildcards = True
.Text = "<[A-Z]{1,}[a-z][A-Z]>"
.Wrap = wdFindStop
.Execute
Do While .found
MoveEndOfString rng
rng.HighlightColorIndex = wdTeal
rng.Collapse wdCollapseEnd
.Execute
Loop
End With
Set rng = ActiveDocument.Content
With rng.Find
.ClearFormatting
.Format = False
.Forward = True
.MatchWildcards = True
.Text = "[A-Z]{1,5}[0-9][A-Z]{1,5}"
.Wrap = wdFindStop
.Execute
Do While .found
MoveEndOfString rng
rng.HighlightColorIndex = wdTeal
rng.Collapse wdCollapseEnd
.Execute
Loop
End With
Set rng = ActiveDocument.Content
With rng.Find
.ClearFormatting
.Format = False
.Forward = True
.MatchWildcards = True
.Text = "<[A-Z]{2,}>"
.Wrap = wdFindStop
.Execute
Do While .found
MoveEndOfString rng
rng.HighlightColorIndex = wdTeal
rng.Collapse wdCollapseEnd
.Execute
Loop
End With
MsgBox "Action Complete", vbExclamation, "Custom Find"
End Sub
Private Function MoveEndOfString(ByRef rng As word.Range)
rng.MoveEnd wdCharacter, 1
Select Case Asc(rng.Characters.Last)
Case Is <= 32
rng.MoveEnd wdCharacter, -1
Case 45
rng.MoveEnd wdCharacter, 1
rng.MoveEnd wdWord, 1
If Asc(rng.Characters.Last) = 32 Then
'required because move above includes
'trailing space
rng.MoveEnd wdCharacter, -1
End If
End Select
End Function

You might use something like:
Sub Demo()
Application.ScreenUpdating = False
Options.DefaultHighlightColorIndex = wdPink
With ActiveDocument.Range
With .Find
.ClearFormatting
.Replacement.ClearFormatting
.Replacement.Highlight = True
.Forward = True
.Format = True
.Wrap = wdFindContinue
.MatchWildcards = True
.Text = "<[A-Z][A-Z0-9/-]{1,}"
.Replacement.Text = "^&"
.Execute Replace:=wdReplaceAll
.Text = "<[A-Z][A-Za-z0-9/-]#[A-Z]"
.Replacement.Text = "^&"
.Execute Replace:=wdReplaceAll
End With
End With
Application.ScreenUpdating = True
End Sub

Remove Hebrew vowels (nikkud) from selected Unicode Hebrew text

I want to select a string of Unicode Hebrew text in a Word document and remove the Hebrew vowels (aka nikkud) without changing anything else.
I need to remove Unicode characters in a given range from the selected text. The Unicode characters I want to remove are U+0591-U+05BD, U+05BF-U+05C2, and U+05C4-U+05C7.
I found a way to remove the Hebrew vowels from a Unicode text string using the REGEXREPLACE function in Google Sheets (thank you GitHub). E.g:
=REGEXREPLACE(B1,"[(\x{0591}-\x{05BD})OR(\x{05BF}-\x{05C2})OR(\x{05C4}-\x{05C7})]","")
where cell B1 contains the original Hebrew text with vowels, and the function outputs the identical text with the vowels removed. The Unicode range used there permits me to leave two characters that need to remain (U+05BE and U+05C3).
Using that method, I can copy a Hebrew text string, e.g., אָמַר יְהוָה, paste it into my Google Sheet, and then copy the output, אמר יהוה, and paste it over the original text. This is much slower than a macro in Word would be (there are hundreds of these Hebrew text strings that need to be fixed). The majority of the document is in English, with snippets of Hebrew, so I don't need a solution for converting a whole document.
A bit of searching suggests to me that a similar RegEx replace function exists for Word VBA, but I don't have sufficient programming knowledge to adapt this to my own needs.

You can try this Macro. Be warned, it's very slow on my end:
Sub RemoveHebrewVowels()
Dim Word As Range
Dim Words As Variant
Dim WildcardCollection(3) As String
Rem [(\x{0591}-\x{05BD}]
WildcardCollection(0) = "[" & ChrW(1425) & "-" & ChrW(1469) & "]{1;}"
Rem [\x{05BF}-\x{05C2}]
WildcardCollection(1) = "[" & ChrW(1471) & "-" & ChrW(1474) & "]{1;}"
Rem [\x{05C4}-\x{05C7}]
WildcardCollection(2) = "[" & ChrW(1476) & "-" & ChrW(1479) & "]{1;}"
'Options.DefaultHighlightColorIndex = wdYellow
'Clear existing formatting and settings in Find
Selection.Find.ClearFormatting
Selection.Find.Replacement.ClearFormatting
'Selection.Find.Replacement.Highlight = True
'Cycle through document and find wildcards patterns, replace when found
For Each Word In ActiveDocument.Words
For Each WildcardsPattern In WildcardCollection
With Selection.Find
.Text = WildcardsPattern
.Replacement.Text = ""
.Forward = True
.Wrap = wdFindContinue
.Format = False
.MatchCase = False
.MatchWholeWord = False
.MatchWildcards = True
.MatchSoundsLike = False
.MatchAllWordForms = False
End With
Selection.Find.Execute Replace:=wdReplaceAll
Next
Next
End Sub

You can install notepad++ and do a find and replace operation using regex mode using this regex after pasting your whole input.
[\x{0591}-\x{05BD}\x{05BF}-\x{05C2}\x{05C4}-\x{05C7}]
Before:
After:
Then you can automate the copy/paste operation using AutoHotkey for example
If you want to keep the formatting information this is not a problem neither.
Just do the following operations:
Save your file in Word XML Document (Save as>Save as type: Word XML Document (*.xml)
Take a copy of this file and open it with Notepad++ (you have either to take a copy of the file or to close Word otherwise you can not open it in write mode)
Apply the find and replace described in the beginning of the explanations and save the file.
Reopen the file with Word and save it .docx for example

Thanks, everyone. Building on several of these suggestions, I put together the following macro, which seems to be working perfectly. There may be a more elegant way to write this (wp78de's macro seems more consolidated, but it didn't work for me).
Sub HebrewDevocalizer()
With Selection.Find
.ClearFormatting
.Replacement.ClearFormatting
.Text = "[" & ChrW(1425) & "-" & ChrW(1469) & "]"
.Replacement.Text = ""
.Forward = True
.Wrap = wdFindStop
.Format = False
.MatchCase = False
.MatchWholeWord = False
.MatchKashida = False
.MatchDiacritics = False
.MatchAlefHamza = False
.MatchControl = False
.MatchAllWordForms = False
.MatchSoundsLike = False
.MatchWildcards = True
End With
Selection.Find.Execute Replace:=wdReplaceAll
With Selection.Find
.ClearFormatting
.Replacement.ClearFormatting
.Text = "[" & ChrW(1471) & "-" & ChrW(1474) & "]"
.Replacement.Text = ""
.Forward = True
.Wrap = wdFindStop
.Format = False
.MatchCase = False
.MatchWholeWord = False
.MatchKashida = False
.MatchDiacritics = False
.MatchAlefHamza = False
.MatchControl = False
.MatchAllWordForms = False
.MatchSoundsLike = False
.MatchWildcards = True
End With
Selection.Find.Execute Replace:=wdReplaceAll
With Selection.Find
.ClearFormatting
.Replacement.ClearFormatting
.Text = "[" & ChrW(1476) & "-" & ChrW(1479) & "]"
.Replacement.Text = ""
.Forward = True
.Wrap = wdFindStop
.Format = False
.MatchCase = False
.MatchWholeWord = False
.MatchKashida = False
.MatchDiacritics = False
.MatchAlefHamza = False
.MatchControl = False
.MatchAllWordForms = False
.MatchSoundsLike = False
.MatchWildcards = True
End With
Selection.Find.Execute Replace:=wdReplaceAll
End Sub

Anyone who needs to use this in a software script (Python 3), you can do
import re
re.sub(r'[\u0591-\u05BD\u05BF-\u05C2\u05C4-\u05C7]', '', 'אֱלֹהִים')

BS"D
Do a "Save as" to other format- Hebrew DOS text.
Reload the file in Word and you will see that a question mark has replaced each nikud.
Do a global change (cntrl H) of '?' to null.
All done

Why is this VB program always returning 1 as page number?

This program finds certain words in a MS Word document using the RegExp method. Once each match is found, the program is supposed to find the page of each match and create a string that can be output to show all pages where that keyword match was found. The way it's written now, for some reason it's outputting a "1" for each page no matter what page it's found on. For example, if the word "Mouse" was found on page 1, 5, and 22, it would output 1, 1, 1,.
For Each Match In RegExp.Execute(oWord.ActiveDocument.Range.Text)
myKeyWords(numKeywords) = Match.Value
PageNumbers(numKeywords) = ""
With myWordDoc.ActiveDocument.Range.Find
.ClearFormatting()
.Text = Match.value
.Wrap = False
.Forward = True
Do While .Execute = True
If PageNumbers(numKeywords) = "" Then
PageNumbers(numKeywords) = oWord.Selection.Information(Word.WdInformation.wdActiveEndAdjustedPageNumber)
Else
PageNumbers(numKeywords) = PageNumbers(numKeywords) & ", " & oWord.Selection.Information(Word.WdInformation.wdActiveEndAdjustedPageNumber)
End If
Loop
End With

The numKeywords isn't incrementing. I've written it out without the
For Each Match In RegExp.Execute(oWord.ActiveDocument.Range.Text)
Here it is, be aware that if you take this function outside of the regexp method that you'll need to change the .Text = myKeywords(x) and remove the .range before find.
PageNumbers(numKeywords) = ""
For x = LBound(myKeywords) To UBound(myKeywords)
PageNumbers(x) = ""
With myWordDoc.ActiveDocument.Find
.ClearFormatting()
.Text = myKeywords(x)
.Wrap = False
.Forward = True
Do While .Execute = True
If PageNumbers(numKeywords) = "" Then
PageNumbers(numKeywords) = oWord.Selection.Information(Word.WdInformation.wdActiveEndAdjustedPageNumber)
Else
PageNumbers(numKeywords) = PageNumbers(numKeywords) & ", " & oWord.Selection.Information(Word.WdInformation.wdActiveEndAdjustedPageNumber)
End If
Loop
End With

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Extracting parts of words from a file - list

Related

Word VBA use Regex for formatting change

VBA Word: Regex replace capture group does not highlight

Find words with more than one capital letter in word/VBA

Remove Hebrew vowels (nikkud) from selected Unicode Hebrew text

Why is this VB program always returning 1 as page number?

Categories

Resources