Capture everything after one word [duplicate] - regex

This question already has an answer here:
Learning Regular Expressions [closed]
(1 answer)
Closed 6 years ago.
I am trying to make a regular expression capture any words in the specific line after the word Attachment:
This question is for work, so it is not a homework or test question. I took the paragraph below as an example. I did not major in computers but in Psychology, so this is completely foreign to me. I've read the manuals for the last two days, and because this is going over my head, I don't know how to begin.
I have a task which involves me linking the attachments to a specific file with the same name saved in a folder (at least 500 attachments) on Adobe PDF. What I did before was to manually select the words and link it to a specific file in a folder, but it is tedious to do when they can go up to 500 attachments.
I was aware of an application plug-in called EVERMAP that you can download for Adobe to automatically link specific words to a specific file in a folder. However, it requires me to use regular expressions which again, I don't know how to use.
I will bold the words I want to capture in the paragraph below.
The repetition operator manual expand the match as far as they, and only come back if they must to satisfy the remainder.
Attachment: The repetition operator manual
The asterisk or star tells the engine to attempt to match the preceding token zero or more times. The plus tells the engine to attempt to match the preceding token once or more.
Attachment: Asterisk and stars engine

Attachment: (.+) should work in your case unless there are other exceptions to this rule. The regex simply tells the parser to capture one or more characters after the word Attachment:.

Like @Kevin said, the regex is simple. Use Attachment: (.+).
Maybe you are confused about how to use regex. I don't know about the Evermap plugin, but you can copy all the text from the PDF into Sublime Text (a text editor for .txt files with a lot of extra features) and do the regex part there. Then, since you are not a programmer, you should remove the other irrelevant data. So the regex will be:
And replace it with:
\1 is a variable containing the value of the group captured in ()
In Sublime Text, open Find and Replace, then apply the regex there. Don't forget to turn on regex mode.
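For readers who would rather test the pattern in a script than in an editor, here is a minimal Python sketch of Attachment: (.+) applied to the sample paragraph from the question. The names TEXT, ATTACHMENT, and attachment_names are made up for illustration, and how Evermap itself consumes the expression is not covered here.

```python
import re

# Sample text taken from the question; each attachment name sits on its own line.
TEXT = """The repetition operator manual expand the match as far as they, and only come back if they must to satisfy the remainder.
Attachment: The repetition operator manual
The asterisk or star tells the engine to attempt to match the preceding token zero or more times. The plus tells the engine to attempt to match the preceding token once or more.
Attachment: Asterisk and stars engine
"""

# "." does not match a line break by default, so (.+) stops at the end of the line.
ATTACHMENT = re.compile(r"Attachment: (.+)")


def attachment_names(text: str) -> list[str]:
    """Return the text captured after each 'Attachment: ' marker."""
    return ATTACHMENT.findall(text)


print(attachment_names(TEXT))
```

Because the pattern has exactly one capturing group, findall returns only the captured names rather than the whole match.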


Non-greedy regexp matching too much in pandoc-generated markdown file [duplicate]

This question already has answers here:
Regular expression to get text between square brackets including disparity?
(4 answers)
Closed 3 years ago.
The Problem
I'm trying to write a simple intermediary step in a Pandoc workflow. I have an original document in .docx which I'm converting to .md using the --track-changes switch (see Pandoc reader options for more information) to produce a markdown file which has MS word insertions/deletions/comments wrapped in span tags, e.g.
[Insertion text]{.insertion id="1" author="Jamie Bowman" date="2019-04-01T11:05:00Z"}
[Deletion text]{.deletion id="1" author="Jamie Bowman" date="2019-04-01T11:05:00Z"}
[Comment body]{.comment-start id="1" author="Jamie Bowman" date="2019-04-01T11:05:00Z"}[]{.comment-end id="1"}
I want to run a regexp find and replace on the markdown file which effectively 'accepts' insertions and deletions but leaves the comment spans. (This is so when I convert back to .docx, I have a clean .docx file with comments only.)
What I've tried
I have been able to accept all insertion spans and delete all deletion spans, but only when the body text does not carry across more than one line. My attempt at matching across new lines matches too much and I can't work out how to match the exact text only.
The following regexp matches almost all deletions which I can then replace with nothing:
Find: \[(.*?)\]{.deletion(.|\n)*?}
The same works for insertions, where I can use a backreference to retain the text but remove the span:
Find: \[(.*?)\]{.insertion(.|\n)*?}
Replace: $1
The patterns are matching too much, though, as you can see here.
Please let me know if anything is unclear. I've been working on this quite a bit today and it's difficult to explain the problem plainly! Thanks in advance.
The following regex should match the deletion fragments:
The regex for the insertions is mostly the same, except you have to have a capturing group ([^[]*?\):
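A lazy quantifier such as .*? still starts at the leftmost possible [ and only then grows as little as it can, which is why the patterns in the question can begin inside an earlier span and run on to a later .deletion or .insertion tag. One way around that is to use negated character classes instead of the dot. The following Python sketch illustrates the idea. It is not necessarily the exact expression from this answer, and it assumes that span bodies contain no square brackets and attribute lists contain no curly braces.

```python
import re

# Assumption: span bodies contain no "[" or "]" and attribute lists contain no "{" or "}".
# Negated classes also match line breaks, so spans that wrap across lines are covered.
DELETION = re.compile(r"\[[^\[\]]*\]\{\.deletion[^{}]*\}")
INSERTION = re.compile(r"\[([^\[\]]*)\]\{\.insertion[^{}]*\}")


def accept_changes(markdown: str) -> str:
    """Drop deletion spans, unwrap insertion spans, leave comment spans untouched."""
    without_deletions = DELETION.sub("", markdown)
    return INSERTION.sub(r"\1", without_deletions)


sample = (
    'Start [new\ntext]{.insertion id="1" author="Jamie Bowman" date="2019-04-01T11:05:00Z"} '
    '[old text]{.deletion id="2" author="Jamie Bowman" date="2019-04-01T11:05:00Z"}end '
    '[note]{.comment-start id="3" author="Jamie Bowman" date="2019-04-01T11:05:00Z"}[]{.comment-end id="3"}'
)

print(accept_changes(sample))
```

Python writes the backreference as \1 where the editor in the question uses $1.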

How to create a regexp that ends in a line break? [duplicate]

This question already has answers here:
Differences between`[.]` vs `.` in regex
(2 answers)
Closed 3 years ago.
I have (looong) inputs that are lists of sentences/bullets like the following:
Broker and broker´s fees: 不適合
Specific purpose or use for the present acquisition or disposal: 因應內部管理需要,調整投資架構
Other issues to be disclosed: 無
In order to "translate" the Chinese text, I want to create objects, in a regexp fashion, so I can later transform the second captured group according to what it says.
I thought something like the following would work:
Specific_purpose = /(Specific purpose or use for the present acquisition or disposal: )([.]+)(\n)/
Other_issues = /(Other issues to be disclosed: )([.]+)(\n)/
i.e. these regexps should be composed of captured group 1 (the title in English), captured group 2 (the section in Chinese) and captured group 3, i.e. the new line that indicates where the object ends.
Still, the code does not work and I cannot even get Ruby to find the needed objects in the input. If, for example, I add:
if input.include? Specific_purpose.to_s
  puts "Yes, I found such bullet "
else
  puts "No, there is no such bullet"
end
I keep getting "No, there is no such bullet", no matter how I rewrite the regexp.
Am I doing something wrong here? How do I create a regexp that will match everything until the line break?
Because each line contains a colon that also separates the English text from the Chinese text, you can capture the English in group 1 and the Chinese in group 2. Try using this regex,
Let me know if you face any issues.
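Inside a character class the dot is a literal dot, so [.]+ only matches a run of full stops, which is why the patterns in the question never match the Chinese text. A bare . matches any character except a line break, so .+ already stops at the end of the line. The question uses Ruby; the sketch below uses Python only to illustrate the colon-based split. The pattern and the names SAMPLE, LINE, and parse_bullets are illustrative assumptions, not the exact regex from this answer.

```python
import re

# Sample lines copied from the question.
SAMPLE = (
    "Broker and broker´s fees: 不適合\n"
    "Specific purpose or use for the present acquisition or disposal: 因應內部管理需要,調整投資架構\n"
    "Other issues to be disclosed: 無\n"
)

# Group 1: everything before the first colon (the English title).
# Group 2: the rest of the line; "." never crosses a line break by default.
LINE = re.compile(r"^([^:\n]+):[ \t]*(.+)$", re.MULTILINE)


def parse_bullets(text: str) -> dict[str, str]:
    """Map each English title to the text that follows its colon."""
    return {m.group(1): m.group(2) for m in LINE.finditer(text)}


for title, value in parse_bullets(SAMPLE).items():
    print(f"{title} -> {value}")
```

Keeping the result as a title-to-text mapping makes the later "translate group 2" step a plain dictionary lookup.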

How to handle a tilde / swung dash (~) in a regular expression in order to exclude temporary MS Office files?

I have a batch job in xml that gets scheduled by a job scheduling engine. This engine provides the possibility of observing directories for changes of their content. My task is to monitor directories on a file exchange server running Windows, where customers and clients upload files we need to process.
We need to know about the arrival of new files as soon as possible.
I have to put a regular expression into that xml-job in order to not match subdirectories and temporary files.
In most cases, customers and clients upload files formatted as text/csv/pdf, which don't cause any problems. Some upload MS Office files, which, on the other hand, become a problem if someone opens them in the directory. Then an invisible temporary file is created beginning with ~$.
According to the documentation of the scheduling engine, the regex follows the POSIX 1003.2 standard. However, I am not able to prevent notifications being sent when someone opens an MS Office file in a monitored directory.
My regular expressions, that I have tried so far are:
First try before even noticing temporary office files:
Second try, intention was excluding a leading ~:
Third try, intention was excluding a leading ~ by its character code:
Fourth try, intention was excluding a leading ~ by its character code with a capital E:
All of those don't stop sending notifications on file openings…
Does anyone have any idea what to do?
All suggestions and alternatives are welcome.
I even checked them at regex101 and other online testers, where the second try was matching exactly as desired. I did not even forget to configure POSIX compilation where those sites allowed it (not all do).
How can I exclude the ~ character from matching the regular expression (at the beginning of a file name)?
Short version:
How can I create a regular expression that matches any file with any extension apart from .part and does neither match the file thumbs.db, nor any file whose name begins with a ~?
What should not be matched:
Subfolders (my approach was files without a .),
Thumbs.db (Windows thumbnails db),
*.part (filezilla partial uploads),
~$. (temporary files starting with ~ or ~$, MS Office tmp files)
The following list provides some files and folders that must be matched or not matched by the regex:
Ablage (subfolder, should not be matched)
Abrechnungen (subfolder, should not be matched)
~$TEST-WORKBOOK.xlsx (temporary file, should not be matched)
TEST-WORKBOOK.xlsx.part (partial upload, should not be matched)
TEST-WORKBOOK.part (partial upload, should not be matched)
New Problems occurred while trying to find the regex
A few problems came up after the creation of this question when I tried to apply the actually correct regex stated in the answer given by @Bohemian. I wasn't aware of those problems, so I am adding them here for completeness.
The first one occurred when certain characters in the regex were not allowed in xml. The xml file is parsed by a java class that throws an exception trying to parse < and >, they are forbidden in xml documents if not related to xml nodes directly (valid: <xml-node>...</xml-node>, invalid: attribute="<ome_on, why isn't this VALI|>").
This can be avoided by using the html names &lt; instead of < and &gt; instead of >.
The second (and currently unresolved) issue is that the engine rejects an operand in the actually correct regular expression ^(?=.*\.)(?!thumbs.db$)[^~].*(?<!\.part)$. The engine says:
Error: 2018-08-17T06:05:46Z REGEX-13
[repetition-operator operand invalid, ^(?=.*\.)(?!thumbs.db$)[^~].*(?<!\.part)$]
The corresponding line in the xml file looks like this:
<start_when_directory_changed directory="F:\someDirectory" regex="^(?=.*\.)(?!thumbs.db$)[^~].*(?<!\.part)$" />
Now I am stuck again, because my knowledge of regular expressions is pretty low. It is so low that I have no idea which character could be the rejected operand in the regex.
Research has brought me to this question whose accepted answer states "POSIX regexes don't support using the question mark ? as a non-greedy (lazy) modifier to the star and plus quantifiers (…)", which gives me an idea about what is wrong with the great regex. Still, I am not able to provide a working regex, more research will have to follow…
POSIX ERE doesn't allow for a simple way to exclude a particular string from matching. You can disallow a particular character -- like in [^.part] you are matching a single character which is not (newline or) dot or p or a or r or t -- and you can specify alternations, but those are very cumbersome to combine into an expression which excludes some particular patterns.
Here's how to do it, but as you can see, it's not very readable.
... and it still probably doesn't do exactly what you want.
Try this:
There is nothing special about the tilde character in regex.
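The lookaround-based expression quoted in the question can at least be sanity-checked outside the scheduling engine. POSIX ERE has no lookahead or lookbehind, so in that dialect the ? right after an opening parenthesis has nothing to repeat, which is probably what the "repetition-operator operand invalid" message refers to. The Python sketch below only verifies the logic of the pattern against the file names listed in the question; it does not make the pattern POSIX-compatible. The function name should_notify, the positive example TEST-WORKBOOK.xlsx, the escaped dot in thumbs\.db, and the case-insensitive flag are assumptions added for illustration.

```python
import re

# The expression from the question, with the dot in "thumbs.db" escaped and
# case-insensitive matching switched on so that "Thumbs.db" is excluded too.
PATTERN = re.compile(r"^(?=.*\.)(?!thumbs\.db$)[^~].*(?<!\.part)$", re.IGNORECASE)


def should_notify(name: str) -> bool:
    """True if the file name should trigger a notification."""
    return PATTERN.match(name) is not None


names = [
    "Ablage",                   # subfolder (no dot)
    "Abrechnungen",             # subfolder (no dot)
    "~$TEST-WORKBOOK.xlsx",     # temporary MS Office file
    "TEST-WORKBOOK.xlsx.part",  # partial upload
    "TEST-WORKBOOK.part",       # partial upload
    "Thumbs.db",                # Windows thumbnail database
    "TEST-WORKBOOK.xlsx",       # hypothetical regular upload
]

for name in names:
    print(f"{name!r}: {should_notify(name)}")
```

Only the last name is expected to pass, which matches the "short version" of the requirement in the question.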
I am very late to this, but the comments above were helpful for me. It may not work for you, but my solution is:
file_list <- file_list[!grepl("~", file_list)]

Remove first char from string - Regex [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 4 years ago.
I have started using Workflow on iOS to help speed up tasks at work. One of those is entering delivery records into the computer (via the iPad barcode scan function) instead of manually writing down the ref code and then typing it in.
Workflow has a "Replace Text" function that can be used with regexs to strip out characters etc.
I have managed to find a regex to get rid of the last digit in a scan (a checksum digit, always a capital letter).
The regex is simple.
This goes in the "Find Text" field. The "Replace With" is left empty. It works wonderfully.
How can I adapt this to work with other scan types where I want to specifically get rid of the FIRST character only? I've searched the forums but can only find long and difficult-to-interpret regexes that I am sure won't do what I am trying to achieve, which is something simple by comparison.
An example of what I mean is to convert "Y300006944" to "300006944".
You can use the following regex:
with a backreference $1 that you can use as replacement.
Good luck.
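As a rough illustration of both approaches (replace the first character with nothing, or capture everything after it and keep the group), here is a small Python sketch. The function names are made up, and Python writes the backreference as \1 where Workflow uses $1.

```python
import re


def drop_first_char(code: str) -> str:
    """Remove the first character by replacing it with nothing."""
    # "^." matches exactly one character at the very start of the string.
    return re.sub(r"^.", "", code, count=1)


def drop_first_char_with_group(code: str) -> str:
    """Same result via a capturing group and a backreference."""
    # Group 1 holds everything after the first character.
    return re.sub(r"^.(.*)$", r"\1", code)


print(drop_first_char("Y300006944"))
print(drop_first_char_with_group("Y300006944"))
```

The first variant mirrors the "Find Text" plus empty "Replace With" setup described in the question; the second mirrors a find-and-replace that keeps a captured group.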
Thanks to those who contributed something useful :)
I got it resolved by using the "Split Text" function in Workflow for iOS.
I gave it the command to split based on a custom char, "Y" in this case. It's enough in this simple case.

Regular expression to remove comment

I am trying to write a regular expression which finds all the comments in text.
For example all between /* */.
/* Hello */
When I do this: /\*.*\*/, it behaves oddly and nothing is shown. What is wrong with it?
EDIT: The comments can be spread across multiple lines
Unlike the example posted above, you were trying to match comments that spanned multiple lines. By default, . does not match a line break. Thus you have to enable the mode in which . also matches line breaks (called single-line or dot-all mode in many regex flavors) to match multi-line comments.
Also, you probably need to use .*? instead of .*. Otherwise it will make the largest match possible, which will be everything between the first open comment and the last close comment.
I don't know how to enable multi-line matching mode in Sublime Text 2. I'm not sure it is available as a mode. However, you can insert a line break into the actual pattern by using CTRL + Enter. So, I would suggest this alternative:
If Sublime Text 2 doesn't recognize the \n, you could alternatively use CTRL + Enter to insert a line break in the pattern, in place of \n.
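The two points made in this answer (let the dot cross line breaks, and prefer a lazy quantifier) can be tried out in a script. The sketch below uses Python's re.DOTALL flag, which is Python's way of letting . match line breaks; it is not a Sublime Text setting, and the sample string SOURCE is invented for illustration.

```python
import re

# Invented sample: one comment spanning two lines, one on a single line.
SOURCE = "a = 1; /* first\n   comment */ b = 2; /* second */ c = 3;"

# re.DOTALL lets "." match line breaks as well.
GREEDY = re.compile(r"/\*.*\*/", re.DOTALL)  # runs from the first /* to the last */
LAZY = re.compile(r"/\*.*?\*/", re.DOTALL)   # stops at the nearest */

print(repr(GREEDY.sub("", SOURCE)))
print(repr(LAZY.sub("", SOURCE)))
```

The greedy version also swallows the code between the two comments, while the lazy version removes each comment separately.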
I encountered this problem several years ago and wrote an entire article about it.
If you don't have access to non-greedy matching (not all regex libraries support non-greedy) then you should use this regex:
If you do have access to non-greedy matching then you can use:
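One well-known way to match block comments without any lazy quantifier, not necessarily the expression from the article mentioned above, is to describe the comment body with negated character classes. A Python sketch, with the name BLOCK_COMMENT and the sample text chosen for illustration:

```python
import re

# "/*", then any non-star characters, then one or more stars; after that, repeatedly:
# a character that is neither "/" nor "*", more non-stars, more stars; finally "/".
# No lazy quantifier and no special flags are needed, since negated classes match newlines.
BLOCK_COMMENT = re.compile(r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/")

sample = "x /* a * b **/ y /* c\nd */ z"

print(repr(BLOCK_COMMENT.sub("", sample)))
```

Because the body can never contain the closing */, the match ends at the first terminator even though every quantifier is greedy.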
Also, keep in mind that regular expressions are just a heuristic for this problem. Regular expressions don't support cases in which something appears to be a comment to the regular expression but actually isn't:
someString = "An example comment: /* example */";
// The comment around this code has been commented out.
// /*
// */
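Both false positives are easy to reproduce. In this Python sketch the lazy pattern from above (used here as an assumed stand-in for whichever comment regex is applied) strips text from inside a string literal and also removes real code that sits between two commented-out comment markers. The call code(); is a made-up placeholder.

```python
import re

LAZY = re.compile(r"/\*.*?\*/", re.DOTALL)

# Case 1: the "comment" is really part of a string literal.
in_string = 'someString = "An example comment: /* example */";'
print(LAZY.sub("", in_string))

# Case 2: the block comment markers were themselves commented out with //,
# so code(); is live code, yet the regex deletes it.
commented_out = "// /*\ncode();\n// */"
print(repr(LAZY.sub("", commented_out)))
```

In both cases the regex changes the meaning of the program, which is the limitation this answer warns about.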
Just want to add that for HTML comments it is this:
Just an additional note about using regex to remove comments inside a programming language file.
When doing this, you must not forget the case where the string /* or */ appears inside a string in the code, like var string = "/*"; (you never know when you are parsing a huge codebase that is not yours)!
So the best approach is to parse the document with a programming language and keep a boolean that tracks whether a string is currently open (and ignore any match inside an open string).
Also, a string delimited by " can contain a \", so be careful with the regex!
You cannot write a regular expression that would be able to correctly find all comments, or even one type of comments - single-line or multiline.
Regular expressions can only provide a partial match, one that would cover perhaps 90% of all cases, but that's it.
The syntax for regular expressions is so complex that it is only possible to identify them correctly in 100% of cases by doing a full expression evaluation, which in turn is based on tokenizing the code. The latter is a huge task, which is implemented by all AST parsers today. See AST Explorer.
Only a properly written AST parser can tell you precisely where all regular expressions are located in your code. You would then have to write a parser based on that.
Or, you could use one of the existing libraries that already do all that, like decomment.
RegEx examples where any head-on approach is going to stumble, being unable to tell a regular expression from a comment block:
/\// - it will think this reg-ex is a single-line comment
/\/*/ - it will think this reg-ex opens a multi-line comment
The answer which user1919238 wrote works. Just corroborating that here, although the many upvotes probably do give you a clue.
It got rid of all these annoying block comments, put here just to show the usefulness/thank user1919238 for saving time:
/*# sourceMappingURL=data:application/json;base64,eyJ2ZXJzaW9uIjozLCJzb3VyY2VzIjpbIndlYnBhY2s6Ly9zdHlsZXMvZ2xvYmFscy5jc3MiXSwibmFtZXMiOltdLCJtYXBwaW5ncyI6IkFBQUE7O0VBRUUsVUFBVTtFQUNWLFNBQVM7RUFDVDt3RUFDc0U7QUFDeEU7O0FBRUE7RUFDRSxjQUFjO0VBQ2QscUJBQXFCO0FBQ3ZCOztBQUVBO0VBQ0Usc0JBQXNCO0FBQ3hCIiwic291cmNlc0NvbnRlbnQiOlsiaHRtbCxcbmJvZHkge1xuICBwYWRkaW5nOiAwO1xuICBtYXJnaW46IDA7XG4gIGZvbnQtZmFtaWx5OiAtYXBwbGUtc3lzdGVtLCBCbGlua01hY1N5c3RlbUZvbnQsIFNlZ29lIFVJLCBSb2JvdG8sIE94eWdlbixcbiAgICBVYnVudHUsIENhbnRhcmVsbCwgRmlyYSBTYW5zLCBEcm9pZCBTYW5zLCBIZWx2ZXRpY2EgTmV1ZSwgc2Fucy1zZXJpZjtcbn1cblxuYSB7XG4gIGNvbG9yOiBpbmhlcml0O1xuICB0ZXh0LWRlY29yYXRpb246IG5vbmU7XG59XG5cbioge1xuICBib3gtc2l6aW5nOiBib3JkZXItYm94O1xufVxuIl0sInNvdXJjZVJvb3QiOiIifQ== */
If you want to remove the obnoxious comments from Flutter's main.dart:
Press Cmd+R on Mac or Ctrl+R on Windows.
Type //.* into the upper box and leave the lower box empty.
Click .* in the replace dialog to activate regex mode.
Then click Replace All. This removes all // line comments; you can do this to remove the comments in any file in a Flutter project.
Additionally, to reformat main.dart:
Press Cmd+A on Mac or Ctrl+A on Windows.
Then press Cmd+Alt(Option)+L or Ctrl+Alt+L to reformat the code.
The green .* at the top of the page is what you press to activate regex mode.
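The same //.* replacement can be reproduced in a script, which also shows its main pitfall: it removes everything after any //, including the rest of a URL inside a string. A Python sketch with an invented Dart-like snippet (the URL is hypothetical):

```python
import re

# "." stops at the end of the line, so each match runs from "//" to the line end.
LINE_COMMENT = re.compile(r"//.*")

dart = (
    "void main() { // entry point\n"
    '  var url = "http://example.com"; // hypothetical URL\n'
    "}\n"
)

cleaned = LINE_COMMENT.sub("", dart)
print(cleaned)
```

The second line comes out truncated after "http:", so it is worth reviewing the result before saving when the file contains URLs or other strings with //.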