How to create a regexp that ends in a line break? [duplicate] - regex

This question already has answers here:
Differences between `[.]` vs `.` in regex
(2 answers)
Closed 3 years ago.
I have (looong) inputs that are lists of sentences/bullets like the following:
Broker and broker´s fees: 不適合
Specific purpose or use for the present acquisition or disposal: 因應內部管理需要,調整投資架構
Other issues to be disclosed: 無
In order to "translate" the Chinese text, I want to create objects, in a regexp fashion, so I can later transform the second captured group according to what it says.
I thought something like the following would work:
Specific_purpose = /(Specific purpose or use for the present acquisition or disposal: )([.]+)(\n)/
Other_issues = /(Other issues to be disclosed: )([.]+)(\n)/
i.e. these regexps should be composed of captured group 1 (the title in English), captured group 2 (the section in Chinese) and captured group 3, i.e. the new line that indicates where the object ends.
Still, the code does not work and I cannot even get Ruby to find the needed objects in the input. If, for example, I add:
if input.include? Specific_purpose.to_s
  puts "Yes, I found such bullet "
else
  puts "No, there is no such bullet"
end
I keep getting "No, there is no such bullet", no matter how I rewrite the regexp.
Am I doing something wrong here? How do I create a regexp that will match everything until the line break?

Each line contains a colon that separates the English text from the Chinese text, so you can capture the English part in group 1 and the Chinese part in group 2. Try using this regex:
(.+):\s+(.+)
Let me know if you face any issues.
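Some context on why the original attempt fails may help: inside a character class, [.] matches only a literal dot, so [.]+ never matches the Chinese text, and Regexp#to_s returns the pattern source (something like "(?-mix:...)"), which include? then looks for as a literal substring. Below is a minimal Ruby sketch of the colon-based approach. The sample input is copied from the question; the translations hash and the LINE constant are hypothetical names used only for illustration.

```ruby
# Sample input copied from the question; the real input would be much longer.
input = <<~TEXT
  Broker and broker´s fees: 不適合
  Specific purpose or use for the present acquisition or disposal: 因應內部管理需要,調整投資架構
  Other issues to be disclosed: 無
TEXT

# [.] is a literal dot, so the original pattern cannot match Chinese text.
broken = /(Other issues to be disclosed: )([.]+)(\n)/
# A bare . matches any character except a line break, so .+ stops at "\n".
fixed  = /(Other issues to be disclosed: )(.+)(\n)/

puts input.match?(broken) # false
puts input.match?(fixed)  # true

# Hypothetical lookup table, only here to show transforming group 2.
translations = { "無" => "None" }

# The regex from the answer, anchored to single lines with ^ and $.
LINE = /^(.+):\s+(.+)$/

pairs = input.scan(LINE).map do |title, chinese|
  [title, translations.fetch(chinese, chinese)]
end

pairs.each { |title, text| puts "#{title} => #{text}" }
```

Note that the test uses match? on the Regexp itself rather than include? on its string form.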

Related

How to create capturing groups in regular expressions? [duplicate]

This question already has answers here:
How do I regex match with grouping with unknown number of groups
(6 answers)
Closed 3 years ago.
I'm trying to use Regex to classify groups of data dependent on how many sectors are within an array. E.g.:
group = One journey
group|group = Two journeys
group|group|group = Three journeys
Could someone tell me the best practice way to do this please?
EDIT: Apologies but I'm pretty new to RegEx and still trying to work things out. I don't know which language I'm using but the tool I'm building these into is Adobe Analytics - using the Classification Rule Builder.
Also, this question has been marked as duplicate but I can't say I found the other thread particularly helpful.
I've also tried experimenting using Regex101 but still can't get my head around this. Thanks.
For such a case you need to put what you want to match inside a capturing group; how you then count the matches depends on the language you are using. For example, if you are using Python you can use:
(\w+)
This regex captures every run of word characters, that is [a-zA-Z0-9_], so it matches each piece of text between the pipes (provided the pieces contain only word characters). Counting the matches gives the number of journeys.
By the way, to test your regex and to practise by trial and error, you can use an online regex tester.
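As a rough sketch of the counting idea in Ruby (the answer mentions Python; the regex is the same), assuming each group consists only of word characters. The LABELS table and the classify helper are hypothetical, and whether this carries over to the Classification Rule Builder is not something the sketch can confirm.

```ruby
# Hypothetical labels taken from the examples in the question.
LABELS = { 1 => "One journey", 2 => "Two journeys", 3 => "Three journeys" }

# Count the runs of word characters; each run is one pipe-separated group.
def classify(value)
  count = value.scan(/\w+/).length
  LABELS.fetch(count, "#{count} journeys")
end

puts classify("group")
puts classify("group|group")
puts classify("group|group|group")
```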

How to exclude files names that have exact match with regex and to support multiple patterns? [duplicate]

This question already has answers here:
Regex: match everything but a specific pattern
(6 answers)
Alternation operator inside square brackets does not work
(2 answers)
Closed 4 years ago.
I have a program that transfers files from one folder to another (and does some manipulation on those files).
I am trying to create a regex to exclude certain files based on file name and extension. It also needs to support multiple entries, since users enter the exclude patterns they want, like:
exclude pattern: thumbs.db ; donotmove.xls ; *.ini;
I have a folder, c:\testfolder, with the following files:
test.txt
test.exe
something.jpg
other.pdf
The files above should be moved (could be any file).
The files below should not be moved (could be anything that the user specifies in the config):
thumbs.db
donotmove.xlsx
The goal is to exclude unwanted files like thumbs.db, and other files that the user will want to exclude (the code is already written, I am just having trouble with the regex itself).
I have been able to create a regex that will capture thumbs.db, but I have not been able to successfully add a negative lookahead into the regex.
The regex I have is:
^[Tt][Hh][Uu][Mm][Bb][Ss]\.[Dd][Bb]$
This matches thumbs.db exactly, meaning that a file named "tthumbs.db" should not be matched.
The first problem is that I cannot get a negative lookahead to exclude the string.
I have tried the following:
^(?![Tt][Hh][Uu][Mm][Bb][Ss].[Dd][Bb])$
^(?![Tt][Hh][Uu][Mm][Bb][Ss]).(?![Dd][Bb])$
^(?![Tt])(?![Hh])(?![Uu])(?![Mm])(?![Bb])(?![Ss])).(?![Dd])(?![Bb])$
The second problem is that I cannot get capturing groups to match multiple strings in one regex. I tried adding another capture group, but then nothing is captured:
^([Tt][Hh][Uu][Mm][Bb][Ss].[Dd][Bb])([Tt][Ee][Ss][Tt].[Tt][Xx][Tt])$
thnx in advance.
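Two observations on the attempts above, followed by a hedged sketch. A lookahead consumes no characters, so ^(?!...)$ can only ever match an empty string; something like .+ still has to consume the file name after the lookahead. And two groups written side by side must both match in sequence; alternatives need the | operator. The Ruby helper below is a hypothetical illustration (the name exclude_regex and the config format are assumptions based on the question's example), not the asker's program.

```ruby
# Hypothetical helper: turn a user-entered exclude list into one regex
# that matches only the file names that should be MOVED.
def exclude_regex(config)
  alternatives = config.split(";").map(&:strip).reject(&:empty?).map do |pattern|
    # Escape the pattern literally, then turn the escaped "*" back into ".*".
    Regexp.escape(pattern).gsub("\\*", ".*")
  end
  # The lookahead rejects a name only when the WHOLE name equals one alternative;
  # the trailing .+ then consumes the name. /i replaces the [Tt][Hh]... classes.
  /\A(?!(?:#{alternatives.join("|")})\z).+\z/i
end

movable = exclude_regex("thumbs.db ; donotmove.xls ; *.ini;")

%w[test.txt Thumbs.db tthumbs.db setup.ini donotmove.xlsx].each do |name|
  puts "#{name}: #{movable.match?(name) ? 'move' : 'skip'}"
end
```

Because the match is exact, donotmove.xlsx is still moved when the configured pattern is donotmove.xls, as in the question's example list.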

Remove first char from string - Regex [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 4 years ago.
I have started using Workflow on iOS to help speed up tasks at work. One of those is entering delivery records into the computer (via the iPad barcode scan function) instead of manually writing down the ref code and then typing it in.
Workflow has a "Replace Text" function that can be used with regexs to strip out characters etc.
I have managed to find a regex to get rid of the last digit in a scan (a checksum digit, always a capital letter).
The regex is simple.
.{0}-$.
This goes in the "Find Text" field. The "Replace With" is left empty. It works wonderfully.
How can I adapt this to work with other scan types where I specifically want to get rid of the FIRST character only? I've searched the forums but can only find long, difficult-to-interpret regexes that I am sure won't do what I am trying to achieve, which is something simple by comparison.
An example of what I mean is converting "Y300006944" to "300006944".
You can use the following regex:
^.(.*)$
with a backreference $1 that you can use as replacement.
Good luck.
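To illustrate how this regex behaves, here is a small Ruby sketch (Ruby writes the backreference as \1 in the replacement string, where the tool above uses $1). The variable names are illustrative only.

```ruby
code = "Y300006944"

# ^.(.*)$ : the "." consumes the first character, group 1 keeps the rest,
# and the whole match is replaced by group 1.
stripped = code.sub(/^.(.*)$/, '\1')
puts stripped

# An equivalent without a capture group: match only the first character
# and replace it with nothing (the "Replace With" field left empty).
also_stripped = code.sub(/\A./, "")
puts also_stripped
```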
Thanks to those who contributed something useful :)
I got it resolved by using the "Split Text" function in Workflow for iOS.
I told it to split on a custom char, "Y" in this case. That's enough in this simple case.

Capture everything after one word [duplicate]

This question already has an answer here:
Learning Regular Expressions [closed]
(1 answer)
Closed 6 years ago.
I am trying to make a regular expression capture any words in the specific line after the word Attachment:
This question is for work, so it is not a homework or test question. I took the paragraph below as an example from www.regular-expressions.info. I did not major in computers but Psychology so this is completely foreign to me. I've read the manuals for the last two days, and because this is going over my head, I don't know how to begin.
I have a task which involves me linking the attachments to a specific file with the same name saved in a folder (at least 500 attachments) on Adobe PDF. What I did before was to manually select the words and link it to a specific file in a folder, but it is tedious to do when they can go up to 500 attachments.
I was aware of an application plug-in called EVERMAP that you can download for Adobe to automatically link specific words to a specific file in a folder. However, it requires me to use regular expressions which again, I don't know how to use.
I will bold the words I want to capture in the paragraph below.
The repetition operator manual expand the match as far as they, and only come back if they must to satisfy the remainder.
Attachment: The repetition operator manual
The asterisk or star tells the engine to attempt to match the preceding token zero or more times. The plus tells the engine to attempt to match the preceding token once or more.
Attachment: Asterisk and stars engine
Attachment: (.+) should work in your case unless there are other exceptions to this rule. The regex simply tells the parser to capture one or more characters after the word Attachment:.
Like @Kevin said, the regex is simple. Use Attachment: (.+).
Maybe you are confused about how to use regex. I don't know about the Evermap plugin, but you can copy all the text from the PDF into Sublime Text (a text editor with a lot of features) and do the regex part there. Then, since you are not a programmer, you can also remove the other, irrelevant text in the same step. So the regex will be:
^\s*Attachment:\s*(.+)$|^(?!Attachment:).+$
And replace it with:
\1
\1 refers to the text captured by the group in ().
In Sublime Text, open Find and Replace, then apply the regex there. Don't forget to turn on regex mode.
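For readers who would rather script this than use an editor, a minimal Ruby sketch of the same capture follows. The sample text is taken from the question; the variable names are illustrative, and how the result would be fed to the Evermap plugin is outside what this sketch covers.

```ruby
text = <<~TEXT
  The asterisk or star tells the engine to attempt to match the preceding token zero or more times.
  Attachment: The repetition operator manual
  Attachment: Asterisk and stars engine
TEXT

# Group 1 holds everything after "Attachment:" up to the end of that line.
names = text.scan(/^\s*Attachment:\s*(.+)$/).flatten

names.each { |name| puts name }
```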

Bash regex to detect IPv6 or none [duplicate]

This question already has answers here:
Regular expression that matches valid IPv6 addresses
(30 answers)
Closed 9 years ago.
How would I modify this IPv6 regex I wrote to either detect the address (ie the way the regex is written right now), but also accept "blank" ie the user did not specify an IPv6 address?
^[0-9a-fA-F]{1,4}:[0-9a-fA-F]{1,4}:[0-9a-fA-F]{1,4}:[0-9a-fA-F]{1,4}:[0-9a-fA-F]{1,4}:[0-9a-fA-F]{1,4}:[0-9a-fA-F]{1,4}:[0-9a-fA-F]{1,4}$
Right now, the regex is looking for a minimum of 0:0:0:0:0:0:0:0 or similar. In fact, in addition to a blank address, I probably also need to handle compression, as in the following addresses:
FE80::1
or ::1
etc
Thanks!
* UPDATE *
So let me make sure I have this straight...
(^$|^IPV4)\|(^$|IPV6)\|REST OF STUFF$
That doesn't seem right. I feel like I have misplaced the ^ and $ at the very beginning and end of my entire regex.
Maybe this instead:
^(^$|IPV4)\|(^$|IPV6)\|REST$
* UPDATE *
Still no luck. Here is part of my code with the middles chopped out for sanity:
^(|[0-9]{1,3}.<<<OMIT MIDDLE IPV4>>>.[0-9]{1,3})\|(|(\A([0-9a-f]{1,4}:){1,1}<<<OMIT MIDDLE IPV6>>>[0-1]?\d?\d)){3}\Z))\|[a-zA-Z0-<<<MORE STUFF MIDDLE OMITTED>>>{0,50}$
I hope that isn't confusing. That's the beginning and end of each regex with the middles omitted so you can see the ( ).
Perhaps I need to enclose the entire gigantic IPV6 regex in parentheses?
* UPDATE *
Tried last statement above... no luck.
You can specify alternation with the | character, so a|b means "match either a or b". In this case it would look something like this:
^$|^[0-9a-fA-F]{1,4}:[0-9a-fA-F]{1,4}:[0-9a-fA-F]{1,4}:[0-9a-fA-F]{1,4}:[0-9a-fA-F]{1,4}:[0-9a-fA-F]{1,4}:[0-9a-fA-F]{1,4}:[0-9a-fA-F]{1,4}$
The regex ^$ will match empty strings, so ^$|<current-regex> means "match either an empty string, or whatever <current-regex> matches (in this case IPv6)". You could use ^\s*$ in place of ^$ if you want strings that only consist of whitespace character to also be considered "empty".
This just handles the first part of the question. Handling compression like FE80::1 is more complex, and it looks like there are already some good answers for that in the comments (note that I don't think this question is a dupe, because the "also matching an empty string" part isn't present in those questions).
edit: If it is part of a larger regex, then you should wrap everything in a group and get rid of the ^$, so it would be something like (|<current-regex>). Since there is nothing before the |, it means that the group can match either empty strings or whatever your current regex would match.
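A small Ruby sketch of both forms, the standalone "empty or IPv6" test and the (|<current-regex>) group inside a larger pipe-delimited regex. The constant names and the trailing (.*) standing in for "REST OF STUFF" are assumptions made for illustration; only uncompressed eight-group addresses are covered, as in the question's original regex. In Ruby, \A and \z anchor the whole string, while ^ and $ anchor lines.

```ruby
HEXTET    = /[0-9a-fA-F]{1,4}/
FULL_IPV6 = /#{HEXTET}(?::#{HEXTET}){7}/

# Standalone: an empty string OR a full, uncompressed IPv6 address.
EMPTY_OR_IPV6 = /\A(?:|#{FULL_IPV6})\z/

# Inside a larger regex: an optional address field, a literal pipe,
# then a placeholder for the remaining fields.
RECORD = /\A(|#{FULL_IPV6})\|(.*)\z/

["", "0:0:0:0:0:0:0:0", "FE80::1", "not an address"].each do |candidate|
  puts "#{candidate.inspect}: #{EMPTY_OR_IPV6.match?(candidate)}"
end

puts RECORD.match("|tail")[1].inspect
puts RECORD.match("1:2:3:4:5:6:7:8|tail")[1].inspect
```

The compressed form FE80::1 is rejected here, which is expected given that the pattern requires all eight groups.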
According to this post on Stack Overflow, another site has an explanation and example of a huge but very usable regex:
(\A([0-9a-f]{1,4}:){1,1}(:[0-9a-f]{1,4}){1,6}\Z)|
(\A([0-9a-f]{1,4}:){1,2}(:[0-9a-f]{1,4}){1,5}\Z)|
(\A([0-9a-f]{1,4}:){1,3}(:[0-9a-f]{1,4}){1,4}\Z)|
(\A([0-9a-f]{1,4}:){1,4}(:[0-9a-f]{1,4}){1,3}\Z)|
(\A([0-9a-f]{1,4}:){1,5}(:[0-9a-f]{1,4}){1,2}\Z)|
(\A([0-9a-f]{1,4}:){1,6}(:[0-9a-f]{1,4}){1,1}\Z)|
(\A(([0-9a-f]{1,4}:){1,7}|:):\Z)|
(\A:(:[0-9a-f]{1,4}){1,7}\Z)|
(\A((([0-9a-f]{1,4}:){6})(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3})\Z)|
(\A(([0-9a-f]{1,4}:){5}[0-9a-f]{1,4}:(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3})\Z)|
(\A([0-9a-f]{1,4}:){5}:[0-9a-f]{1,4}:(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}\Z)|
(\A([0-9a-f]{1,4}:){1,1}(:[0-9a-f]{1,4}){1,4}:(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}\Z)|
(\A([0-9a-f]{1,4}:){1,2}(:[0-9a-f]{1,4}){1,3}:(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}\Z)|
(\A([0-9a-f]{1,4}:){1,3}(:[0-9a-f]{1,4}){1,2}:(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}\Z)|
(\A([0-9a-f]{1,4}:){1,4}(:[0-9a-f]{1,4}){1,1}:(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}\Z)|
(\A(([0-9a-f]{1,4}:){1,5}|:):(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}\Z)|
(\A:(:[0-9a-f]{1,4}){1,5}:(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}\Z)