I have a large string. Here is a part of it:
{"status":"ok","items":[{"image_versions":[{"url":"http:\/\/distilleryimage8.instagram.com\/11a67042c62311e1bf341231380f8a12_7.jpg","width":612,"type":7,"height":612},{"url":"http:\/\/distilleryimage8.instagram.com\/11a67042c62311e1bf341231380f8a12_6.jpg","width":306,"type":6,"height":306},{"url":"http:\/\/distilleryimage8.instagram.com\/11a67042c62311e1bf341231380f8a12_5.jpg","width":150,"type":5,"height":150}],"code":"MrMBxJo-O8","has_more_comments":true,"taken_at":1341438972.0,"comments":[{"media_id":228329104165036988,"_spam":false,"text":"I live in Oklahoma! :D Shoot them off with me! :D","created_at":1341441914.0,"user":{"username":"heather_all_over","pk":13296276,"profile_pic_url":"http:\/\/images.instagram.com\/profiles\/profile_13296276_75sq_1339538236.jpg","full_name":"Heather\ud83c\udf80","is_private":false},"content_type":"comment","pk":228353791620276525,"type":0},{"media_id":228329104165036988,"_spam":false,"text":"Wish I had that much money to spend.......","created_at":1341441916.0,"user":{"username":"l_mcnair","pk":23775741,"profile_pic_url":"http:\/\/images.instagram.com\/profiles\/profile_23775741_75sq_1339894045.jpg","full_name":"Lauryn","is_private":true},"content_type":"comment","pk":228353803204944174,"type":0},{"media_id":228329104165036988,"_spam":false,"text":"You should video tape you setting them all off","created_at":1341441939.0,"user":{"username":"ahrii_","pk":37732021,"profile_pic_url":"http:\/\/images.instagram.com\/profiles\/profile_37732021_75sq_1340907381.jpg","full_name":"Ahriana;-*","is_private":false},"content_type":"comment","pk":228353997065675057,"type":0},{"media_id":228329104165036988,"_spam":false,"text":"When did skrillex start selling
I am trying to match every number after "pk":. I have been trying lookaheads but can't quite get it right. I don't know much about regex, so if somebody could point me in the right direction that would be great!
This looks like a JSON response. Why not just parse the JSON and pull out the values for all the "pk" keys?
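A minimal sketch of that approach in Python, assuming the complete response string is available (the excerpt in the question is cut off, so it would not parse as-is). The helper name collect_pk and the shortened sample are made up for illustration; the values in the sample are taken from the question.

```python
import json

def collect_pk(node):
    """Recursively gather every value stored under a "pk" key."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "pk":
                found.append(value)
            # Keep descending: nested objects can hold their own "pk".
            found.extend(collect_pk(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_pk(item))
    return found

# Shortened stand-in for the response shown in the question.
sample = (
    '{"status":"ok","items":[{"comments":[{"user":{"username":"l_mcnair",'
    '"pk":23775741},"pk":228353803204944174}]}]}'
)
pks = collect_pk(json.loads(sample))
print(pks)
```

Unlike a regex, this cannot be fooled by a comment whose text happens to contain "pk":.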
Depending on what language you're using, the regex might look different, but this should work on most languages:
/"pk":(\d+)/g
That basically looks for the string "pk": and then all the digits after that, placing those digits in a capturing group. The g at the end makes it search for all occurrences. Depending on the language you're using, though, you might not be able to retrieve all of the captures.
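As a rough illustration in Python, which has no g flag: re.findall plays that role, and with a single capturing group it returns just the captured digits. The sample text is a shortened fragment of the string from the question.

```python
import re

text = (
    '{"user":{"username":"ahrii_","pk":37732021},'
    '"content_type":"comment","pk":228353997065675057,"type":0}'
)

# With one capturing group, findall returns the group contents, not the full match.
pk_values = re.findall(r'"pk":(\d+)', text)
print(pk_values)
```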
If you want the part after something you should use look-behind:
(?<="pk":)\d+
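A small sketch of the look-behind version in Python. Python's re module only accepts fixed-width look-behinds, which "pk": satisfies; other flavours have different restrictions. The sample text is again a shortened fragment of the string from the question.

```python
import re

text = (
    '{"user":{"username":"ahrii_","pk":37732021},'
    '"content_type":"comment","pk":228353997065675057,"type":0}'
)

# The look-behind is only asserted, not consumed, so each match is the digits alone.
digits_only = [m.group(0) for m in re.finditer(r'(?<="pk":)\d+', text)]
print(digits_only)
```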
I am looking to extract some text from a raw credit card feed for a workflow. I have gotten almost where I want to but am struggling with the final piece of information I'm trying to extract.
An example of the raw feed is:
LEO'SFINEFOOD&WINEHARTWELLJune350.0735.00ICGROUP,INC.MELBOURNEJune5UNITEDSTATESDOLLARAUD50.07includesconversioncommissionofAUD1.469.96WOOLWORTHS3335CHADSTOCHADSTONE
I am looking to extract this from the above:
(ICGROUP,INC.MELBOURNE)June5UNITEDSTATESDOLLARAUD(50.07)includesconversioncommissionof
with the brackets representing the two groups I am after. The consistent parts across all instances of what I'm trying to extract are:
DIGITS (TEXT) DATE TEXT AMOUNT includesconversioncommissionof
I have been able to use the regex:
([A-Z][a-z]\d)[A-Z]AUD(\d\,?\d+?.\d*)includesconversioncommissionofAUD
to get the date and the amount. I am struggling to find a way to also capture the words ICGROUP,INC.MELBOURNE, as in the example above.
I have tried putting \d\d(.*) before the above regex but that doesn't work for some reason.
Would appreciate if anyone is able to help with what I'm after!
The closest I think we can get (PCRE) is something like:
/
[\d,.]+ # a currency value to bookend
(.+?) # capture everything in-between
[A-Z][a-z]+\d+ # a month followed by a day, e.g. "June5"
.+? # everything in-between
([\d,.]+) # capture a currency value
includesconversioncommissionof # our magic token to bookend
/x
The technique here is to pit greedy expressions against non-greedy expressions in a very deliberate way. Let me know if you have any questions about it. I would be extremely hesitant to put this in production—or even trust its output as an ad-hoc pass—without rigorous testing!
I'm using the pattern [\d,.] for currency, but you can replace that with something more sophisticated, especially if you expect weird formats and currency symbols. The biggest potential pitfall here is if the ICGROUP,INC.MELBOURNE token might start with a number. Then you'll definitely need a more sophisticated currency pattern!
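For anyone who wants to try it outside PCRE, here is a sketch of the same pattern in Python, where re.VERBOSE stands in for the /x flag. It is only checked against the single sample line from the question, so the same caveat about testing applies. The names FEED_PATTERN, merchant and amount are made up for illustration.

```python
import re

FEED_PATTERN = re.compile(
    r"""
    [\d,.]+                          # a currency value to bookend
    (.+?)                            # capture everything in between
    [A-Z][a-z]+\d+                   # a month followed by a day, e.g. June5
    .+?                              # everything in between
    ([\d,.]+)                        # capture a currency value
    includesconversioncommissionof   # the fixed token to bookend
    """,
    re.VERBOSE,
)

feed = (
    "LEO'SFINEFOOD&WINEHARTWELLJune350.0735.00ICGROUP,INC.MELBOURNEJune5"
    "UNITEDSTATESDOLLARAUD50.07includesconversioncommissionofAUD1.469.96"
    "WOOLWORTHS3335CHADSTOCHADSTONE"
)

match = FEED_PATTERN.search(feed)
merchant, amount = match.groups()
print(merchant, amount)
```

On this sample the leading [\d,.]+ greedily swallows 350.0735.00, which is what lets the lazy group start exactly at ICGROUP.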
Here's what I've got (in php).
$string = "LEO'SFINEFOOD&WINEHARTWELLJune350.0735.00ICGROUP,INC.MELBOURNEJune5UNITEDSTATESDOLLARAUD50.07includesconversioncommissionofAUD1.469.96WOOLWORTHS3335CHADSTOCHADSTONE";
$cleaned = preg_replace("/^(LEO'SFINEFOOD&WINEHARTWELL)([A-Za-z]{3,9})(\.|\d)*/", "", $string);
echo $cleaned;
what it returns is: ICGROUP,INC.MELBOURNEJune5UNITEDSTATESDOLLARAUD50.07includesconversioncommissionofAUD1.469.96WOOLWORTHS3335CHADSTOCHADSTONE
Which you can then use and run your own little regex on.
Explanation:
The [A-Za-z]{3,9} is used to remove the month, which may be 3 to 9 characters long. Then the (\.|\d)* removes the digits and dots that follow. I'm thinking that we could parse the month/date better using your regex to extract that June 5 part, but from the example given it shouldn't be necessary.
However, it would be much more helpful if you could provide at least 3 examples, optimally 5, so we can get a good feel of the pattern. Otherwise this is the best I can do with what you've given.
I am pretty new to the concept of regex, so I am hoping an expert user can help me craft the right expression to find all the matches in a string. I have a string that holds a lot of support information about vulnerability data. In that string are a series of CVE references in the format CVE-2015-4000. Can anyone provide a sample regex for finding all occurrences of that? Obviously, the numeric part changes throughout the string.
Generally you should always include your previous efforts in your question, what exactly you expect to match, etc. But since I am aware of the format and this is an easy one...
CVE-\d{4}-\d{4,7}
This matches first CVE- then a 4-digit number for the year identifier and then a 4 to 7 digit number to identify the vulnerability as per the new standard.
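A quick sketch of using it to pull every ID out of a longer string, here in Python. The report text is made up for the example.

```python
import re

# Made-up sample text containing three CVE-style references.
report = "Fixes CVE-2015-4000 and CVE-2014-0160; see also CVE-2016-1000001."

cve_ids = re.findall(r"CVE-\d{4}-\d{4,7}", report)
print(cve_ids)
```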
See this in action here.
If you need an exact match without any syntax or logic violations, you can try this:
^(CVE-(1999|2\d{3})-(0\d{2}[1-9]|[1-9]\d{3,}))$
You can run this against the test data supplied by MITRE here to test your code or test it online here.
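A sketch of how the strict pattern behaves, in Python. Because of the ^ and $ anchors it validates a whole string rather than finding IDs inside text. The helper name is_strict_cve is made up for illustration. One thing worth checking against your data: as written, the sequence part rejects four-digit numbers that start with 0 and end in 0, such as 0010.

```python
import re

STRICT_CVE = re.compile(r"^(CVE-(1999|2\d{3})-(0\d{2}[1-9]|[1-9]\d{3,}))$")

def is_strict_cve(candidate):
    """Return True only if the whole string is an ID accepted by the pattern."""
    return STRICT_CVE.match(candidate) is not None

checks = {
    "CVE-2015-4000": is_strict_cve("CVE-2015-4000"),          # accepted
    "CVE-1998-0001": is_strict_cve("CVE-1998-0001"),          # year before 1999
    "CVE-2015-0000": is_strict_cve("CVE-2015-0000"),          # all-zero sequence
    "CVE-2015-0010": is_strict_cve("CVE-2015-0010"),          # leading 0, trailing 0
    "see CVE-2015-4000": is_strict_cve("see CVE-2015-4000"),  # anchors forbid extra text
}
print(checks)
```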
I will add my two cents to the accepted answer. In case we want to match "CVE" case-insensitively, we can use the following regex:
r'(?i)\bcve\-\d{4}-\d{4,7}'
I uploaded some fields and want to make sure they didn't get corrupted (not by Mongo, but by my data generator).
The field of interest would take this regex:
donor_\d{1,2}_\d+
for example:
donor_17_82635294
There is no exception to that rule, so I was wondering if I could use negative look around in regex to find fields that don't meet this rule. The problem with negative look around examples on SO is it seems you have to know what you are looking for, which I don't. I want something like this.
db.collection.find({field:*not*/donor_\d{1,2}_\d+/i})
My other option is just to create a new collection with everything that matches my regex, but this would be much easier.
Thanks
J
Yes you can do negation of regular expression like this:
db.collection.find({field: { $not: /donor_\d{1,2}_\d+/i } })
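One thing to watch: the pattern is unanchored, so a value that merely contains a valid-looking part is not reported. The sketch below emulates the check client-side in Python to show the difference; the anchored variant is an assumption about what the asker wants, and the sample values are made up. The two dictionaries mirror the shell filter and assume a driver such as PyMongo, which accepts compiled Python patterns in a filter document.

```python
import re

# Unanchored, as in the answer: matches the rule anywhere inside the value.
LOOSE = re.compile(r"donor_\d{1,2}_\d+", re.IGNORECASE)
# Anchored variant (an assumption about intent): the whole value must follow the rule.
STRICT = re.compile(r"^donor_\d{1,2}_\d+$", re.IGNORECASE)

# Filter documents mirroring the shell query; "field" is the placeholder name from the question.
loose_query = {"field": {"$not": LOOSE}}
strict_query = {"field": {"$not": STRICT}}

# Made-up values: one valid, one with a three-digit donor number, one with stray characters.
samples = ["donor_17_82635294", "donor_170_1", "xdonor_17_82635294junk"]
flagged_loose = [s for s in samples if not LOOSE.search(s)]
flagged_strict = [s for s in samples if not STRICT.search(s)]
print(flagged_loose)
print(flagged_strict)
```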
Hi, I am currently using the following regex to prevent users from submitting content that contains profanity, as listed in the regex:
(?i)(pecan|tie|shirt|hole|ontology|meme|pelagic|cock|duck|slot|anjing lo|Banting|Chiba|Screw|Screwing|fat|where|mother|peer|per|sock|socker|locker|ans|rect|anal|pickpocket|joker|muck)\b
I would like to improve the regex so that it also filters out credit card numbers (Master, Visa, JCB, Amex and so on).
I have a regex for each card, namely:
^4[0-9]{12}(?:[0-9]{3})?$ (Visa)
^5[1-5][0-9]{14}$ (Master)
^3[47][0-9]{13}$ (Amex)
^3(?:0[0-5]|[68][0-9])[0-9]{11}$ (Diners)
^6(?:011|5[0-9]{2})[0-9]{12}$ (Discover)
^(?:2131|1800|35\d{3})\d{11}$ (JCB)
However, when I combine these credit card patterns with the profanity filter like this:
(?i)(pecan|tie|shirt|hole|ontology|meme|pelagic|cock|duck|slot|anjing lo|Banting|Chiba|Screw|Screwing|fat|where|mother|peer|per|sock|socker|locker|ans|rect|anal|pickpocket|joker|muck)\b (?i)^4[0-9]{12}(?:[0-9]{3})?$\b (?i)^5[1-5][0-9]{14}$\b
it ignores the profanity filter.
Can anyone point me in the right direction?
Filtering profanity is a great example when NOT to use regex!... Anyone who wants to swear can easily get around your filter by typing "0" instead of "o", or inserting a "." in the middle of a word, or hundreds of other workarounds. There are much better alternatives out there, if you'd like to do some research. Anyway, ignoring that...
Firstly, do you really need to do this in a single regex pattern?! Your code would look much more readable and be more easily maintainable if you split this into multiple lines of code.
But if you really insist on doing it this way, your pattern is looking for a swear word, followed by a Visa number, followed by a Master number. You have not implemented any "OR" condition here.
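A sketch of the split-up version in Python, where the "OR" is expressed in code rather than inside one pattern. The word list is a short placeholder taken from the pattern in the question, only three of the card patterns are included, and the function name is_rejected is made up.

```python
import re

# Placeholder word list; substitute the full list from the original filter.
BLOCKED_WORDS = re.compile(r"(?i)(duck|joker|muck)\b")

# Card patterns copied from the question; each is anchored with ^ and $.
CARD_PATTERNS = [
    re.compile(r"^4[0-9]{12}(?:[0-9]{3})?$"),  # Visa
    re.compile(r"^5[1-5][0-9]{14}$"),          # Master
    re.compile(r"^3[47][0-9]{13}$"),           # Amex
]

def is_rejected(submission):
    """Reject if a blocked word appears OR the whole input is a card number."""
    if BLOCKED_WORDS.search(submission):
        return True
    return any(pattern.search(submission) for pattern in CARD_PATTERNS)

print(is_rejected("what a joker"))
print(is_rejected("4111111111111111"))
print(is_rejected("my card is 4111111111111111"))
```

Because of the ^ and $ anchors, a card number is only caught when it is the entire submission; catching one in the middle of a sentence would need word boundaries instead of anchors.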
This is one of the stupidest policy requirements I've ever seen. Your filter will miss a lot of profanities, and will trigger on non-profanities; see Scunthorpe problem.
Then, your credit card regexes already exclude all possible swearwords because they allow only digits, out of which it is going to be difficult to construct a swearword.
But if your boss insists, make him happy with
(?i)^(?!.*(pecun|tai|shit|asshole|kontol|memek|pelacur|cock|dick|slut|anjing lo|bangsat|cibay|fuck|fucking|faggot|whore|motherfucker|peler|pler|suck|sucker|fucker|anus|rectum|anal|cocksucker|sucker|suck)\b)4[0-9]{12}(?:[0-9]{3})?$
Hopefully someone can help me out. Been all over google now.
I'm doing some zone-ocr of documents, and want to extract some text with regex. It is always like this:
"Til: Name Name Name org.nr 12323123".
I want to extract the name part. It can be 1-4 names, but "Til:" and "org.nr" always come before and after it.
Anyone?
If you can't use capturing groups (check your documentation) you can try this:
(?<=Til:).*?(?=org\.nr)
This solution uses lookbehind and lookahead assertions, but those are not supported by every regex flavour. If they work in yours, this regex will return only the part you want, because the text inside the assertions is not part of the match; the regex only checks that those patterns are present.
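A small sketch in Python, one flavour that does support both assertions. The surrounding spaces are part of the match, so they are stripped afterwards.

```python
import re

line = "Til: Name Name Name org.nr 12323123"

# The lazy .*? stops at the first position where "org.nr" follows.
found = re.search(r"(?<=Til:).*?(?=org\.nr)", line)
names = found.group(0).strip()
print(names)
```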
Use the pattern:
Til:(.*)org\.nr
Then take capture group 1 (in many APIs the second element of the result, after the full match at index 0) to get the content between the parentheses.
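For example, a sketch in Python, where group(0) is the full match and group(1) is the part in parentheses:

```python
import re

line = "Til: Name Name Name org.nr 12323123"

result = re.search(r"Til:(.*)org\.nr", line)
full_match = result.group(0)           # includes "Til:" and "org.nr"
name_part = result.group(1).strip()    # only the text between them
print(name_part)
```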