Searching for Social security number using Lucene 4 regexp

Searching for Social security number using Lucene 4 regexp - regex

I'm trying to use Lucene 4 Regexp query to find social security numbers. If the field is analyzed using the StandardAnalyzer or the EnglishAnalyzer, is there still some way to match strings like 222-33-4444 or 222 33 4444.
As far as I can see, these analyzers tokenize the components of the SSN, and then there's no way to catch consecutive matches for the 3 components. Ideally, I'd like 222 33 4444 to match something like "/[0-9]{3}/ /[0-9]{2}/ /[0-9]{4}/" but it doesn't seem to be perhaps because phrase queries do not work with regexp's (yes?) Any suggestions?

If you simply have a field of identifiers, or some such, use a StringField, or some other untokenized field, in which case a simple RegExpQuery is simple enough to define.
If you are trying to pull them out of a full-text field, which must be tokenized (and I assume this is the case), you can use the SpanQuery API to construct the appropriate query:
SpanQuery span1 = new SpanMultiTermQueryWrapper(new RegexpQuery(new Term("text", "[0-9]{3}")));
SpanQuery span2 = new SpanMultiTermQueryWrapper(new RegexpQuery(new Term("text", "[0-9]{2}")));
SpanQuery span3 = new SpanMultiTermQueryWrapper(new RegexpQuery(new Term("text", "[0-9]{4}")));
Query query = new SpanNearQuery({span1, span2, span3}, 0, true);
searcher.search(query, maxResults)

You can use the INTERVAL flag:
/<000-999>/ /<00-99>/ /<0000-9999>/
> INTERVAL

I do not know about lucene, but this regex works:
'\d{3}[ \-]\d{2}[ \-]\d{4}'
It matches both:
222 33 4444
and
222-33-4444

Related

Parsing a name from a complex string in Tableau

I have a series of values in Tableau that are long strings intermixed with letters and numbers. I am unable to control the data output, but would like to parse the names from these strings. They follow the following format:
Potato 1TByte 4.5 NFA
Board 256GByte 553 NCA
Launch 4 512GByte 4.5 NFA
Launch 4S 512GByte 4.5 NCA
From each of these, I am attempting to capture the following:
"Potato"
"Board"
"Launch 4"
"Launch 4S"
Each string follows the same format: the name, followed by size, followed by some extra information we don't really care about.
I've tried to put together some text parsing strings, but am coming up short, and am still trying to learn regular expressions.
The Tableau calculated field I was trying to work with was something like the following:
LEFT([String], FIND([String], "Byte") - 2)
The issue is that the text and numbers preceding Byte can be anywhere from 4 to 2 characters and I need a way to identify the length of that.
Any help would be greatly appreciated!

One option which uses a regex replacement:
REGEXP_REPLACE('Launch 4 512GByte 4.5 NFA', ' \d+[A-Z]Byte .*$', '')
This strips off everything from the Byte term to the right, leaving us with only the product name.

You could try the following - this seems to work - Screenshot of Tableau output. Find below the formulas for the various derived columns you see in the screenshot (Your source column is called [Name])
Step1 = LEFT([Name],FIND([Name],"Byte")-1)
Step2 = LEN([Step1])-LEN(REPLACE([Step1]," ",""))
Step3 = FINDNTH([Step1]," ",[Step2])
Step4 = LEFT([Step1],[Step3]-1)
And of course you can nest all these in a single calculated field - kept them as separate columns for easier understanding

Using Regex in Pig in hadoop

I have a CSV file containing user (tweetid, tweets, userid).
396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift, but at the same it is a curse",Obey_Jony09
396124436740317184,"“#BleacherReport: Halloween has given us this amazing Derrick Rose photo (via #amandakaschube, #ScottStrazzante) http://t.co/tM0wEugZR1” yes",Colten_stamkos
396124436845178880,"When's 12.4k gonna roll around",Matty_T_03
Now I need to write a Pig Query that returns all the tweets that include the word 'favorite', ordered by tweet id.
For this I have the following code:
A = load '/user/pig/tweets' as (line);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)[,”:-](.*)[“,:-](.*)')) AS (tweetid:long,msg:chararray,userid:chararray);
C = filter B by msg matches '.*favorite.*';
D = order C by tweetid;
How does the regular expression work here in splitting the output in desired way?
I tried using REGEX_EXTRACT instead of REGEX_EXTRACT_ALL as I find that much more simpler, but couldn't get the code working except for extracting just the tweets:
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'[,”:-](.*)[“,:-]',1)) AS (msg:chararray);
the above alias gets me the tweets, but if I use REGEX_EXTRACT to get the tweet_id, I do not get the desired o/p: B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'(.*)[,”:-]',1)) AS (tweetid:long);
(396124554353197056,"Just saw #samantha0wen and #DakotaFears at the drake concert #waddup")
(396124554172432384,"#Yutika_Diwadkar I'm just so bright 😁")
(396124554609033216,"#TB23GMODE i don't know, i'm just saying, why you in GA though? that's where you from?")
(396124554805776385,"#MichaelThe_Lion me too 😒")
(396124552540852226,"Happy Halloween from us 2 #maddow & #Rev_AlSharpton :) http://t.co/uC35lDFQYn")
grunt>
Please help.

Can't comment, but from looking at this and testing it out, it looks like your quotes in the regex are different from those in the csv.
" in the csv
” in the regex code.
To get the tweetid try this:
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'.*(,")',1)) AS (tweetid:long);

AWQL - how can i use a regular expressions or something similar?

I am querying the adwords api via the following AWQL-Query (which works fine):
SELECT AccountDescriptiveName, CampaignId, CampaignName, AdGroupId, AdGroupName, KeywordText, KeywordMatchType, MaxCpc, Impressions, Clicks, Cost, Conversions, ConversionsManyPerClick, ConversionValue
FROM KEYWORDS_PERFORMANCE_REPORT
WHERE CampaignStatus IN ['ACTIVE', 'PAUSED']
AND AdGroupStatus IN ['ENABLED', 'PAUSED']
AND Status IN ['ACTIVE', 'PAUSED']
AND AdNetworkType1 IN ['SEARCH'] AND Impressions > 0
DURING 20140501,20140531
Now i want to exclude some campaigns:
we have a convention for our new campaigns that the campaign name begins with three numbers followed by an underscore, eg. "100_brand_all"
So i want to get only these new campaigns..
I tried lots of different variations for STARTS_WITH but only exact strings are working - but i need a pattern to match!
I already read https://developers.google.com/adwords/api/docs/guides/awql?hl=en and following its content it should be possible to use a WHERE expression like this:
CampaignName STARTS_WITH ['0','1','2','3']
But that doesn't work!
Any other ideas how i can achieve this?

Well, why don't you run a campaign performance report first, then process that ( get the campaign ids you want or don't want) the use those in the "CampaignId IN [campaign ids here] . or CampaignID NOT_IN [campaign ids]

Best way to compare phone numbers using Regex

I have two databases that store phone numbers. The first one stores them with a country code in the format 15555555555 (a US number), and the other can store them in many different formats (ex. (555) 555-5555, 5555555555, 555-555-5555, 555-5555, etc.). When a phone number unsubscribes in one database, I need to unsubscribe all references to it in the other database.
What is the best way to find all instances of phone numbers in the second database that match the number in the first database? I'm using the entity framework. My code right now looks like this:
using (FusionEntities db = new FusionEntities())
{
var communications = db.Communications.Where(x => x.ValueType == 105);
foreach (var com in communications)
{
string sRegexCompare = Regex.Replace(com.Value, "[^0-9]", "");
if (sMobileNumber.Contains(sRegexCompare) && sRegexCompare.Length > 6)
{
var contact = db.Contacts.Where(x => x.ContactID == com.ContactID).FirstOrDefault();
contact.SMSOptOutDate = DateTime.Now;
}
}
}
Right now, my comparison checks to see if the first database contains at least 7 digits from the second database after all non-numeric characters are removed.
Ideally, I want to be able to apply the regex formatting to the point in the code where I get the data from the database. Initially I tried this, but I can't use replace in a LINQ query:
var communications = db.Communications.Where(x => x.ValueType == 105 && sMobileNumber.Contains(Regex.Replace(x.Value, "[^0-9]", "")));

Comparing phone numbers is a bit beyond the capability of regex by design. As you've discovered there are many ways to represent a phone number with and without things like area codes and formatting. Regex is for pattern matching so as you've found using the regex to strip out all formatting and then comparing strings is doable but putting logic into regex which is not what it's for.
I would suggest the first and biggest thing to do is sort out the representation of phone numbers. Since you have database access you might want to look at creating a new field or table to represent a phone number object. Then put your comparison logic in the model.
Yes it's more work but it keeps the code more understandable going forward and helps cleanup crap data.

How to find all the source lines containing desired table names from user_source by using 'regexp'

For example we have a large database contains lots of oracle packages, and now we want to see where a specific table resists in the source code. The source code is stored in user_source table and our desired table is called 'company'.
Normally, I would like to use:
select * from user_source
where upper(text) like '%COMPANY%'
This will return all words containing 'company', like
121 company cmy
14 company_id, idx_name %% end of coding
453 ;companyname
1253 from db.company.company_id where
989 using company, idx, db_name,
So how to make this result more intelligent using regular expression to parse all the source lines matching a meaningful table name (means a table to the compiler)?
So normally we allow the matched word contains chars like . ; , '' "" but not _
Can anyone make this work?

To find company as a "whole word" with a regular expression:
SELECT * FROM user_source
WHERE REGEXP_LIKE(text, '(^|\s)company(\s|$)', 'i');
The third argument of i makes the REGEXP_LIKE search case-insensitive.
As far as ignoring the characters . ; , '' "", you can use REGEXP_REPLACE to suck them out of the string before doing the comparison:
SELECT * FROM user_source
WHERE REGEXP_LIKE(REGEXP_REPLACE(text, '[.;,''"]'), '(^|\s)company(\s|$)', 'i');
Addendum: The following query will also help locate table references. It won't give the source line, but it's a start:
SELECT *
FROM user_dependencies
WHERE referenced_name = 'COMPANY'
AND referenced_type = 'TABLE';

If you want to identify the objects that refer to your table, you can get that information from the data dictionary:
select *
from all_dependencies
where referenced_owner = 'DB'
and referenced_name = 'COMPANY'
and referenced_type = 'TABLE';
You can't get the individual line numbers from that, but you can then either look at user_source or use a regexp on the specific source code, which woudl at least reduce false positives.

SELECT * FROM user_source
WHERE REGEXP_LIKE(text,'([^_a-z0-9])company([^_a-z0-9])','i')
Thanks #Ed Gibbs, with a little trick this modified answer could be more intelligent.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Searching for Social security number using Lucene 4 regexp - regex

You can use the INTERVAL flag: /<000-999>/ /<00-99>/ /<0000-9999>/ > INTERVAL

I do not know about lucene, but this regex works: '\d{3}[ \-]\d{2}[ \-]\d{4}' It matches both: 222 33 4444 and 222-33-4444

Related

Parsing a name from a complex string in Tableau

Using Regex in Pig in hadoop

AWQL - how can i use a regular expressions or something similar?

Best way to compare phone numbers using Regex

How to find all the source lines containing desired table names from user_source by using 'regexp'

Categories

Resources