I'm currently using a RegEx procedure in Alteryx to recognize an employee number in a PDF document and split the document into individual PDFs based on the employee number.
[Image: RegEx in Alteryx flow]
Essentially what it does is find the term "Employee" on each page, returns the proceeding six digit number, splits the page out and renames the file using that number. This has, so far, worked fine.
However, I have had some errors/kickouts, and honestly I want to be more sure about the process, so my question is this:
Is there a way to have the regex point to a list of employee numbers (say, in Excel) and split the pages based on matching numbers within the PDF file?
Any help would be greatly appreciated.
dave
The RegEx can't do it, but with Alteryx, just have another data stream that reads the list of Excel numbers, then join your data stream to that. Assume your data stream is the L input, and the valid EmpNo list is the R input. Then:
The L output is invalid data stream records: save these for further analysis.
The J output is valid data stream records: continue processing them.
The R output is valid employees not represented in the data stream; retain for further review if interested.
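For readers who want to see the logic outside Alteryx, here is a minimal Python sketch of what the three join outputs contain. The function name and the sample employee numbers are made up for illustration; it is not how Alteryx implements the Join tool.

```python
def three_way_join(stream_emp_nos, valid_emp_nos):
    """Mimic the L / J / R outputs of a join on employee number."""
    valid = set(valid_emp_nos)
    seen = set(stream_emp_nos)
    left_only = [n for n in stream_emp_nos if n not in valid]   # L: not in the Excel list
    joined = [n for n in stream_emp_nos if n in valid]          # J: confirmed employees
    right_only = [n for n in valid_emp_nos if n not in seen]    # R: listed but not in the PDF
    return left_only, joined, right_only


# Hypothetical data: numbers extracted from the PDF vs. the Excel master list.
extracted = ["100234", "100235", "999999"]
master = ["100234", "100235", "100300"]

left_only, joined, right_only = three_way_join(extracted, master)
print("L (review):", left_only)
print("J (process):", joined)
print("R (missing from PDF):", right_only)
```

Only the J records would go on to the split-and-rename step; the L records are the kickouts worth inspecting.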
I am trying to create an MDM file using HLM 7 Student version, but since I don't have access to SPSS I am trying to import my data using ASCII input. As part of this process I am required to input the data format Fortran style. Try as I might I have not been able to understand this step. Could someone familiar with Fortran (or even better HLM itself) explain to me how this works? Here is my current understanding
From the example EG3.DAT they give
(A4,1X,3F7.1)
I think
A4 signifies that the ID is 4 characters long.
1X means skip a space.
F.1 means that it should read 1 decimal place.
I am very confused about what 3F7 might mean.
EG3.DAT
2020 380.0 40.3 12.5
2040 502.0 83.1 18.6
2180 777.0 96.6 44.4
Below are examples from the help documents.
[Images: Rules for format statement; Format statement example; EG1 data format; EG2 data format; EG3 data format]
One similar question is Explaining Fortran Write Format. Unfortunately it does not explicitly treat the F descriptor.
3F7.1 means 3 floating point numbers, each occupying 7 characters, each with one digit after the decimal point. Leading characters are blanks.
For reading, the .1 only matters when a field contains no explicit decimal point (the last digit is then taken as the fractional part). When the decimal point is present in the data, as it is here, the number is simply read from those 7 characters.
You guessed the meaning of A4 (string of four characters) and 1X (one blank) correctly.
In Fortran, so-called data edit descriptors (which format the input or output of data) may have repeat specifications.
In the format (A4,1X,3F7.1) the data edit descriptors are A4 and F7.1. Only F7.1 has a repeat specification (the number before the F). This simply means that the format is as though the descriptor appeared repeated: like F7.1, F7.1, F7.1. With a repeat specification of 1, or not given, there is just the single appearance.
The format of the question, then, is like
(A4,1X,F7.1,F7.1,F7.1)
This format is one that is covered by the rules provided in one of the images of the question. In particular, the aspect of repeat specification is given in rule 2 with the corresponding example of rule 3.
Further, in Fortran proper, a repeat count specifier may also be * as a special case: that's like an exceptionally large repeat count. *(F7.1) would be like F7.1, F7.1, F7.1, .... I see no indication that this is supported by HLM, but if this is needed a very large repeat count may be given instead.
In 1X the 1 isn't a repeat specification but an integral, and necessary, part of the position edit descriptor.
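To make the repeat rule concrete, here is a rough Python sketch that expands a format such as (A4,1X,3F7.1) and slices a fixed-width line accordingly. It only handles A, F and nX, the function names are my own, and the sample line is padded to 7-character fields on the assumption that the real EG3.DAT is laid out that way. It is an illustration, not how HLM or a Fortran compiler parses formats.

```python
import re


def expand_format(fmt):
    """Expand repeat counts: '(A4,1X,3F7.1)' -> ['A4', '1X', 'F7.1', 'F7.1', 'F7.1']."""
    items = []
    for token in fmt.strip().strip("()").split(","):
        token = token.strip().upper()
        data = re.fullmatch(r"(\d*)([AF]\d+(?:\.\d+)?)", token)
        if data:
            # Data edit descriptor; the leading number (if any) is a repeat count.
            items.extend([data.group(2)] * int(data.group(1) or 1))
        elif re.fullmatch(r"\d+X", token):
            # Position descriptor; here the number is part of the descriptor itself.
            items.append(token)
        else:
            raise ValueError(f"unsupported descriptor: {token}")
    return items


def read_record(line, fmt):
    """Slice one fixed-width line according to the expanded format."""
    fields, pos = [], 0
    for item in expand_format(fmt):
        if item.endswith("X"):
            pos += int(item[:-1])          # skip columns
            continue
        width = int(item[1:].split(".")[0])
        text = line[pos:pos + width]
        pos += width
        # Assumes numeric fields contain an explicit decimal point.
        fields.append(text if item[0] == "A" else float(text))
    return fields


# Sample record built with 7-character numeric fields (assumed layout).
line = f"2020 {380.0:7.1f}{40.3:7.1f}{12.5:7.1f}"
expanded = expand_format("(A4,1X,3F7.1)")
record = read_record(line, "(A4,1X,3F7.1)")
print(expanded)
print(record)
```

If a field were one character narrower or wider than the format says, every later field would be sliced at the wrong offset, which is why the widths have to match the file exactly.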
Procedure for making an MDM file from Excel for HLM:
-Make sure ALL the characters in ALL the columns line up
  Select a column, then right-click and select Format Cells
  Then click on 'Custom', go to the 'Type' box and enter the number of 0s you need to line everything up
-Remove all the tabs from the document and replace them with spaces
  Open the document in Word and use find and replace
-To save the document as .dat
  First save it as .txt
  Then open it in Notepad and save it as .dat
To enter the data format (FORTRAN-Style)
The program reads the data file character by character, so you have to specify the format exactly for the whole set to be read properly.
If something is off, even by a single space, then your descriptive stats will be wonky compared to what you get when you check them in another program.
Enclose the code in parentheses ()
Separate the entries with commas ,
-Need ID column for all levels
  ID column needs to be sorted so that it is in order from smallest to largest
  Use A# with # being the number of characters in the ID
  Use 1X to move from the ID to the next column
-Need to say how many characters are needed in each column
  Use F# (# = number): after the F comes the number of characters needed for that column
  There need to be enough character spaces to provide one 'gap' space between each column
  There need to be enough character spaces to allow for the decimal point
  As part of the F you need to specify the number of decimal places
  You do this by adding a decimal point after the F number and then the number of decimal places you need: F#.#
  You can put a number in front of the F to 'repeat' it: #F#.# (not necessary though)
All in all, it should look something like this:
(A4,1X,F4.0,F5.1)
Helpful links:
https://books.google.de/books?id=VdmVtz6Wtc0C&pg=PA78&lpg=PA78&dq=data+format+fortran+style+hlm&source=bl&ots=kURJ6USN5e&sig=fdtsmTGSKFxn04wkxvRc2Vw1l5Q&hl=en&sa=X&ved=0ahUKEwi_yPurjYrYAhWIJuwKHa0uCuAQ6AEIPzAC#v=onepage&q&f=false
http://www.ssicentral.com/hlm/help6/error/Problems_creating_MDM_files.pdf
http://www.ssicentral.com/hlm/help7/faq/FAQ_Format_specifications_for_ASCII_data.pdf
My present data looks like the sample below. It contains 100 rows:
1,Ads,,12,CDMA,,12
2,,12,14,CDMA,,12
..
...
100,DVS,13,,CDMA,12,22
I am using GetFile --> SplitText --> ExtractText to split the data into rows, with 10 regex attributes for my present data.
For example, one of my input regexes is (.+),(.+),,(.+),(.+),(.+). It splits a row into regex.1, regex.2, up to regex.5.
For this data I have given the ExtractText processor 10 regex attributes to match all the values in the present data.
In the future another 100 rows will be added to the present data, so I would have to write regex attributes for those 100 rows as well.
I also need expression language support for all columns of the extracted data in the processor.
Is it possible to give one common regex for all the data in the ExtractText processor?
Is there any other way to extract the data by a delimiter, such as a comma or pipe symbol, in NiFi?
Any help appreciated.
Could anyone please help me solve this?
I just found a common regex to extract my data from the CSV file:
([^,]*?),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)
It could be expensive on large inputs, but it may be better than this: (.+),(.+),,(.+),(.+),(.+)
It may be helpful for someone.
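As a quick check outside NiFi, here is a small Python sketch that runs the seven-group pattern over the sample rows from the question. Python's `re` is not the Java regex engine NiFi uses, so treat it as an illustration of the pattern only; the variable names are my own.

```python
import re

# The seven-group pattern from the post: each group is "anything but a comma",
# so empty columns are captured as empty strings.
ROW = re.compile(r"([^,]*?),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)")

rows = [
    "1,Ads,,12,CDMA,,12",
    "2,,12,14,CDMA,,12",
    "100,DVS,13,,CDMA,12,22",
]

parsed = [ROW.fullmatch(row).groups() for row in rows]
for groups in parsed:
    print(groups)
```

Because no group can contain a comma, the empty fields may appear in any column and the same pattern still matches every row, which is what makes it reusable for the rows added later.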
I've written a program to parse through a log file. The file in question is around a million entries or so, and my intent is to remove all duplicate entries by date. So if there's 100 unique log-ins on a date, it will only show one log-in per name. The log output I've created is in the form:
AA 01/Jan/2013
AA 01/Jan 2013
BB 01/Jan 2013
etc. etc. all through the month of January.
This is what I've written so far, the constant i in the for loop is the amount of entries to be sorted through and namearr & datearr are the arrays used for name and date. My end game is to have no repeated values in the first field that correspond to each date. I'm trying to follow proper etiquette and protocols so if I'm off base with this question I apologize.
My first thought in solving this myself is to nest a for loop to compare all previous names to the date, but since I'm learning about Data Structures and Algorithm Analysis, I don't want to creep up to high run times.
if(inFile.is_open())
{
    for(int a=0;a<i;a++)
    {
        inFile>>name;//Take input file name
        namearr[a]=name;//Store file name into array
        //If names are duplicates, erase them
        if(namearr[a]==temp)
        {
            inFile.ignore(1000,'\n');//If duplicate, skip to next line
        }
        else
        {
            temp=name;
            inFile.ignore(1,' ');
            inFile>>date;//Store date
            datearr[a]=date;//Put date into array
            inFile.ignore(1000,'\n');//Skip to next line
            cout<<namearr[a]<<" "<<datearr[a]<<endl;//Output code to window
            oFile<<namearr[a]<<" "<<datearr[a]<<endl;//Output code to file
        }
    }
}
Ughhh... You better use a Regular Expression library to easily deal with that size of a file. Check Boost Regex
http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/index.html
You can construct a key composed of the name and the date with simple string concatenation. That string becomes the index to a map. As you are processing the file line by line, check to see if that string is already in the map. If it is, then you have encountered the name on that day once before. If you've seen it already do one thing, if it's new do another.
This is efficient because you're constructing a string that will only be found a second time if the name has already been seen on that date and maps efficiently search the space of keys to find if a key exists in the map or not.
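A minimal sketch of that idea, written in Python for brevity (in C++ the same role could be played by a std::unordered_set or std::map of strings). The function name and sample lines are made up, and a set is used instead of a map because only membership matters here.

```python
def unique_logins(lines):
    """Keep the first occurrence of each (name, date) pair, in file order."""
    seen = set()
    kept = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue                      # skip blank or malformed lines
        name, date = parts[0], parts[1]
        key = name + "|" + date           # composite key: name + date
        if key not in seen:               # average O(1) lookup
            seen.add(key)
            kept.append((name, date))
    return kept


# Hypothetical log excerpt.
log = [
    "AA 01/Jan/2013",
    "BB 01/Jan/2013",
    "AA 01/Jan/2013",
    "AA 02/Jan/2013",
]

result = unique_logins(log)
for name, date in result:
    print(name, date)
```

This is a single pass over the file, so the work grows roughly linearly with the number of entries instead of quadratically as with a nested loop.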
I'm working with a big Excel file that has a lot of information on businesses my company works with. I just imported another large Excel file and there was a difference in format. The larger file we already have has the address, state and zip code in separate columns, each spaced two apart like so:
I didn't make this spreadsheet, or else I wouldn't have laid out the columns like that, but that's how the lady who works with it likes it.
The problem is that the sheet I imported has the city, state, and zip info all in the same cell like this:
Trollville, NY 12345
I have already moved over the states, since 99% of the new ones were in the same state and a quick Find and Replace All worked. I'm now left with this:
Trollville 12345
I want to move that zip code four columns to the right into the proper cell. I wrote a basic regex, but I don't know much about Excel VBA since I haven't used it in years. This is what I've come up with; I just don't know how to tell VBA to output the matches (which I made into an array) into the appropriate column. This is what I have so far:
Function findZipCode(zipCode)
    Dim regEx As New VBScript_RegExp_55.RegExp
    Dim matches, s
    regEx.Pattern = "\s\d{5}\W"
    regEx.Global = True
    s = ""
    If regEx.Test(zipCode) Then
        Set matches = regEx.Execute(zipCode)
        For Each Match In matches
            s = s & Match.Value
        Next
        findZipCode = s
    Else
        findZipCode = ""
    End If
End Function
What do I need to add? I'm open to alternative methods too if there is an easier way to do this.
Thanks in advance for the advice
Can you use the in-built Excel Worksheet Functions?
Place this in the target column: =RIGHT(A2,5). It returns the rightmost 5 characters of your string, so it works if all of your data values have a 5-digit zip code at the end.
Alternatively, you could wrap it in a conditional such as IF(ISNUMBER(VALUE(RIGHT(A2,5))),RIGHT(A2,5),""), which would add a layer of validation to the process.
Also, did you know there is an option that may do this for you automatically if your data is comma (or space) delimited? Data ribbon -> Text to Columns.
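If you would rather sanity-check the rule outside Excel first, here is a small Python sketch of the same idea: take a 5-digit zip only when it sits at the very end of the cell, otherwise leave the cell alone. The function name is made up. Note that a pattern which requires a character after the five digits cannot match a zip that ends the string, which is why this one anchors on the end instead.

```python
import re

# City text, then a separator, then exactly five digits at the end of the cell.
ZIP_AT_END = re.compile(r"^(.*?)[\s,]+(\d{5})\s*$")


def split_city_zip(cell):
    """Return (city, zip); zip is '' when the cell does not end in 5 digits."""
    match = ZIP_AT_END.match(cell)
    if match is None:
        return cell.strip(), ""
    return match.group(1).strip(), match.group(2)


for cell in ["Trollville 12345", "Trollville", "Springfield 1234"]:
    print(repr(cell), "->", split_city_zip(cell))
```

The second and third cells come back with an empty zip, which mirrors what the IF(ISNUMBER(...)) wrapper is meant to guard against.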
I am looking to convert a scientific notation number into the full integer number.
E.g:
8.18234E+11 => 818234011668
Excel reformatted all my UPC codes within a CSV and this solution is not working for me.
I have my csv open in Notepad++ and would love to do this using a regex find and replace.
Thanks.
The damage is already done and cannot be recovered from the CSV file. 8.18234E+11 could be anything* from 818233500000 to 818234499999.
To prevent Excel from rounding large numbers, you need to store them as text. If you set the cell format to text, any value inserted from then on should be automatically interpreted as text. In OpenOffice Calc (I don't have MS Excel), you can also prefix a numeric value with ' to get it interpreted as text no matter the cell format.
There is a chance that the correct value is stored in the original XLS (or XLSX or ODS or the live Excel session or ...) file. If not, then you'll have to enter the data again. If the data is there, you need to store it as text or increase the number of significant digits in the exported CSV. If you only have the exported data, then you're out of luck.
*UPC has a single check digit, so only 100 000 out of the 1 000 000 codes are actually valid UPC codes.
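To put numbers on that, here is a short Python sketch. It assumes the codes are 12-digit UPC-A with the usual check digit rule (three times the sum of the odd-position digits plus the sum of the even-position digits, rounded up to a multiple of 10); the function names are my own.

```python
def upc_check_digit(first11):
    """Check digit for the first 11 digits of a UPC-A code."""
    digits = [int(c) for c in first11]
    if len(digits) != 11:
        raise ValueError("expected exactly 11 digits")
    odd = sum(digits[0::2])    # positions 1, 3, ..., 11
    even = sum(digits[1::2])   # positions 2, 4, ..., 10
    return (10 - (3 * odd + even) % 10) % 10


def is_valid_upc(code):
    """True when a 12-digit string carries the right check digit."""
    return len(code) == 12 and code.isdigit() and upc_check_digit(code[:11]) == int(code[11])


# The value from the question collapses to six significant digits.
shown = format(float(818234011668), ".5E")

# Every integer in this range is displayed the same way.
lo, hi = 818233500000, 818234499999
candidates = hi - lo + 1
# Each 11-digit prefix has exactly one valid check digit.
valid = hi // 10 - lo // 10 + 1

print(shown, candidates, valid, is_valid_upc("818234011668"))
```

So the check digit narrows the possibilities by a factor of ten, but that still leaves far too many candidates to recover the original code from the rounded value alone.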