1. Parsing problem
- Posted by Patrick Barnes <mrtrick at gmail.com> Aug 16, 2005
- 616 views
I have some finance data that my bank insists on lumping all together
into one field. A sample of the data is below:

Visa Purchase 26DECThe Carss Park Super Carss Par
Non Stg/Bsa Atm Wdl Fee
Internet Deposit 05JAN23:11itamoney
Visa Purchase 03JANWorld Vision Of Aust Burwood E
Adi Limited28686
Visa Cash Advance 07JANEur210.00 Banco Di Brescia
Visa Purchase 18JANOptus Tv/Net Autopay Chatswood
Internet Deposit 25JAN10:05shopping
Atm Withdrawal 25JAN11:09Westpaccrlngfrd 2 O/S Carlingfor2 Au
Adi Limited28686
Atm Withdrawal 27JAN07:37St.George Telopea Hair Telopea Nsw \Au
Atm Withdrawal -Cba 27JAN09:20Cba Atm Uts B'Way Op Nsw 228498 Aus
Visa Cash Advance 21JANEur120.00 Banco Di Brescia
Atm Withdrawal -Cba 28JAN08:48Cba Atm Uts B'Way Op Nsw 228498 Aus
Visa Purchase 25JANWoolworths W1122 Carlingfo
Visa Purchase 26JANColes Express Dundas Dundas
Visa Purchase 25JANVodafone Chatswood
Internet Deposit 29JAN09:57billls....
Eftpos Purchase 29JAN10:28N R M A Mcc North Ryde
Atm Withdrawal - O/Bank 31JAN12:34Garden Island Sydney Au
O/Seas Cash Withdrawal Fee
Visa Purchase 27JANBilo Telopea 4106 Telopea
Non Stg/Bsa Atm Wdl Fee
Atm Withdrawal 01FEB19:25St.George Telopea Hair Telopea Nsw \Au
Visa Purchase 29JANIkea Homebush Bay Rhodes
Visa Purchase 28JANWoolworths W1200 Eastwood
Visa Purchase 29JANPlatinum Communicatn Rhodes
Visa Purchase 29JANCaci Clinic Edgecliff
Visa Purchase 31JANWorld Vision Of Aust Burwood E
Visa Purchase 28JANVodafone Chatswood
Visa Purchase 31JANWoolworths W1200 Eastwood
Tfr Wdl Bpay Internet 03FEB19:0311174893 Integral Energy

Each line is a single field in the incoming .csv file.
Any suggestions on how to parse it? In most but not all cases, the
second 'column' starts at about the 29th element. Sometimes the date
is given, sometimes date and time, sometimes nothing.

--
MrTrick
----------
2. Re: Parsing problem
- Posted by don cole <doncole at pacbell.net> Aug 16, 2005
- 587 views
Patrick Barnes wrote:
> I have some finance data that my bank insists on lumping all together
> into one field. A sample of the data is below:
>
> <SNIP>
>
> Each line is a single field in the incoming .csv file.
> Any suggestions on how to parse it? In most but not all cases, the
> second 'column' starts at about the 29th element. Sometimes the date
> is given, sometimes date and time, sometimes nothing.
>
> --
> MrTrick
> ----------

I would start off with something like:
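function splitline(sequence line)
    -- split at the first wide gap (two or more spaces) between the columns
    integer a
    a = match("  ", line)
    return {trim(line[1..a])} & {trim(line[a..length(line)])}
end function

---first pass----------
-- assumes fn is an open file handle and that trim() is available
object line
sequence good
good = {}
while 1 do
    line = gets(fn)
    if atom(line) then
        exit  -- gets() returns -1 at end of file
    end if
    if match("  ", line) then
        good = append(good, splitline(line))
    else
        good = append(good, {trim(line)})
    end if
end while

------------second pass-------
-- give every row exactly two columns
sequence col
col = repeat({}, 2)
col = repeat(col, length(good))
for x = 1 to length(good) do
    if length(good[x]) = 2 then
        col[x][1] = good[x][1]
        col[x][2] = good[x][2]
    else
        col[x][1] = good[x][1]
        col[x][2] = {}
    end if
end for

From there you could put in match("JAN", col[x][2])

or

-- SHORTMONTHS must hold upper-case names ("JAN", "FEB", ...) to match this data
integer a
sequence day, month
for x = 1 to length(col) do
    for m = 1 to length(SHORTMONTHS) do  -- see my moredates
        a = match(SHORTMONTHS[m], col[x][2])
        if a = 3 then  -- two-digit day first, then the month
            day = col[x][2][1..2]
            month = col[x][2][3..5]
        end if
    end for
end for

etc. etc. You could cut out any information you wanted in this fashion.

Don Cole, SF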
3. Re: Parsing problem
- Posted by cklester <cklester at yahoo.com> Aug 16, 2005
- 623 views
Patrick Barnes wrote:
> I have some finance data that my bank insists on lumping all together
> into one field. A sample of the data is below:
>
> Visa Purchase 26DECThe Carss Park Super Carss Par
> Non Stg/Bsa Atm Wdl Fee
> Internet Deposit 05JAN23:11itamoney
> Visa Purchase 03JANWorld Vision Of Aust Burwood E
> Adi Limited28686
> ...
> Each line is a single field in the incoming .csv file.

That's horrible. You should sue them. :D

> Any suggestions on how to parse it?

It looks like the first field is standard width (though you said that
varies). I've had to write code to parse inconsistent files before. It's
not that difficult as long as you can get a handle on the exceptions.

I'd start by splitting at the long space between the 1st and 2nd fields.
Then I'd search for the times and dates (use the colon) and split those up.

Depending on how many lines you have, you might want to just do it
manually. Ouch.

-=ck
"Programming in a state of EUPHORIA."
http://www.cklester.com/euphoria/
4. Re: Parsing problem
- Posted by DB James <larch at adelphia.net> Aug 16, 2005
- 618 views
Patrick Barnes wrote:
> I have some finance data that my bank insists on lumping all together
> into one field. A sample of the data is below:
>
> Visa Purchase 26DECThe Carss Park Super Carss Par
> Non Stg/Bsa Atm Wdl Fee
> Internet Deposit 05JAN23:11itamoney
<SNIP>
> Visa Purchase 31JANWorld Vision Of Aust Burwood E
> Visa Purchase 28JANVodafone Chatswood
> Visa Purchase 31JANWoolworths W1200 Eastwood
> Tfr Wdl Bpay Internet 03FEB19:0311174893 Integral Energy
>
> Each line is a single field in the incoming .csv file.
> Any suggestions on how to parse it? In most but not all cases, the
> second 'column' starts at about the 29th element. Sometimes the date
> is given, sometimes date and time, sometimes nothing.
>
> --
> MrTrick
> ----------

Hi.

It appears as if the second set of data in each line always begins with a
number, so you have a basis for separation there, as well as the tabbing
or spacing otherwise. (Question: are the wide spaces tabs originally? If
so, that would make parsing easy.)

Also, the first part of the second set either ends with a number (the
time) or ends with a three-letter word for the month, so this can create a
"rule" for separation of the second set.

It would be handy if the bank would provide a copy of their rules for
generating these lines (some techy deep in the bowels of the bank would
know, but they may not let him/her out of the cage for public
communication :^D )

--Quark
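The rule described above can be sketched in Euphoria roughly as follows. This is only an illustration, not code from any poster: it assumes the second column has already been separated from the first, and that a date is always two digits followed by three upper-case letters, optionally followed directly by hh:mm, as in the sample. The names parse_stamp(), is_digit() and is_upper() are hypothetical.

```euphoria
-- Sketch only: parse_stamp(), is_digit() and is_upper() are made-up helpers.
-- parse_stamp() takes the second "column" of a line and returns
-- {date, time, rest}; date and time are "" when they are absent.

function is_digit(integer c)
    return c >= '0' and c <= '9'
end function

function is_upper(integer c)
    return c >= 'A' and c <= 'Z'
end function

function parse_stamp(sequence field)
    sequence day_mon, hh_mm
    day_mon = ""
    hh_mm = ""
    if length(field) >= 5 then
        -- DDMON: two digits followed by three capital letters
        if is_digit(field[1]) and is_digit(field[2]) and
           is_upper(field[3]) and is_upper(field[4]) and is_upper(field[5]) then
            day_mon = field[1..5]
            field = field[6..length(field)]
            -- optional hh:mm directly after the date
            if length(field) >= 5 then
                if is_digit(field[1]) and is_digit(field[2]) and
                   field[3] = ':' and
                   is_digit(field[4]) and is_digit(field[5]) then
                    hh_mm = field[1..5]
                    field = field[6..length(field)]
                end if
            end if
        end if
    end if
    return {day_mon, hh_mm, field}
end function

sequence r
r = parse_stamp("05JAN23:11itamoney")
-- r is {"05JAN", "23:11", "itamoney"}
r = parse_stamp("26DECThe Carss Park Super")
-- r is {"26DEC", "", "The Carss Park Super"}
```

Checking the shape (digits, then capitals) rather than a list of month names keeps the sketch short, at the cost of accepting something like "12ABC" as a date; comparing field[3..5] against a month table would tighten it.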
5. Re: Parsing problem
- Posted by Karl Bochert <kbochert at copper.net> Aug 16, 2005
- 650 views
Patrick Barnes wrote:
> I have some finance data that my bank insists on lumping all together
> into one field. A sample of the data is below:
>
> <SNIP>
>
> Each line is a single field in the incoming .csv file.
> Any suggestions on how to parse it? In most but not all cases, the
> second 'column' starts at about the 29th element. Sometimes the date
> is given, sometimes date and time, sometimes nothing.
>
> --
> MrTrick
> ----------

Seems like a problem for regular expressions. For instance:

constant p_date = "([0-3][0-9](?:JAN|FEB|MAR|APR|MAY| <and so on ..> ))"
constant p_time = "(\\d\\d:\\d\\d)?"  -- the time is optional
constant p_item = "(.*?)"             -- lazy, so it stops at the first date
constant p_tail = "(.*)"

RGXscan (input_line, p_item & p_date & p_time & p_tail)

then:

RGXsubstring(1) -- returns the part before the date
RGXsubstring(2) -- returns the date
RGXsubstring(3) -- returns the time, if any

N.B. The functions above come from my EU-PCRE -- other regex
implementations will be similar. Lines with no date at all will not match
this pattern and need separate handling. The code is not tested and is
probably wrong in some details.
6. Re: Parsing problem
- Posted by jacques deschênes <desja at globetrotter.net> Aug 17, 2005
- 592 views
Hi Patrick,

First thing to do, complain to your bank.

Regards,
jacques Deschênes
7. Re: Parsing problem
- Posted by Patrick Barnes <mrtrick at gmail.com> Aug 17, 2005
- 688 views
Thank you for all your suggestions, I'll try implementing them into a parser.
(And those gaps are spaces, not tabs)

--
MrTrick
-------------------------------------------------------------------------------------
Catapultum habeo.
Nisi pecuniam omnem mihi dabis, ad caput tuum saxum immane mittam
8. Re: Parsing problem
- Posted by "Kat" <gertie at visionsix.com> Aug 17, 2005
- 596 views
On 16 Aug 2005, at 22:26, Patrick Barnes wrote:
> I have some finance data that my bank insists on lumping all together
> into one field. A sample of the data is below:
>
> <SNIP>

Could you send me the file without email linewraps? Or however it's done
to post a .zip to the euforum? Or something else?

> Each line is a single field in the incoming .csv file.
> Any suggestions on how to parse it? In most but not all cases, the
> second 'column' starts at about the 29th element. Sometimes the date
> is given, sometimes date and time, sometimes nothing.

I have an idea that's easy with strtok v3 (not uploaded yet, not finished
either) parses(), if I understand what's there. Can you provide any more
info on what's on each line?

Kat
9. Re: Parsing problem
- Posted by Patrick Barnes <mrtrick at gmail.com> Aug 17, 2005
- 598 views
On 8/16/05, Kat <gertie at visionsix.com> wrote:
> Could you send me the file without email linewraps? Or however it's done to
> post a .zip to the euforum? Or something else?

No. This is my bank transaction history.

> I have an idea that's easy with strtok v3 (not uploaded yet, not finished either)
> parses(), if i understand what's there. Can you provide any more info on
> what's on each line?

Thanks, but that's fine now, I've used the suggestions in the forum,
an 'explode', 'trim', and a func to split at 2-or-more-space
boundaries... and it works nicely.

--
MrTrick
----------
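For readers who want to see the shape of the "trim plus split at 2-or-more-space boundaries" approach, here is a minimal Euphoria sketch using only built-ins. It is not Patrick's code, which was not posted; trim_spaces() and split_gaps() are hypothetical names, and the gap widths in the sample line are illustrative because the archive collapsed the original runs of spaces.

```euphoria
-- Sketch only: trim_spaces() and split_gaps() are made-up names.

function trim_spaces(sequence s)
    -- strip leading and trailing spaces, tabs and line ends
    integer first, last
    first = 1
    last = length(s)
    while first <= last and find(s[first], " \t\r\n") do
        first += 1
    end while
    while last >= first and find(s[last], " \t\r\n") do
        last -= 1
    end while
    return s[first..last]
end function

function split_gaps(sequence line)
    -- break the line wherever two or more spaces occur in a row
    sequence fields
    integer gap
    fields = {}
    line = trim_spaces(line)
    gap = match("  ", line)
    while gap do
        fields = append(fields, line[1..gap-1])
        line = trim_spaces(line[gap..length(line)])
        gap = match("  ", line)
    end while
    return append(fields, line)
end function

sequence parts
parts = split_gaps("Visa Purchase        26DECThe Carss Park Super   Carss Par\n")
-- parts is {"Visa Purchase", "26DECThe Carss Park Super", "Carss Par"}
```

Single spaces inside a field ("Visa Purchase", "The Carss Park Super") are left alone, so only the wide gaps act as column boundaries. A line with no wide gap comes back as a one-element sequence.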
10. Re: Parsing problem
- Posted by "Kat" <gertie at visionsix.com> Aug 17, 2005
- 593 views
On 17 Aug 2005, at 21:25, Patrick Barnes wrote:
> On 8/16/05, Kat <gertie at visionsix.com> wrote:
> > Could you send me the file without email linewraps? Or however it's done to
> > post a .zip to the euforum? Or something else?
>
> No. This is my bank transaction history.

I meant only what you posted to euforum anyhow, nothing additional.

> > I have an idea that's easy with strtok v3 (not uploaded yet, not finished
> > either) parses(), if i understand what's there. Can you provide any more
> > info on what's on each line?
>
> Thanks, but that's fine now, I've used the suggestions in the forum,
> an 'explode', 'trim', and a func to split at 2-or-more-space
> boundaries... and it works nicely.

Ok, no pressing need for strtok v3 noted.

Kat