Programming Shootout: reverse-complement

new topic     » topic index » view thread      » older message » newer message

We seem pretty close to filling up all the benchmarks in the Programming 
Shootout. I've just come back to Euphoria after trying it a few years ago,
liking it, but ultimately leaving it because it wasn't widely used or 
supported. Now, I'd like to try to change that; I still like the language.
So to help refresh myself I did this one, which is based on the fasta
benchmark.

There are probably some ways to further optimize it, but I think I've done
an OK job there. Not as good on code style though. The "anticipate certain
subsets of data" optimization might disqualify it too, but that wouldn't be
hard to rip out.

I've only tested it in Windows but it will probably work fine on Linux.
(I'll find out soon since I've got a new distro to try)

--reverse complements of DNA sequences (for the Great Programming Shootout)
--By James W. Hofmann 2007
--This code is public domain code. You may reuse, redistribute or modify it
--however you wish.

sequence argc
sequence in,dseq,headers,lseq
integer char
integer file
integer head -- the header we're on
integer charcount, newlinecount
integer ALLOCATE -- guessed length of input sequence
constant MAXLINE = 60 -- specified line length
ALLOCATE = 10000 -- guess how many chars needed; adjustable for different N-size
in = {repeat(0,ALLOCATE),repeat(0,ALLOCATE),repeat(0,ALLOCATE)}
headers = {{},{},{}}
dseq = {{},{},{}}

-- optimization for each of the three sequence types

dseq[1] = {"ACGT",
        "TGCA"} -- uppercase only
dseq[2] = {"acgtMRWSYKVHDBN",
        "TGCAKYWSRMBDHVN"} -- lowercase acgt, uppercase ambiguous codes 
dseq[3] = {"acgt",
        "TGCA"} -- lowercase only

lseq = {1,1,1} -- real length of each sequence
head = 0
charcount = 0 -- overall position in a sequence's output
newlinecount = 0 -- we have to recreate newlines when reversing
argc = command_line()

file = open(argc[3],"rb")

function getComp(integer code, integer headerstyle)
		for m=1 to length(dseq[headerstyle][1]) do
			if code=dseq[headerstyle][1][m] then
				code=dseq[headerstyle][2][m]
				return code
			end if
		end for	
		return code -- don't change unrecognized stuff
end function

while 1 do
    char = getc(file)    
    if char=-1 then
        exit   -- EOF
	elsif char='>' then
		head+=1
		while char!='\n' do -- write header
			if char>13 then -- take only printable stuff here
				headers[head]&=char				
			end if
			char = getc(file)
		end while
		headers[head]&=char -- add LF (but not CR)
	else
		if char>13 then -- a real character
		    lseq[head]+=1
		    in[head][lseq[head]]=getComp(char,head) -- convert chars one at a time
		    -- we will add newlines when we write
		end if
			
		if lseq[head]>=ALLOCATE then -- anticipate more space being needed
			in[1] = in[1] & repeat(0,ALLOCATE)
			in[2] = in[2] & repeat(0,ALLOCATE)
			in[3] = in[3] & repeat(0,ALLOCATE)
			ALLOCATE *= 2 -- this expands the expected allocation with larger requests
		end if

	end if
end while

close(file)

file = 1

for seqnum=1 to 3 do
	
	puts(file, headers[seqnum])

	newlinecount = lseq[seqnum]-MAXLINE+1
	for n = lseq[seqnum] to 1 by -1 do
		puts(file, in[seqnum][n])
		if n<=newlinecount and n>2 then -- n>2 lets the end of the sequence 
										-- terminate naturally
			puts(file,'\n')
			newlinecount-=MAXLINE
		end if
	end for

	puts(file,'\n')

end for

close(file)


new topic     » topic index » view thread      » older message » newer message

Search



Quick Links

User menu

Not signed in.

Misc Menu