Re: find repeated sub-strings

new topic     » topic index » view thread      » older message » newer message

Hello Kat and everyone else,

>You can use the wildmatch to find tokens (words) containing "c" and to find
>words with "te" in them. You'd need to write the loop, picking out what
>parts of words you wish to look for. I suggest looking for the biggest
>parts
>first. And be recursive. As will all compression schemes, the tighter the
>compression, the longer it takes to compress it, cause the more particles
>you haveto look for in the entire text.
>
> > in this example: "the coyote ate the cat"
<SNIP>
>I'd compress the whole words, including words with afixes, then compress
>the
>afixes. Only then would i compress the insides of the words.
>Kat

Thanks for the suggestions Kat but Michael Nelson provided me
with a function that does just what I asked... And with
slight modification and addition does just what I meant to ask
for. :)

Just in case you or anyone else is interested, I was wanting
this function for a compression algorithm I have been working on.
Here is how it works:

It finds all substrings that are repeated in a string of bytes
and sorts them by how many matches were found in descending order.
(This was done with Michael's code) Then I re-sort these strings
based on a "score" which is calculated partly as the number of
repeats times the length of the string. This score represents
the amount of compression any particular substring will provide
if removed. Then I delete any substrings from the list that that
are just slices of others starting at the lowest score going to
the highest this way I delete the low scores before the higher
ones. After this, I go through the original string and find any
and all bytes that are NEVER used. The first of these is used as
a delimiter. The rest are used to replace the longer repeated
strings in the text. I store one copy of each repeated string in
a "header" string. This is where the delimiter comes in handy. I
store the header in this format:
delimiter & unused_char & repeated_str & delimiter etc...
After the last delimiter I store the new compressed string.
What do you all think of this scheme. I have got up to 17%
compression with this. Has anyone done this before? Can
anyone see any potential flaws and/or innefficiencies? (I've
weeded out some inefficiency already by altering the scoring
mechanism and compression code but can anyone see any others?
One that I have noticed but not fixed yet is that it is
extremely SLOW on long strings, I will look into optomizing
it.

I will post the code as soon as I get a DEcompressor coded.
This will tell me if there is any blatant data corruption and/or
loss in the current algorithm - before I post bad code :). I was
planning on finishing the decompressor BEFORE posting this message
but I forgot my disk at home. :(

thanks for reading and any feedback,
later,
.     __       _____  __    __    __    __    _____
.    /\ \     /\    \|\ \  /\ \  /\ \  /\ \  /\    \
.   /  \_\   /  \____\ \_\/  \_\/  \_\/  \_\/  \____\
.  /   / /  /   / ___/ |  |  / |   / /   / /\  /  __ \
. /   / /  /   / /_\ | |  | /  |  / /   / /\_\/  /_ \/
./   / /\ /   / ___/ | |  |/   | / /   / /\ \ \__  \
.\  / /__\\  / /__\  \ |    /| |/ /\  / /  \/\_\/  /
. \/_____/ \/_____/   \|___/\|___/  \/_/    \_____/
keroltarr at hotmail.com  http://geocities.com/keroltarr/
*Save the trees* = ^Don't^print^this^sig^ ;)
________________________________________________________________________
Get Your Private, Free E-mail from MSN Hotmail at http://www.hotmail.com

new topic     » topic index » view thread      » older message » newer message

Search



Quick Links

User menu

Not signed in.

Misc Menu