OpenEuphoria: Forum: find repeated sub-strings

1. find repeated sub-strings

Posted by Lewis Townsend <keroltarr at HOTMAIL.COM> Apr 28, 2000
595 views

Hello all,

I need a function that finds the most repeated segments in a string.
For example:
If I had a string: "the quick brown fox jumped over the lasy brown dog"
our hypothetical function would find the repeated sub-strings:
" brown " and "the "
I would like this function to also keep track of how many times each
multiple match was matched; like so:
{{2," brown "}, {2, "the "}} -- preferred return format
Also, don't bother returning a string that is less than 2 characters
long. Am I making sense?
Does anyone have code that does this or something very similar?
As you might have guessed, it is for a compression algorithm I have
in mind but I am stumped at this first vital function.
I always run up against possible problems and try to redesign all
over again just to realize another possible flaw in my algorithm.
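
A minimal sketch of the function described above, written in Python rather than Euphoria and not taken from any existing library. The name repeated_substrings, the min_len parameter, and the "drop a segment if a longer repeat with the same count contains it" rule are all assumptions made for illustration. It counts every substring by brute force, so it only suits short strings.

```python
from collections import Counter

def repeated_substrings(text, min_len=2):
    """Return (count, substring) pairs for substrings seen at least twice."""
    # Count every substring of length >= min_len (overlapping occurrences included).
    counts = Counter(
        text[i:i + n]
        for n in range(min_len, len(text))
        for i in range(len(text) - n + 1)
    )
    repeats = {s: c for s, c in counts.items() if c >= 2}
    # Heuristic: keep a segment only if no longer repeated segment
    # with the same count contains it.
    maximal = [
        (c, s) for s, c in repeats.items()
        if not any(s != t and s in t and repeats[t] == c for t in repeats)
    ]
    # Most frequent first, then longest first.
    return sorted(maximal, key=lambda pair: (-pair[0], -len(pair[1]), pair[1]))

result = repeated_substrings("the quick brown fox jumped over the lasy brown dog")
```

For the example string, result should contain the pairs (2, " brown ") and (2, "the "), which mirrors the preferred return format with Python tuples in place of Euphoria sequences.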

any help would be appreciated,

later,
Lewis Townsend

2. Re: find repeated sub-strings

Posted by Kat <gertie at ZEBRA.NET> Apr 27, 2000
567 views
Last edited Apr 28, 2000

----- Original Message -----
From: "Lewis Townsend" <keroltarr at HOTMAIL.COM>
To: <EUPHORIA at LISTSERV.MUOHIO.EDU>
Sent: Thursday, April 27, 2000 10:06 PM
Subject: find repeated sub-strings


> Hello all,
>
> I need a function that finds the most repeated segments in a string.
> For example:
> If I had a string: "the quick brown fox jumped over the lasy brown dog"
> our hypothetical function would find the repeated sub-strings:
> " brown " and "the "
> I would like this function to also keep track of how many times each
> multiple match was matched; like so:
> {{2," brown "}, {2, "the "}} -- prefered return format
> Also, don't bother returning a string that is less than 2 characters
> long. Am I making sense?
> Does anyone have code that does this or something very similar?
> As you might have guessed, it is for a compression algorithm I have
> in mind but I am stumped at this first vital funtion.
> I always run up against possible problems and try to redesign all
> over again just to realize another possible flaw in my algorithm.

Look at the strtoks.e file in WIN Sockets Code for Mirc on the User
Contributions page. It was written specifically for playing with the words
in a sentence. You could:
(UNtested code)

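wordnum = 1
wordcount = numtok(text, 32)     -- 32 = ASCII space, the token delimiter
while wordnum <= wordcount do    -- <= so the last word is not skipped
  oneword = gettok(text, wordnum, 32)
  puts(1, oneword & ": ")
  -- findtok() with 0 is assumed to return the number of matches
  ? findtok(text, oneword, 0, 32)
  wordnum += 1
end while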

That will list each word with how many matches there are for it. There is
also a wildmatch, a rem(ove)tok, a ins(ert)tok, etc for manipulating the
words in the sentence.
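
For readers without strtoks.e, here is a rough Python sketch of what the loop above is described as doing: listing each word with its number of matches. It does not use or reproduce the strtoks.e API; word_counts and its sep parameter are hypothetical names, and empty tokens from repeated delimiters are skipped as an assumption.

```python
def word_counts(text, sep=" "):
    """Pair each word of text with the number of times it occurs."""
    # Split on the delimiter and ignore empty tokens from doubled delimiters.
    words = [w for w in text.split(sep) if w]
    return [(w, words.count(w)) for w in words]

for word, count in word_counts("the coyote ate the cat"):
    print(word + ": " + str(count))
```

This only ever compares whole words, which is the limitation raised in the next message.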

Kat


3. Re: find repeated sub-strings

Posted by Lewis Townsend <keroltarr at HOTMAIL.COM> Apr 28, 2000
576 views

Hello Kat,

>Look at the strtoks.e file in WIN Sockets Code for Mirc on the User
>Contributions page. It was written specifically for playing with the words
>in a sentence. You could:
>(UNtested code)
>
>wordnum = 1
>wordcount = numtok(text, 32)
>while ( wordnum < wordcount ) do
>   oneword = gettok(text,wordnum,32)
>   sprintf(1,oneword & : & space & findtok(text,oneword,0,32) & /n )
>   inc wordnum
>end while
>
>That will list each word with how many matches there are for it. There is
>also a wildmatch, a rem(ove)tok, a ins(ert)tok, etc for manipulating the
>words in the sentence.
>
>Kat

I need something that will ignore words and delimiters such as spaces and
carriage returns.
In this example: "the coyote ate the cat"
the segments "the c" and "te " would be repeated, which doesn't consider
whole words. Does strtoks.e allow this sort of pattern
matching?
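
To make the requirement concrete, a small Python sketch (not Euphoria, and not part of strtoks.e) that counts a segment anywhere in the text, ignoring word boundaries. The helper name count_occurrences is an assumption for illustration.

```python
def count_occurrences(text, segment):
    """Count occurrences of segment in text, allowing overlaps."""
    count = 0
    start = text.find(segment)
    while start != -1:
        count += 1
        # Step forward one character so overlapping matches are counted too.
        start = text.find(segment, start + 1)
    return count

text = "the coyote ate the cat"
print(count_occurrences(text, "the c"))
print(count_occurrences(text, "te "))
```

Both segments span a space, so a tokenizer that splits on spaces first can never see them as a single unit.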

later,
Lewis Townsend

4. Re: find repeated sub-strings

Posted by Kat <gertie at ZEBRA.NET> Apr 28, 2000
565 views

----- Original Message -----
From: "Lewis Townsend" <keroltarr at HOTMAIL.COM>
To: <EUPHORIA at LISTSERV.MUOHIO.EDU>
Sent: Friday, April 28, 2000 1:32 PM
Subject: Re: find repeated sub-strings


> Hello Kat,
>
> >Look at the strtoks.e file in WIN Sockets Code for Mirc on the User
> >Contributions page. It was written specifically for playing with the words
> >in a sentence. You could:
> >(UNtested code)
> >
> >wordnum = 1
> >wordcount = numtok(text, 32)
> >while ( wordnum < wordcount ) do
> >   oneword = gettok(text,wordnum,32)
> >   sprintf(1,oneword & : & space & findtok(text,oneword,0,32) & /n )
> >   inc wordnum
> >end while
> >
> >That will list each word with how many matches there are for it. There is
> >also a wildmatch, a rem(ove)tok, a ins(ert)tok, etc for manipulating the
> >words in the sentence.
> >
> >Kat
>
> I need something that will ignore words and delimiters such as spaces and
> cariage returns.
> in this example: "the coyote ate the cat"
> The segments "the c" and "te " would be repeated which doesn't considder
> whole words. Does strtoks.e allow this sort of pattern
> matching?

You can use the wildmatch to find tokens (words) containing "c" and to find
words with "te" in them. You'd need to write the loop, picking out what
parts of words you wish to look for. I suggest looking for the biggest parts
first. And be recursive. As with all compression schemes, the tighter the
compression, the longer it takes to compress, because there are more particles
you have to look for in the entire text.
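
One way to read "biggest parts first, and be recursive" is a greedy substitution loop, sketched here in Python rather than Euphoria. Everything in it is an assumption for illustration: the function names, the use of Unicode private-use characters (starting at 0xE000) as replacement codes, and a loop standing in for the recursion. It assumes the input contains no private-use characters, and it ignores the cost of storing the table.

```python
def longest_repeat(text, min_len=2):
    """Return the longest segment with two non-overlapping occurrences, or None."""
    for n in range(len(text) // 2, min_len - 1, -1):   # biggest parts first
        for i in range(len(text) - n + 1):
            seg = text[i:i + n]
            if text.count(seg) >= 2:                   # non-overlapping count
                return seg
    return None

def compress(text, min_len=2, first_code=0xE000):
    """Repeatedly replace the longest repeated segment with a one-character code."""
    table = []
    while True:
        seg = longest_repeat(text, min_len)
        if seg is None:
            return text, table
        code = chr(first_code + len(table))
        table.append(seg)
        text = text.replace(seg, code)

def decompress(text, table, first_code=0xE000):
    """Expand codes newest first, since later entries may contain earlier codes."""
    for k in range(len(table) - 1, -1, -1):
        text = text.replace(chr(first_code + k), table[k])
    return text

packed, table = compress("the coyote ate the cat")
assert decompress(packed, table) == "the coyote ate the cat"
```

On the example sentence the first table entry is "the c". Every pass rescans the whole text, which is the time cost mentioned above.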

> in this example: "the coyote ate the cat"

you'd scan for:
*the*
*th*
*he*
*coyote*
*coyot*
*coyo*
*coy*
*co*
*oyote*
*yote*
*ote*
*te*
*oyot*
*oyo*
*oy*
*yo*
*ote*
*ate*
*at*
*the*
*th*
*cat*
*ca*
*te*
....etc....
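
A scan list like the one above can be generated rather than typed out. The following is a Python sketch, not the strtoks.e wildmatch: candidate_patterns and matching_words are hypothetical helpers, the longest-first ordering is an assumption, and Python's standard fnmatch module stands in for the wildcard matching.

```python
from fnmatch import fnmatchcase

def candidate_patterns(word, min_len=2):
    """List "*sub*" patterns for every substring of word, longest first."""
    subs = []
    for n in range(len(word), min_len - 1, -1):
        for i in range(len(word) - n + 1):
            sub = word[i:i + n]
            if sub not in subs:          # skip duplicates such as repeated pairs
                subs.append(sub)
    return ["*" + s + "*" for s in subs]

def matching_words(text, pattern):
    """Return the words of text that match a wildcard pattern."""
    return [w for w in text.split() if fnmatchcase(w, pattern)]

print(candidate_patterns("the"))
print(matching_words("the coyote ate the cat", "*te*"))
```

Because the text is split into words before matching, this still cannot find a segment such as "the c" that crosses a space.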

I'd compress the whole words, including words with affixes, then compress the
affixes. Only then would I compress the insides of the words.
Kat
