Re: Clustering (was: Neural Network Customer)
- Posted by jguy <jguy at ALPHALINK.COM.AU> Feb 21, 1999
noah smith wrote:

> If I remember correctly, what he's talking about here is separating cells
> into 2 groups. There are eight "factors" which can describe a given cell,
> some of which determine if the cell is abnormal. The problem is, there are
> 3 types of cells: "normal" run-of-the-mill cells, "stem" or really good
> cells, and "cancer" or really bad cells. They want stem cells, and don't
> want cancer cells, but the same data which defines a stem cell appears to
> also define a cancer cell. They have to find a "line" (the "line" is a
> seven-dimensional structure) which separates the eight-dimensional data
> set into 2 groups.

That reminds me of a similar problem in my field (linguistic decipherment). A word can belong to more than one cluster. E.g. "plant" belongs to at least two semantic clusters: botany and manufacturing.

The stumbling block I had encountered was that clustering algorithms were designed to identify points as members of one cluster and only one, or to assign them "in between" adjacent clusters. In reality, in my field, a point can very well belong to two or more clusters which are far apart (I called them "disjunct clusters", by analogy with "disjunct" morphemes).

That was quite some time ago. I fell into the usual traps: unwittingly overfitting the data, and using neural nets (with such data, they never reach a stable state, but keep "changing their minds" -- it's quite funny, you feel you are dealing with a hopeless human being).

Much later, I think it was early last year or late in 1997, during an exchange on the Voynich interest group (it's about an undeciphered medieval manuscript), I hit upon the seed of a clustering algorithm that would allow points to belong to more than one cluster. I tested it by hand and it seemed to work. I left it at that, because then I was routinely using Borland Pascal, and the memory allocation and deallocation was just too much of a nightmare for me to contemplate.
(To get interesting results on the Voynich language, I would have needed 2000x2000 matrices.) Whereas now that I have become rather fluent in Euphoria...

The other thing, perhaps worth repeating, is that I did try neural nets on that sort of data, and that I fell flat on my face. They didn't work for me.
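To make the "disjunct clusters" idea concrete: the original algorithm was never posted, so the following is only a sketch (in Python, for brevity), under the simplest possible assumption -- a point joins EVERY cluster whose centroid lies within some radius, rather than being forced into exactly one. The function name, the centroids, and the radius are all hypothetical, chosen just to show a point getting two labels at once.

```python
# Hypothetical sketch of overlap-allowing cluster assignment.
# A point joins every cluster whose centroid is within `radius`,
# so one point can belong to two clusters that are far apart
# (the way "plant" belongs to both botany and manufacturing).

def assign_overlapping(points, centroids, radius):
    """Return, for each point, the list of all cluster indices it belongs to."""
    memberships = []
    for p in points:
        clusters = []
        for i, c in enumerate(centroids):
            # Euclidean distance, in however many dimensions the data has
            dist = sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5
            if dist <= radius:
                clusters.append(i)
        memberships.append(clusters)
    return memberships

# Toy 2-D example: two far-apart centroids; the middle point
# is within range of both and so receives both labels.
centroids = [(0.0, 0.0), (10.0, 0.0)]
points = [(1.0, 0.0), (9.0, 0.0), (5.0, 0.0)]
print(assign_overlapping(points, centroids, radius=6.0))
# → [[0], [1], [0, 1]]
```

This is what a one-cluster-per-point algorithm like k-means cannot express: it would have to put the middle point in one cluster or the other, or "in between", never in both.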