Have you ever implimented a "related pages" engine?



  • For a project that I'm working on, we
    are going to have "pages" which represent one row of a
    table somewhere, and one of the requirements is that the system will
    dynamically figure out which pages are related.

    So for
    example, let's say that you're looking at a web page and the word
    "tuition" appears several times on that page.  The
    system should locate other pages that deal with tuition and provide
    links to them.

    I know, this is probably a pipe dream but I'm
    going to give it a shot.  I thought I would post here and see if
    anyone had any bright ideas. 

    Here's what I've come up with so far. First we'll index every page and produce a list of keywords along with the number of times a given word appears on the page. Finding pages with matching keywords is easy, the important part is sorting them so that the most relevant appear at the top of the list. So for that I'll use some kind of weighting system. Each time that a keyword appears in a page, it should get a little extra weight, with the marginal effect diminishing each time. Then I'll sum the weights on the source and target page and sort descending.


    What do you think?



  • @tofu said:

    Here's what I've come up with so far.
    First we'll index every page and produce a list of keywords along
    with the number of times a given word appears on the page. Finding
    pages with matching keywords is easy, the important part is sorting
    them so that the most relevant appear at the top of the list. So for
    that I'll use some kind of weighting system. Each time that a
    keyword appears in a page, it should get a little extra weight, with
    the marginal effect diminishing each time. Then I'll sum the weights
    on the source and target page and sort descending.


    You need another factor for the total number of pages that contain the word. If a word appears on every page, it means nothing. E.g. words like "the", "is", "a" are not relevant.
    This factor could be computed like this (provided you have more than one page):
    r=(t/c-1)/(t-1)
    r..relevance
    t..total number of pages
    c..number of pages containing the word



  • That's a good way to spot (is, a, etc.) but of course that means I'll have to keep an index of those words.  What would happen if I just ignored words shorter than four letters, had a hardcoded list of longer words to ignore, and then only indexed the most common 10 words of whatever is left?

    Also, I hate to pawn this off on anyone, but I want to go home for the day.  Can anyone think of a formula to produce this series:

    x   f(x)
    1   1
    2   1.8
    3   2.4
    4   2.8
    5   3

    My brain is mush right now.



  • Most content indexing systems call it a stop list and it's a common feature.  You might also want to at least give word normalization a shot -- ex. convert dried and dries to dry, but not dryad.  That way the search returns more accurate hits.

    You might also want to consider the intentions of the content creators.  If the content creators are in a position to try to unfairly influence the system then they are likely to put a bunch of bait words on a document.  If it's an internal system used simply for the convienience of everyone, then a word search might be fine.

    Finally, you might already own a system like this.  If you have Windows 2000 server or Windows 2003, then you already have a sophisticated indexing service that can be queried through ADO.  If you have SQL 2000 or later, then you have a system that can index and search database content.  Both are really easy to use, highly programmable and likely to be far better than anything you or I could come up with in 100 hours or so.



  • @tofu said:

    That's a good way to spot (is, a, etc.) but of course that means I'll have to keep an index of those words.  What would happen if I just ignored words shorter than four letters, had a hardcoded list of longer words to ignore, and then only indexed the most common 10 words of whatever is left?

    Also, I hate to pawn this off on anyone, but I want to go home for the day.  Can anyone think of a formula to produce this series:

    x   f(x)
    1   1
    2   1.8
    3   2.4
    4   2.8
    5   3

    My brain is mush right now.


    f(x) = f(x-1) + (1 - (x + 1)/5);

    or

    f(x) = 1.1x - .1x^2



  • @tster said:

    @tofu said:
    That's a good way to spot (is, a, etc.) but of course that means I'll have to keep an index of those words.  What would happen if I just ignored words shorter than four letters, had a hardcoded list of longer words to ignore, and then only indexed the most common 10 words of whatever is left?

    Also, I hate to pawn this off on anyone, but I want to go home for the day.  Can anyone think of a formula to produce this series:

    x   f(x)
    1   1
    2   1.8
    3   2.4
    4   2.8
    5   3

    My brain is mush right now.


    f(x) = f(x-1) + (1 - (x + 1)/5);

    or

    f(x) = 1.1x - .1x^2


    I was thinking more along the lines of a logarithmic equation along the
    lines of y = 1.2673 LN(x) + 1.  Although not specifically stated, I
    would assume x would be the number of times a word was found and y
    would be the weighting factor and the given points are a subset of the
    continuous function.  Given this assumption, the weighting factor would
    reach a asymptote around 6 at high frequency.  If you needed to
    compress/adjust the curve, you could add a left shifted decreasing
    exponential to the logarithmic function.



    Larry



  • @jsmith said:

    Most content indexing systems call it a stop list and it's a common feature.  You might also want to at least give word normalization a shot -- ex. convert dried and dries to dry, but not dryad.  That way the search returns more accurate hits.

    I believe WordNet (http://wordnet.princeton.edu) can give you the root of a word for you, if you want to try that out.


Log in to reply
 

Looks like your connection to What the Daily WTF? was lost, please wait while we try to reconnect.