Document search

  • Context: selecting candidates for a job, by filling in a form in a database front-end.

    Users of the application are presented with the list of candidates. They can filter the list both by specifying certain attributes of the candidates, like age, desired income, location, etc., and by specifying a "free-form" search which will be applied on the candidates' resumés and other documents.
    While the candidate attributes each have their own input field, the free-form search is a single string which is meant to be a sort of WHERE clause of an SQL statement. So you could write something like 'hot & (rich | single)', for example.

    How does the system work? Not too badly, now that we've introduced this new technique called "indexation", but a few weeks ago the process went something like this:

    1. The user inputs information. Create an SQL SELECT on the candidates table based on the attributes specified by the user. This is pretty straightforward. Don't run the SQL just yet, though.

    2. Run a query to obtain [i]all[/i] the resumés/documents on the system and their respective candidate owners.
       2.1. Now search the documents.
       They are saved as files on the disk, their filenames being their GUIDs in the database. This is to avoid filename collisions (candidates tend to send files called something like 'resume.doc', for some reason). They are also all in the same directory, regardless of the candidate who submitted the document or the job the candidate is applying to.
       So you have a directory with hundreds of thousands of files, whose names are each an unintelligible sequence of 36 chars.
       And yes, you are going to iterate over each of them.
          2.1.1. But wait, most files are not plain text, so you have to extract the text from them.
          This is done by running one of several external programs. You decide which program to call by looking at the extension of the original filename - not at the detected MIME type, which, incidentally, would be the next column in the documents table.
          But because we're not [i]that[/i] dumb, we cache the converted files. The results are stored as new files with the same filename (36-char GUID) plus an added .txt extension. In the same directory.
          2.1.2. Read the whole file into a string.

          2.1.3. Parse the free-form search query into a postfix, array version.
          So 'hot & (rich | single)' becomes array('hot', 'rich', 'single', '|',  '&').
          Yes, we do this for each iteration.
          2.1.4. Receive the array above in a parameter with a name beginning with 'str', and process it:
             Iterate over each token, comparing it (in lowercase) to '&' and '|'.
                If it's not '&' or '|', see if it is contained in the current file (using strpos, this time with the non-lowercased token):
                    Substitute with "1 == 1" if it exists in the file, with "1 == 0" if it doesn't.
                If it is '&' or '|', substitute with its PHP equivalent ('&' becomes '&&', '|' becomes '||') and add it to a result string, surrounded by its two previous tokens and parentheses.
             So at the end of this process, the result would be ('hot' && ('rich' || 'single')) with the strings 'hot', 'rich' and 'single' substituted with either '1 == 1' or '1 == 0'. Store this result in $sTmp.
          2.1.5. eval("if($sTmp) $itmp = '1'; else $itmp = '2';");
          2.1.6. If $itmp == 1, then we have a match! Add the owner of this document to the list of "found" candidates.
          2.1.7. Keep iterating over files.

    3. We've iterated over all the files and now we have a list of "found" candidates, i.e. candidates with documents that are search hits.
    Add an extra condition to the SQL statement of step 1 (remember that?) to constrain on the "allowed" candidate GUIDs.

    4. Finally run the SQL and obtain the final list of candidates.
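
    The matching in steps 2.1.3-2.1.5 boils down to rebuilding the query as a boolean expression string and eval'ing it. Here's a minimal Python sketch of that logic (the original was PHP; 'True'/'False' stand in for the '1 == 1'/'1 == 0' substitutions):

```python
def matches(query_tokens, text):
    # query_tokens is the postfix array from step 2.1.3,
    # e.g. ['hot', 'rich', 'single', '|', '&'].
    # Rebuild an infix boolean expression, replacing each search term with
    # True/False depending on whether it occurs in the text, then eval it.
    stack = []
    for tok in query_tokens:
        if tok in ('&', '|'):
            b, a = stack.pop(), stack.pop()
            op = 'and' if tok == '&' else 'or'
            stack.append('(%s %s %s)' % (a, op, b))
        else:
            stack.append('True' if tok in text else 'False')
    return eval(stack[0])  # the WTF: eval of a built-up string, per document

matches(['hot', 'rich', 'single', '|', '&'], "hot and rich")  # True
```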

    Alongside this code there were global variables called $dick_operands and $dick_logiclist, to name a few.
    A helpful comment explained: "we add prefix not to litter the global namespace".

    The documents were coming in by email.

    Every 10 minutes, the system would:
       1. Download all unread messages, and iterate over them;
       2. Skip email messages that have already been processed, by checking if their message IDs (as returned by the mail server) are already stored in the database.
          For example, make sure there aren't already any emails with id = '' in the database;
       3. Strip malicious tags from the email body. Things like 'object', 'applet', 'java', 'onclick', etc.
          (By "tags from the email body" I actually mean "every single instance of those strings in the whole raw email, including its headers".)
       4. Save attachments to the working directory of the search process above. Also create their respective entries in the documents table.
       5. Save an entry in the emails table for the message just processed
          (it was '', wasn't it?)
       7. Repeat from step 1

    (missing step 6: mark the message as read)
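
    To see why the empty message IDs matter, here's a sketch of the dedup check in step 2 (hypothetical names), assuming the mail server returns '' as every message's ID:

```python
# Dedup-by-message-ID as described in step 2: a message is skipped if its ID
# is already in the "emails table" (a set stands in for it here). If the mail
# server hands back '' for every message, the first email stores '', and every
# email after that looks "already processed" -- while the missing mark-as-read
# step keeps the same messages coming back every 10 minutes.
processed_ids = set()

def should_process(message_id):
    if message_id in processed_ids:
        return False
    processed_ids.add(message_id)
    return True
```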

    This post became enormous as I was writing it.
    I hope your time reading it was worth it.

    And no, I didn't make this up, although I wish I had.

  •  The Gogglez...They Do Nuthing


    Seriously, that does sound like a WTF-worthy way of doing it.

  •  Sounds a lot like a project I did.  Not quite as WTFy.  We had a dictionary of terminology specific to that field of study.  We had a metric boatload of .html pages that other people would edit.  In order to get some nice highlighting/hoverover of definitions, I would open the root directory, then:


    Check if the file was newer than the last search date. (So we don't spend too much time on old files that haven't been updated.)

    Pull the file into a string.

    Inner loop:

    Pull in a dictionary term.

    Search the file for an exact string match.

    Make sure it wasn't in a link, title, or JavaScript, and was located inside the body tag.

    Make sure it was not just a partial match. (Since some "terminology was a single letter!")

    But make sure it matched if it ended in -ed, -s, -ing, etc.

    If so, replace the term with a link to our wiki, and a hoverover with the definition.

    End inner loop.

    If changes were made, write the file back out.

    End outer loop.
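
    A hypothetical Python sketch of that inner loop (the wiki URL and the suffix list are assumptions, and it skips the link/title/script check described above):

```python
import re

def link_terms(html_body, terms, wiki_url):
    # For each dictionary term, replace whole-word matches (optionally with an
    # -s/-ed/-ing suffix) with a link to that term's wiki page.
    # NOTE: sketch only -- it does not implement the "not inside a link,
    # title, or JavaScript" check from the post above.
    for term in terms:
        pattern = re.compile(r'\b(' + re.escape(term) + r'(?:s|ed|ing)?)\b')
        html_body = pattern.sub(
            lambda m: '<a href="%s/%s">%s</a>' % (wiki_url, term, m.group(1)),
            html_body)
    return html_body

link_terms("two widgets here", ["widget"], "https://wiki.example")
```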

    Oh, then I added code to check if the wiki had updated a term.  If the change date on a term was newer than the last search date, open every single file up again.  That was a PITA, so let's create a DB of the last change date and the keywords within each .html page, so we only have to open a page if its change date is newer or a term was updated.

    What a mess.

  •  If God had wanted us to search documents in this manner, He would never have given us inverted files.

    (See IBM product STAIRS and its predecessors, 30-40 years ago.)
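
    For contrast with the linear scan in the original post, the core of an inverted file is tiny - a sketch, not any particular product's implementation:

```python
from collections import defaultdict

# Minimal inverted index: map each word to the set of document IDs
# containing it. Built once, at ingestion time -- not per search.
def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {1: "hot and rich", 2: "poor but single"}
idx = build_index(docs)
# 'hot & (rich | single)' becomes plain set algebra, no file scanning:
idx['hot'] & (idx['rich'] | idx['single'])  # {1}
```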

  • Haha you jokers are searching your files like that!

     With SSDS all the files and the words are stored in just ONE file, you dont have to keep opening lots of files and cashing them on your harddisk because it is just ONE file. YOu just have to search ONE file, none of this stupid SQI rubbish!

     Ha this is why I am the writer of a successful product and you are not sir.

  • @Steeldragon said:

    The Gogglez...They Do Nuthing
    The lead-lined nuclear bunker... it does nothing.

    One word: Brillant.

  • @menta said:

    Ha this is why I am the writer of a successful product and you are not sir.

    Plus, if you ever need your search to run faster, you can just open up the source code and JAM IT.

  •  WTF #1: Allowing employers to filter by age.  That is seriously illegal most of the time in the US.

  • @jcoehoorn said:

     WTF #1: Allowing employers to filter by age.  That is seriously illegal most of the time in the US.

    As I understand it (IANAL), it is legal in a few very particular circumstances, even within the US:

    • If the job involves the dispensing of alcohol, it is permissible to use the age filter >= 21.
    • If the job is within the sex industry, it is permissible (required, even) to use the age filter >= 18.
    • It is permissible for jobs outside of the sex and alcohol dispensing industries to use the age filter >= $minimum_age, where $minimum_age varies by state law, but is 16 or less in all instances I'm personally familiar with.

    I suspect that there are other exceptions as well.  As the old joke goes - students just entering law school, when asked whether something is legal generally have to say "I don't know."  But graduating from law school, they're able to say, with confidence, "It depends."

  • Shouldn't it be (hot | rich) & single?

  • @Zecc said:

    So you could write something like 'hot & (rich | single)', for example.
    @lolwtf said:
    Shouldn't it be (hot | rich) & single?

    I think Zecc knows exactly what he means, and I think he has it the right way round too.

  • @jcoehoorn said:

    WTF #1: Allowing employers to filter by age. That is seriously illegal most of the time in the US.
    Surprisingly no. It's illegal to filter "age ≤ 40" (or any number above 40), but that's about it.

  • @TwelveBaud said:

    @jcoehoorn said:
    WTF #1: Allowing employers to filter by age. That is seriously illegal most of the time in the US.
    Surprisingly no. It's illegal to filter "age ≤ 40" (or any number above 40), but that's about it.

    Actually it's not even illegal to do that.

    During layoffs, you have to know which employees are over 40 and send only them a letter stating the ages of the people laid off.  E.g., last time I went through one, I got a letter stating that x number of people were between ages 18 and 25, y number between 25 and 40, and z number over 40.  This letter was sent only to those people in the 40-and-over bracket.

    It is only illegal to make decisions about employment, pay, or benefits based on age in most cases (exceptions listed in a previous post); it is not illegal to run reports or filters based on age.  How do you think statistics are built on the age of the workforce in different industries?
