Simple search



  • Hey all,

    My Google-fu is weak for what I think should be a simple problem.

    • 1 I have a list of words (in Excel), let's say spare parts
    • 2 I have a set of .docx documents (in multiple folders)

    I want to know:

    • which words from the list (which spare parts) are mentioned in which document,
    • which spare parts are not mentioned at all

    A manual search, word by word, is not feasible (800+ words, 200+ .docx files), and this is a recurring task.

    I am certain that this is a solved problem; however, I cannot find a solution...

    What would you recommend? It can run on either Windows or OS X. Any ideas?

    Thanks!


  • Notification Spam Recipient

    You'll likely need to convert your docx files to plain text to reduce processing. Something like this might help get you started:

    Same thing for your list of search terms, though that should be easy to copy-paste into Notepad or something.

    Then you can use a utility like FNR or PowerShell's Select-String command to open all the files and search them for those terms.

    I suppose it can be done on Mac OS X, but extracting all the text from DocX files there might be slightly less automatable.
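
    A minimal Python sketch of the plain-text conversion step, assuming only the standard library: a .docx file is a ZIP archive whose main body lives in word/document.xml, so the text can be pulled out without Word installed. The name docx_to_text is made up for illustration, and headers, footers, footnotes and text boxes are not handled here.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of a .docx file, one line per paragraph.

    Hypothetical helper: reads only the main document part.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Join the text runs (w:t) so a word split across runs still matches.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)
```

    Since this needs nothing beyond Python itself, the same script should run on both Windows and OS X.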



  • @Kurt-C-Pause I'm pretty sure you can do that with a VBA script attached to your Excel document, but I'd have to refresh my own memory to be able to offer exact details on how to do it. I haven't done much with VBA, but that seems like a perfect application for it because all the data manipulation can be done with native calls without having to convert any files.

    I do know you'll have to start by enabling the Developer tab in Excel, if you haven't already, in order to get to the script editor.

    Recording a macro to copy one cell, open a file, and search for the word would provide a good start. Then the macro script can be edited to loop over all the words and through all the files in given directories, get the search success/fail value, and save the filename if found.
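
    The loop described above (every word against every file, recording where it was found) could look roughly like this. It is a sketch in Python rather than VBA; find_mentions and its arguments are hypothetical names, and it assumes the document text has already been extracted into strings and that a case-insensitive substring match is good enough.

```python
def find_mentions(terms, documents):
    """Map each term to the documents that mention it, and list unused terms.

    `documents` maps a file name to that file's text. Matching is a
    case-insensitive substring test (an assumption of this sketch), so
    "valve" would also match inside "valves".
    """
    lowered = {name: text.lower() for name, text in documents.items()}
    mentions = {}
    for term in terms:
        needle = term.strip().lower()
        if not needle:
            continue  # skip blank cells from the spreadsheet
        mentions[term.strip()] = sorted(
            name for name, text in lowered.items() if needle in text
        )
    # Terms whose list of files is empty were never mentioned anywhere.
    unmentioned = [term for term, files in mentions.items() if not files]
    return mentions, unmentioned


if __name__ == "__main__":
    found, missing = find_mentions(
        ["gasket", "bearing"],
        {"manual.docx": "Replace the gasket yearly."},
    )
    print(found)
    print(missing)
```

    The two return values answer both questions from the original post: which part appears in which document, and which parts appear nowhere.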


  • Considered Harmful

    Sounds like a job for SSDS.



  • @Tsaukpaetra said in Simple search:

    I suppose it can be done on Mac OS X, but extracting all the text from DocX files there might be slightly less automatable.

    Word for Mac is AppleScriptable, so you can tell it to open each document in turn and find the words you’re looking for.


  • :belt_onion:

    @Kurt-C-Pause
    1/ First convert the Word documents to HTML using the Office Primary Interop Assemblies (PIA), e.g.

    Note that you need to have MS Word installed on the machine running this code.

    2/ Then you can run a tool like HTML Tidy on the result to clean it up.

    Now you should have nice, clean XHTML versions of your documents.

    3/ Load the XHTML files in an XmlDocument or load them as free text.

    4/ Load the Excel file using ODBC

    5/ At this point, you can reference all the data you need from code, so you can upload it into a search utility (e.g. SSDS) or write one yourself.
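
    For the Excel step, a rough alternative to ODBC is to read the workbook directly, since an .xlsx file is also a ZIP archive of XML parts. The Python sketch below is only an illustration under stated assumptions: read_column is a made-up name, the word list is assumed to sit in column A of the first sheet, and that sheet is assumed to be stored as xl/worksheets/sheet1.xml, which is typical but not guaranteed.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML namespace used by the sheet and shared-strings parts
S_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def read_column(xlsx_path, column="A", sheet="xl/worksheets/sheet1.xml"):
    """Return the non-empty values of one column as a list of strings.

    Hypothetical helper; the default sheet path is an assumption.
    """
    with zipfile.ZipFile(xlsx_path) as archive:
        shared = []
        if "xl/sharedStrings.xml" in archive.namelist():
            strings_root = ET.fromstring(archive.read("xl/sharedStrings.xml"))
            for item in strings_root.iter(S_NS + "si"):
                shared.append("".join(t.text or "" for t in item.iter(S_NS + "t")))
        sheet_root = ET.fromstring(archive.read(sheet))

    values = []
    for cell in sheet_root.iter(S_NS + "c"):
        # A cell reference looks like "A12"; drop the digits to get the column.
        if re.sub(r"\d", "", cell.get("r", "")) != column:
            continue
        value = cell.find(S_NS + "v")
        raw = value.text if value is not None and value.text else ""
        if cell.get("t") == "s":
            # Shared string: the stored value is an index into the string table.
            text = shared[int(raw)] if raw else ""
        elif cell.get("t") == "inlineStr":
            text = "".join(t.text or "" for t in cell.iter(S_NS + "t"))
        else:
            text = raw  # numbers and other plain values
        if text.strip():
            values.append(text.strip())
    return values
```

    The resulting list can then be matched against the extracted document text by whatever search you end up using.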


  • BINNED

    @bjolling

    6/ Remove your gloves
    7/ ...
    8/ Profit!

