Not sure what to do.... new job



Hi, I was laid off a couple of months ago and started a new job just today.

    I graduated last year, and this new job was advertised as a new grad job.

The pay is a bit lower than I was hoping for, but the environment seems extremely nice and the coworkers are friendly.

    By the way, this is a web development position that uses mainly PHP.

Anyway, I had the standard tour of the system this morning, and this afternoon I was handed a brief for my first project.

I took a look at it and was completely shocked.

Basically, what they want me to do is create a web crawler that downloads certain PDF files (all legal stuff, nothing illegal).

Now, I have never done anything close to this before and have no idea how to even tackle it. What they want is very specific: they have a list of sites they want me to crawl (about 200 or so), and I am supposed to download all the PDFs from those sites. (Of course, each site is formatted differently... some just have a list of all their PDFs, which would be easy to parse, but other sites are search-based.)

During the interview, they didn't indicate at all that this would be my first project. They said they would start me off with something simple (a basic add/update/delete application) and let me work my way up.

    They do have one other programmer here (the guy who actually gave me the brief) and he himself said this might be a bitch....

I am just debating exactly what I should do. I don't want to be a Paula and have absolutely nothing to show after a couple of months... but I don't want to quit either and spend who knows how long finding another job.

Any insight on what I should do? (I had one previous job, and while it was challenging, it was never so hard that I had no idea what to do.)

I will try to figure out how to even start the project for the rest of today and maybe a bit tomorrow... but right now I am leaning toward just saying I have no idea how to do this.

     


     



This doesn't seem too bad to me; I think you should stick around and go for it.  Do you have to use PHP?  It seems like this would be better handled by Python or Ruby.

    In Python you can do something like:

from urllib import urlopen
urlopen('http://example.com/somewebpage.html').read()

to read a web page.  It looks like urlopen() gives you a file-like object, so I think you could use it to save off the .pdfs too.

    See: http://docs.python.org/lib/module-urllib.html for more on urlopen.

The program would look like a standard web crawler: it would start with your seed URLs and then follow every link until it finds the PDFs that you need. Of course, if you have no idea how to do this, you should tell the company that, so they have realistic expectations about how long it will take you, but I think you should stick with it. It might even be fun!


     



  • Oh I agree it might be fun, once I figure out how to even start.

    And yeah, they want everything done in PHP.

    I know how to get all the content from a page (file_get_contents) and parsing just one page wouldn't be bad.

The problem is I am going to have to parse hundreds of pages, all from different companies and all formatted differently. And it isn't quite as simple as checking for a PDF and downloading it, because with each PDF there is also a title somewhere on the page for it, and I am going to need the title too (the filename doesn't always contain the title).

I definitely want to stick with it, because I don't like to be a quitter; I'm just trying to figure out how to even handle this.

I will plug away at this for the rest of today and tomorrow, then look at my progress and I guess figure out what to do from there.

I have only worked with PHP for 4 months, and with web development for about 8 months total. None of it involved web crawling or parsing pages... heh, this will definitely be a nice challenge though.



  • Overcomplicated. You can do this all with a single wget command. Writing any code for it is a complete waste of time.



  • Doh, asuffield to the rescue again.  Yeah, go with wget.



Thanks a ton. This looks like it may do everything I want.

The only thing I'm not sure about is how to retrieve files from search-based sites... but maybe it isn't possible at all.



  • @accident said:

The only thing I'm not sure about is how to retrieve files from search-based sites... but maybe it isn't possible at all.

    If wget doesn't handle them, then it's too complicated to be practical. You can't really extract any more structure from a site than the author put into it (and if you could, you would be (a) earning ten times more than you do, and (b) working for Google). All problems based around spidering web sites are either really easy or really hard.



My advice: any time you need to do something that seems very common (and web crawling is very common), look for something that already exists to do it for you.

For example, in Perl you have HTML::LinkExtor on CPAN; I'm sure PHP has something similar (sketch below). However, as asuffield pointed out, wget will cover you here.
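
That said, since you mentioned needing the title that sits next to each PDF as well, here is a minimal, untested sketch of that kind of link extraction in plain PHP, using the built-in DOMDocument class. The URL is a placeholder, and it assumes the title is just the link text, which won't be true on every site:

<?php
// Untested sketch: fetch one page and list every PDF link plus its link text.
$pageUrl = 'http://www.example.com/documents/';   // placeholder URL

$html = file_get_contents($pageUrl);
if ($html === false) {
    die("Could not fetch $pageUrl\n");
}

$doc = new DOMDocument();
@$doc->loadHTML($html);   // @ hides warnings from sloppy real-world markup

foreach ($doc->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');

    // Only interested in links that point at a PDF.
    if (strtolower(substr($href, -4)) !== '.pdf') {
        continue;
    }

    // Assumption: the link text is the human-readable title you need.
    $title = trim($link->textContent);

    // Relative links would still need to be resolved against $pageUrl
    // before downloading; skipped here to keep the sketch short.
    echo $title . ' => ' . $href . "\n";
}

Downloading each file would then just be another file_get_contents() on the resolved link plus a file_put_contents() to disk.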



  • @asuffield said:

    Overcomplicated. You can do this all with a single wget command. Writing any code for it is a complete waste of time.

    That's exactly what I thought of too :)

    Here's an example that should get you started:

    wget -nc -t 20 -r -nd -A.pdf -i textfilewiththe200urlstosearch.txt

    -nc means don't overwrite anything you already have (you can run this command again if it fails for some reason and it won't re-download things)

    -t 20 is retry 20 times before giving up

    -r means recurse through the website

-nd means don't recreate the remote directory structure; all the downloaded files just get saved into the current directory

    -A.pdf means only get PDF files

-i text.txt means read the list of starting URLs, one per line, from that text file.

     
:) Untested, but that should work well enough; wget is great.

    -Jesse
     



Thank you for all your replies. This handles pretty much every site I want (the search-based sites I can't get, but oh well, my employer will just have to deal with it).

One problem I am having, though, is with the http://www.agrium.com/products_services/msds/ site.

There are three sections: dry products, liquid products, and specialty products. I want to download all the PDFs in each one of these sections.
The problem I am having is that wget doesn't seem to crawl into those links no matter what I try. I am guessing it is most likely because it is a JSP page.
So, for example, I want it to crawl to the dry section and download all the PDFs on that page, then move to the liquid section and do the same.
I could link to each page directly, but I need all three pages and want to do it in one command. I know I could include the three separate links, but I wanted it to crawl to each one.

Wget seems to work with ASP pages, but I seem to have to use -A *.asp,*.pdf. If I don't download the .asp pages, it won't crawl the links on them for some reason.

But thanks again for suggesting wget; it's saving me a bunch of headaches.
     



  • @accident said:

One problem I am having, though, is with the http://www.agrium.com/products_services/msds/ site.

There are three sections: dry products, liquid products, and specialty products. I want to download all the PDFs in each one of these sections.

The problem I am having is that wget doesn't seem to crawl into those links no matter what I try. I am guessing it is most likely because it is a JSP page.

Study the wget documentation for recursive mode carefully (not just the manpage). All it does is fetch each page, extract every link, and follow all the ones that match its current criteria. It has no understanding of what links or filenames mean; it just sends them back to the server and looks to see whether the result has a content-type of text/html or application/xhtml+xml. Since you're using -A as a filter, and .jsp isn't on your accept list, it doesn't follow those links. You could add .jsp to the accept list or add those three URLs directly to the input list. The same thing happens with .asp.

Be wary of letting wget recurse through things like .jsp, .asp, or .html, because then it will tend to walk over the entire site when you probably only wanted to visit a small area. Use other criteria like -np or -l to inhibit this; see the example below.
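
Untested, but for that site the idea would be something along these lines (the -l depth limit is just a guess):

wget -r -l 2 -np -nd -A '*.jsp,*.pdf' http://www.agrium.com/products_services/msds/

The .jsp pages it pulls down along the way are only needed so it can follow their links; you can delete them afterwards.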



For the "search"-based sites, why not use Google?

as in "site:adobe.com filetype:pdf"

Then extract the links to the pages from the Google search results.



  • @stratos said:

For the "search"-based sites, why not use Google?

as in "site:adobe.com filetype:pdf"

Then extract the links to the pages from the Google search results.

    I don't know if they still provide it, but Google used to have a search API you could use (after a free registration). That way, you wouldn't have to parse the HTML to get to the results.  



There are actually two APIs.

If you have an old key you can still use the old one; otherwise you will have to use the new one.
The old one was pretty simple; dunno about the new one, since I still have an old key :)

Although I haven't made use of it for some time now.

I think the new one is Java only, while the old one was just RPC/SOAP something-or-other.
     

