I'm in the midst of a screen-scraping marathon. It's nothing I can't handle, but I just got stuck on a site that was possibly the most annoying yet. Although that doesn't sound right, because it caved in too easily. Let me run through the little obstacles that I ran into:
- AJAXy .NET site -- not the world's worst thing, but these sites are awfully noisy when you record them. The nice thing about .NET is that it's really predictable and easy to pull the data out of. If you get your viewstate right you can do just about anything easily. But it's loud: the pages are large, and that viewstate can get huge.
- Fake screenshots being passed back? This one was weird. They were running through the form and recreating the HTML behind the page by moving from element to element, and mashing that into a huge text chunk. And then they had the gall to Base64 encode it and POST it back on every single page, as a kind of image of the previous page. OK, so I had to do a little scripting to get that working, but who cares anymore, right?
- The sheer size of the pages and all of the posted-back crap made the proxy recorder in screen-scraper give up. I'd never seen that before. It still worked, but it just decided it had had enough and stopped telling me what was happening. So I had to pull out tamper for Firefox (I know, right?), record the session, and then hand-extract the HTTP request/response sections. What is this, the dark ages?
- Did I mention it was a mainframe login? Yeah, when you log into the site you jump through a couple of hoops (in new windows, of course) that log you into a mainframe or VAX or something, and their site is just scraping that session. So when I scrape the site I'm scraping scraped pages. Scrape-itty Scrape-itty Scrape.
- Pay-Per-Minute. Probably related to the mainframe. But it makes that whole "and then we log off" part really important, so my scrape doesn't rack up hundreds of dollars of charges. Ended up being OK, as I can scrape through 64 pages (the maximum size of a search result set) in less than a minute. Pretty sure they're losing money on that one. I don't know how long it would take me to page through 64 pages of that crud by hand with a browser. And who would want to, anyway?
- Another minor annoyance is split parameters. Instead of asking for one "widget ID number" they want the "Widget ID number" and the "Widget year number" and the "Widget size" all passed separately, with no way to get at the Widget easily. So I have to script up a little bit of string splitting logic. It's no big deal, but it's just another grain of sand in the swimsuit of scraping.
- Multiple types of paging. This one is a new one. So you search for something and it comes back in pages 1, 2, 3, etc. Fine, I have stuff that does that nicely. But beyond that there is another page on the inside (the detail) that has multiple sections with small slices of data that use paging. You have to select (but not actually select -- just "kinda" select) an item in the small slice of data and then click on the "page down" link on the left-hand side of the page. Oh, and that's only if there's more than a little bit of data for that section. Could be 1 item, could be 7 items, could be 300 items (that's 100 mini-pages, if you weren't counting). So sometimes that paging exists, and sometimes it doesn't. Oh, and you can always hit "page up" or "page down" like you would for your paging, but if you get it wrong, it's going to take you to either another page, another data page from the same item, or... back to your search results. So don't get that one wrong, because then you lose all of your context for all the other paging you're doing.
- You know that last point about multi-paging? Yeah, there's a second section on the same page that has the same "maybe, maybe not" paging. More scripts.
- ... and there's another page you have to go to beyond that little nightmare page. Guess what? Yeah, you got it. More paging. Only this one starts on the LAST page of the data, and you can't page up and down with a button. You have to figure out how many pages total there are, and then request them one by one. Don't mess it up, either, or you'll end up on another page completely. Scripty Scripty Scripty Script.
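For the curious, here is a rough Python sketch of the kind of glue logic a few of those bullets call for. It is not the actual script from this scrape, and nearly every specific is an assumption made for illustration: the combined widget string format (`12345-2009-L`), the hyphen separator, the field names, and pages being numbered from 1. The three-items-per-mini-page figure is only inferred from "300 items, 100 mini-pages" above.

```python
import math


def split_widget(widget, sep="-"):
    """Split one combined widget string into the three separate form fields.

    The "ID-year-size" layout and the field names are hypothetical.
    """
    widget_id, year, size = widget.strip().split(sep)
    return {"widget_id": widget_id, "widget_year": year, "widget_size": size}


def mini_page_count(item_count, per_page=3):
    """How many mini-pages a detail section has (always at least one).

    per_page=3 is inferred from 300 items making 100 mini-pages.
    """
    return max(1, math.ceil(item_count / per_page))


def remaining_pages(total_pages):
    """Pages still to request when the site drops you on the LAST page.

    Assumes pages are numbered from 1 and the last one is already loaded.
    """
    return list(range(1, total_pages))


if __name__ == "__main__":
    print(split_widget("12345-2009-L"))
    print(mini_page_count(7))
    print(remaining_pages(4))
```

The number of "page down" clicks a section needs would then be `mini_page_count(n) - 1`, which comes out to zero for the sections where the paging doesn't exist at all.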
Sorry, I just needed to get that out. :)
The stressful part was figuring out what needed to be done, and then (worse yet) figuring out how to do it. There's no book, no Google search, no co-worker to ask. I described it to Eli as "trying to glue a statue together in the dark" -- but when the lights go on it has to be perfect or it falls apart again. If you get something wrong all you get is a generic error or a blank page. And it's one more footfall in the dark that might get heard by the wrong person and get you blocked from the site.
After I got it all finished, I knocked out a dozen scrapes of easy sites this evening just to cleanse my palate. It was so nice to know what I was doing again. Boring, but satisfying.
Love it.