Aug 23, 2012

Screen Scraping

I don't have anyone I can talk to about my work, so sometimes I have to talk to myself.  Like this.

I'm in the midst of a screen-scraping marathon.  It's nothing I can't handle, but I just got stuck on a site that was possibly the most annoying yet.  Although that doesn't sound right, because it caved in too easily.  Let me run through the little obstacles that I ran into:

  • AJAXy .Net site -- not the world's worst thing, but they're awfully noisy when you record them.  The nice thing about .Net is that it's really predictable and easy to pull the data out of.  If you get your viewstate right you can do just about anything easily.  But it's loud: the pages are large, and that viewstate can get huge.
  • Fake screenshots being passed back?  This one was weird.  They were running through the form and recreating the HTML behind the page by moving from element to element, and mashing that into a huge text chunk.  And then they had the gall to Base64 encode it before POSTing it back on every single page -- the image of the previous page.  OK, so I had to do a little scripting to get that working, but who cares anymore, right?
  • The sheer size of the pages and all of the posted-back crap made the proxy recorder in screen-scraper give up.  I'd never seen that before.  It still worked, but it just decided it had had enough and stopped telling me what was happening.  So I had to pull out Tamper Data for Firefox (I know, right?), record the session, and then hand-extract the HTTP request/response sections.  What is this, the dark ages?
  • Did I mention it was a mainframe login?  Yeah, when you log into the site you jump through a couple of hoops (in new windows, of course) which logs you into a mainframe or VAX or something, and their site is just scraping that session.  So when I scrape the site I'm scraping scraped pages.  Scrape-itty Scrape-itty Scrape.
  • Pay-Per-Minute.  Probably related to the mainframe.  But it makes that whole "and then we log off" part really important.  So my scrape doesn't rack up hundreds of dollars of charges.  Ended up being ok, as I can scrape through 64 pages (maximum size of a search result set) in less than a minute.  Pretty sure they're losing money on that one.  I don't know how long it would take me to page through 64 pages of that crud by hand with a browser.  And who would want to, anyway?
  • Another minor annoyance is split parameters.  Instead of asking for "widget ID number" they want the "Widget ID number" and the "Widget year number" and the "Widget size" all passed separately.  With no way to get at the Widget easily.  So I have to script up a little bit of string splitting logic.  It's no big deal, but it's just another grain of sand in the swimsuit of scraping.
  • Multiple types of paging.  This one is a new one.  So you search for something and it comes back in pages 1, 2, 3, etc.  Fine, I have stuff that does that nicely.  But beyond that there is another page on the inside (the detail) that has multiple sections with small slices of data that use paging.  You have to select (but not actually select -- just "kinda" select) an item in the small slice of data and then click on the "page down" link on the left hand side of the page.  Oh, and that's only if there's more than a little bit of data for that section.  Could be 1 item, could be 7 items, could be 300 items (that's 100 mini-pages, if you weren't counting).  So sometimes that paging exists, and sometimes it doesn't.  Oh, and you can always hit "page up" or "page down" like you would for your paging, but if you get it wrong, it's going to take you to either another page, another data page from the same item, or... back to your search results.  So don't get that one wrong, because then you lose all of your context for all the other paging you're doing.
  • You know that last point about multi-paging?  Yeah, there's a second section on the same page that has the same "maybe, maybe not" paging.  More scripts.
  • ... and there's another page you have to go to beyond that little nightmare page.  Guess what?  Yeah, you got it.  More paging.  Only this one starts on the LAST page of the data, and you can't page up and down with a button.  You have to figure out how many pages total there are, and then request them one by one.  Don't mess it up, either, or you'll end up on another page completely.  Scripty Scripty Scripty Script.
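
For the record, the paging arithmetic from those last few bullets boils down to something like this.  It's a rough Python sketch, not my actual script: the function names are made up, the 3-items-per-mini-page figure is only what "300 items = 100 mini-pages" implies, and the order for the last-page-first section is a guess.

```python
import math

# Assumption: 3 items per mini-page, inferred from "300 items = 100 mini-pages".
ITEMS_PER_MINI_PAGE = 3


def mini_page_count(item_count, per_page=ITEMS_PER_MINI_PAGE):
    """Number of mini-pages a section needs (at least 1, even when empty)."""
    return max(1, math.ceil(item_count / per_page))


def page_down_clicks(item_count, per_page=ITEMS_PER_MINI_PAGE):
    """How many times to follow "page down" before stopping.

    The count is worked out up front so the script never pages past the
    end of a section and loses its place.
    """
    return mini_page_count(item_count, per_page) - 1


def pages_still_to_request(total_pages):
    """For the section that opens on its LAST page: the other page numbers.

    Highest first is a guess at the order; the point is that every page
    is requested explicitly and stays inside 1..total_pages.
    """
    return list(range(total_pages - 1, 0, -1))
```

So 1 item means no paging at all, 7 items means two page-downs, and 300 items means 99 of them.
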
Sorry, I just needed to get that out.  :)
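
And since I brought up the split parameters: that string splitting really is as small as it sounds.  Here's a minimal Python sketch of the idea.  The combined format ("id-year-size" with a hyphen) and the field names are purely hypothetical; the real site's values looked nothing this tidy.

```python
def split_widget_id(combined, sep="-"):
    """Split one combined widget identifier into three separate form fields.

    Assumption: the combined value looks like "<id><sep><year><sep><size>".
    The keys in the returned dict are placeholders, not the site's names.
    """
    parts = [p.strip() for p in combined.split(sep)]
    if len(parts) != 3 or not all(parts):
        # Fail loudly here; the site itself would only return a blank page.
        raise ValueError(f"expected 3 non-empty parts, got {combined!r}")
    widget_id, year, size = parts
    return {"widget_id": widget_id, "widget_year": year, "widget_size": size}
```

Each piece then gets posted as its own parameter.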

The stressful part was figuring out what needed to be done, and then (worse yet) figuring out how to do it.  There's no book, no google search, no co-worker to ask.  I described it to Eli as "trying to glue a statue together in the dark" -- but when the lights go on it has to be perfect or it falls apart again.  If you get something wrong all you get is a generic error or a blank page.  And one more footfall in the dark that might get heard by the wrong person and get you blocked from the site.

After I got it all finished, I knocked out a dozen scrapes of easy sites this evening just to cleanse my palate.  It was so nice to know what I was doing again.  Boring, but satisfying.

Love it.

Aug 19, 2012

Sunday

After another night of sleeping like the dead, we awoke to steady rain.  We had had our setbacks on the trip, but when it's your last day of vacation, the only thing ahead of you is the long drive home, and it looks like breaking camp is going to happen in a constant rain, you really just want to go back to sleep.

So we did.

By the time we all got up the sun had come out and while everything was wet, it was certainly happier than it had been.  We made it out before noon and began our long drive.

Stopped in Bessemer to pick up some pasties just before we left Michigan.  Jill went in, and apparently the woman selling them was so old that Jill was afraid she wouldn't make it through the sale.  I'll bet she'll still be there and just as old in ten years.  Hurley, the town just across the border, has more bars per capita than any other town in Wisconsin.  It was incredible!  Felt like Wisconsin Dells for skid row bums.


We made our way to Mellen and decided to lunch on pasties there.  Finding a scenic overlook (only 86 steps up, pfft, that's nothing!) with a wonderful view, we devoured the pasties.  I don't know if it was hunger or the end of the vacation, but that was the most flavorful, tastiest pasty I'd ever had.


When my sister and I were in middle/high school we went with our parents on a camping trip to see Bayfield/Madeline Island and Northern Wisconsin.  We camped in one of the many National Forest Campgrounds and had a great time.  At one point Dad said "this would be a perfect trip if it weren't for the kids" -- he meant the campsite near us with loud kids, but somehow my sister thought he meant us.  I think it was years later that we cleared that one up.  I also remember being concerned about bears, but was assured by Mom and Dad that there were none and even if there were, they were black bears and they don't attack and kill people like grizzlies do.

The next week I found a newspaper article about a fatal black bear attack in Mellen, WI.  I kept that in my wallet for years.  A satisfying "told you so."

The last notable sight on our way home was a collection of "Elk Watch" signs.  Apparently they had a pilot program in which (I assume tagged) elk were being tracked and if they were near the road, the crossing signs would flash a yellow light above.  We passed a number of them that were lit but saw no elk.  Later we saw some that weren't lit and I had to wonder if we had missed seeing some elk by "just that much."  It's also possible they just put flashing lights on some of the signs to make people slow down.  We certainly did.


A long time ago, when we travelled a lot with my Mom, Jill began reading in the car.  She would pick a book that the kids would like but that might be a bit much for them to dig into on their own.  It started with the Harry Potter series, but we've gone through some other books as well since then.  Mom used to love listening to Jill; I think it was her only chance to "read" and it felt like we were helping.  But Mom doesn't travel with us anymore, and Jill still reads.  We're currently reading the second Fablehaven book.  On Jill's turn driving Eli even took a couple of chapters to read aloud.  He's quite good.

It's not the end of the vacation, or the end of the trip, but this is the end of the commentary.  We're in Wisconsin near our cabin, so all the roads are reruns.  None of the sights are new, and every passing minute feels like an hour.  We just want to get home.  And home we shall get.