This was going to be a tutorial on getting these two things running to achieve everything I want. Sadly, I can’t work out how to get the last step working, which is navigating the returned Ajax page so that I can extract different information.
As such, this is more a guide to getting the two things installed and working – if you have any more luck than I do with navigating Ajax, then let me know!!
XULRunner
First things first, I downloaded the Windows version of XULRunner from here (look in the runtimes directory!):
http://releases.mozilla.org/pub/mozilla.org/xulrunner/releases/
(Unpacking takes a while: the 8.23MB download contained 302 items totalling 18.8MB!)
Crowbar
Not such a simple download for the uninitiated. It hasn’t actually been released, so its files live in Subversion – you’ll need a Subversion client to download it. I don’t have one on the machine I’m working on, so another post will cover the ins and outs of downloading Crowbar with Subversion.
All Downloaded and Unpacked – Onwards we go!
Back to the instructions here, which tell me that once I’ve done all this I should open a command prompt (thankfully a place I’m familiar with) and run:
c:\> %XULRUNNER_HOME%\xulrunner.exe --install-app %CROWBAR%\xulapp
c:\> cd %CROWBAR%\xulapp
c:\> %XULRUNNER_HOME%\xulrunner.exe application.ini
Windows Firewall blocked the program, but that was kind of expected, so I unblocked it.
I now have a Crowbar window and an Error Console; apparently I can use Crowbar by visiting:
http://127.0.0.1:10000/
On doing so, a nice little web page pops up, rather like a web proxy, asking me what page I want to fetch.
I inserted my Ajax-based page and the next thing I know, I’m being presented with all the source code for that page, which includes all the output from the JavaScript that wouldn’t be there when I did a plain PHP curl GET on the page!!
Now apparently I can run this using curl (why do I get the feeling I’ll have to install a fair bit of software on my laptop to get this all working over there?).
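I haven’t actually wired this into PHP myself yet, but going by the Crowbar instructions a PHP curl call ought to look roughly like the sketch below. Treat it as a guess: the url and delay POST parameters are what the docs describe, and the target address is just a placeholder.

<?php
// Rough sketch: ask Crowbar (listening on 127.0.0.1:10000) to fetch a page,
// run its JavaScript, and hand back the resulting HTML.
// The "url" and "delay" POST parameters come from Crowbar's docs;
// the target address below is only a placeholder.

$ch = curl_init('http://127.0.0.1:10000/');
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => http_build_query(array(
        'url'   => 'http://www.example.com/my-ajax-page',
        'delay' => 3000, // milliseconds to give the JavaScript time to run
    )),
));
$html = curl_exec($ch);
curl_close($ch);

// $html should now contain the page *after* the JavaScript has run,
// unlike a normal curl GET straight at the site.
echo $html;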
OK, so all well and good – we’ve fetched one page. But that page has a dropdown box on it that forces the entire page to change – how do I go about “Crowbarring” my way around that?
With little documentation I can’t see a way… Back to the drawing/scraping board?
FireWatir might be a better tool for you:
http://wiki.openqa.org/display/WTR/FireWatir
Dan
I eventually resorted to outsourcing it via rentacoder to an absolutely excellent coder in the US.
He provided exactly what I needed in PHP, to the spec I provided, with no extensions or the like! I was really pleased with his work – I think he was kind of surprised that I had no complaints or changes that needed making, too!
Keiron,
So you managed to crawl your pages with PHP using curl and Crowbar?!
I would love to see that source code, man. I’m having a bit of a tough time curling my way into some JavaScript, and even with Crowbar installed and ready to go I don’t seem to get the results I want.
What did the curl line you used to call JavaScript pages look like?
All the best,
~ John
Hi John,
I outsourced it in the end as I needed to get it done quickly. Interestingly, I have another project coming up that may need it – I need to dig out the source code, and I’ll let you know once I can put together some decent examples!
Excellent! I’m glad to hear you got it sorted in the end.
Crowbar is an interesting program… you’d think that reading/interpreting JavaScript would be something that all web spiders would be able to do – yet NONE of them do it, for the simple reason that understanding JavaScript and the DOM requires, essentially, a full browser. Building a full browser into a spidering engine is overkill for just this little bit of added functionality – but when you need to scrape JS, you need to scrape JS!
As a result, the only program that will allow you to do headless JS processing is XULRunner/Crowbar… but Crowbar doesn’t understand cookies!
If your outsourced method doesn’t do it, I guess the only other option is to either modify Crowbar to understand cookies and send them on to XULRunner – OR – point the Crowbar proxy at another proxy which can inject cookies into the headers, and send/receive/modify cookies that way.
Both ways would work, and the latter way would probably be more extensible, but it’s not very neat… not to mention both would require me to know how to code applications for XULRunner (which I think is actually all JavaScript and a bit of C using a bridging library, but still… all I can code in is PHP and HTML :P).
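Just to make that second option a bit more concrete, I imagine a toy cookie-injecting proxy in PHP would look something like the sketch below – completely hypothetical, the port and cookie value are made up, it only handles plain HTTP GETs, and you’d have to point XULRunner’s proxy settings at it:

<?php
// Hypothetical sketch of a tiny cookie-injecting proxy.
// Point Crowbar/XULRunner's HTTP proxy settings at 127.0.0.1:10001 and this
// script will re-fetch each page itself with an extra Cookie header bolted on.
// GET-only, plain HTTP only – just enough to illustrate the idea.

$cookie = 'session=abc123'; // whatever cookie the target site expects

$server = stream_socket_server('tcp://127.0.0.1:10001', $errno, $errstr);
if (!$server) {
    die("Could not listen: $errstr\n");
}

while ($client = stream_socket_accept($server, -1)) {
    // A proxied request line looks like "GET http://example.com/page HTTP/1.1"
    $requestLine = fgets($client);
    if (!preg_match('#^GET (http://\S+) HTTP#', $requestLine, $m)) {
        fclose($client);
        continue;
    }

    // Read and discard the rest of the browser's headers.
    while (($line = fgets($client)) !== false && trim($line) !== '') {
        // ignored in this sketch
    }

    // Fetch the page ourselves, injecting the cookie.
    $ch = curl_init($m[1]);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_HTTPHEADER     => array('Cookie: ' . $cookie),
    ));
    $body = curl_exec($ch);
    curl_close($ch);
    if ($body === false) {
        $body = '';
    }

    // Hand the result straight back to Crowbar.
    fwrite($client, "HTTP/1.1 200 OK\r\n");
    fwrite($client, "Content-Type: text/html\r\n");
    fwrite($client, "Content-Length: " . strlen($body) . "\r\n");
    fwrite($client, "Connection: close\r\n\r\n");
    fwrite($client, $body);
    fclose($client);
}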
I really don’t fancy doing that – so this outsourced source code of yours, depending on how it works, could *really* help me out. It would certainly save me a LOT of time!!
So yeah, hehe, thanks a lot for your help – and thanks a lot for this blog! It’s probably the only resource on what Crowbar is that exists besides the under-loved and cryptic Crowbar homepage!! Good job!