How to scrape data from sites you can't log into

tarmac · on Feb 24, 2009

You're still technically logging in by providing the copied cookie. I don't see any difference here.

Or say the title should be changed to "How to scrape your data from your sites that require login"

almost · on Feb 24, 2009

Agreed, this more of a "how to scrape data from sites you can log into". It's not even a very useful example at that.

For anyone who wants to scrape sites that require login I'd recomend Python with Twill. That lets you do the whole thing with ease.

geoscripting · on Feb 24, 2009

Twill is an option indeed, but this way you don't miss out on the javascript. You can take advantage of all of your browser's features.

almost · on Feb 24, 2009

But the article is suggesting just copying cookies from Firefox into a simple Java based scraper. That won't support Javascript either.

If you needed Javascript you could use one of the Firefox scripting bridges (Selenium or MozRepl).

sho · on Feb 24, 2009

Huh? What he did is completely automatable, you don't have to touch Firefox. And if you really did need JS to log in, which I have never seen outside misguided banks, there's tools for that too. Selenium and JSSH come to mind but for 99% of sites you'd just need Mechanize.

And Java? Why the hell would anyone write a script in a compiled language like Java? Desperate for that 2ms time saving between 10 second waits for the pages to come down, eh? And any for-real scraping script would have a time delay built in anyway.

The guy doesn't know what he's talking about.

geoscripting · on Feb 24, 2009

The example wasn't written in Java for performance gain. It so happened that I had NetBeans open , and it was easier for me to write it in Java at the moment :).

sho · on Feb 24, 2009

Easier!? You wrote pages and pages describing the most inefficient way imaginable to do something I can do in 5 lines of Ruby, and you call it easier? And unless I'm very much mistaken, you'd have to compile the code anew whenever the cookie changed?

Well, good luck to you, and the more script kiddies you confuse the better I guess, but there are seriously much better ways to do this. Go look at Ruby Mechanize (I think it's also available for python); coming from Java you will be blown away by just how easy this kind of thing is. How do you think we all test? ; )

Update: Oh I see you know Mechanize from another article. So why not just use that ... you do know it can do all that logging in stuff for you, right?

geoscripting · on Feb 24, 2009

Yes, I have worked with mechanize before. I was using mechanize even when there only was the perl version. I added a comment to the article explaining my choice.

sho · on Feb 24, 2009

Fair enough. I guess the surprised reaction you're getting is because web testing frequently involves doing this kind of thing, so, being a community of web programmers, everyone here knows it backwards. I didn't really think of the angle you mentioned where someone wouldn't know all the relevant techniques and just want to get something working ASAP. For that, taking the cookie from FF might indeed be a time saver.

Anyway always good to see everyone chime in with their opinion so thanks for the conversation starter.

BTW, is anyone else nervous about the day the teenage h4xx0rs discover how easy this kind of thing is these days ..

timf · on Feb 24, 2009

Seems to me this is just saying: log in via the browser (with the right password), then use the generated cookie with the scraping code. You can also do that with wget --load-cookies

geoscripting · on Feb 24, 2009

True, you can use wget as well, but this way you have more control over your code.

timf · on Feb 24, 2009

For sure, I was responding to the title, it sounded like a security thing.

geoscripting · on Feb 24, 2009

What did you thought this will be about? :)

apgwoz · on Feb 24, 2009

But, if the site is using only HTTP and cookies, there's no reason not to first make a request to the login page with the username/password and retrieve the cookie via the "cookie" header that comes back... Did I totally misread the article, or was it just dumb?

timf · on Feb 24, 2009

A lot of scraping frameworks will handle it for you, too. It's even the front page example code @ http://jwebunit.sourceforge.net/

I think most people are reacting to the title... I for example thought it was a security posting.

apgwoz · on Feb 24, 2009

Of course there are frameworks to do this for you, and I'm sure jwebunit will work perfectly well. I guess I'm disappointed that the author didn't understand the fact that Firefox doesn't perform magic, and that a login page is no different than anything else being scraped.

ewiethoff · on Feb 24, 2009

I'm disappointed the author didn't realize that he doesn't need Cookie Monster to see Firefox cookies: Tools -> Options -> Privacy -> Show Cookies

geoscripting · on Feb 24, 2009

I think you misread the article.

apgwoz · on Feb 24, 2009

I definitely didn't. The article mentions that you login, pull out the cookies, copy the value INTO your code, and since his examples are using java, recompile it.

babo · on Feb 24, 2009

I do my scrapping with BeautifulSoup from Python, actually from an iPython shell. With a urllib2 opener you could easily handle cookies and UserAgent pretty easy, the later is also important for some sites.

geoscripting · on Feb 24, 2009

There's also mechanize for python. It can be found here : http://wwwsearch.sourceforge.net/mechanize/ . It handles cookies by default, and is a pretty good tool.

fugue88 · on Feb 25, 2009

man wget

wget --save-cookies cookies.txt ...

wget --load-cookies cookies.txt ...

rams · on Feb 24, 2009

Can someone suggest a browser-independent way of handling login pages that require javascript ? Twill doesn't handle javascript.

geoscripting · on Feb 25, 2009

You could give HtmlUnit a try.

spoiledtechie · on Feb 24, 2009

interesting.

Mystalic · on Feb 24, 2009

I wonder if this works every time...

And if it did, how long until a defense is made...

Retric · on Feb 24, 2009

There is nothing magic about web browsers, telent to port 80 at www.google.com and with a simple get request they will spit back their website. You can make it a little harder to do this stuff but a packet sniffer is always going to let you pretend to be any software you want unless they are using encryption. Also because Firefox is open source you can't prevent people from scripting using it anyway.

PS: I recommend all aspiring coders to telent to www.google.com at least once just to feel the magic.

timf · on Feb 24, 2009

"Firefox is open source you can't prevent people from scripting using it"

There are frameworks for doing that already, check out Selenium: http://seleniumhq.org/

geoscripting · on Feb 24, 2009

That's how it all starts :)

radu_floricica · on Feb 24, 2009

> PS: I recommend all aspiring coders to telent to www.google.com at least once just to feel the magic.

I'm pretty sure it's illegal in a couple of ways. </bitter>

But yes, the magic is there.

xenophanes · on Feb 24, 2009

illegal!? why!?

geoscripting · on Feb 24, 2009

It is recommended that you use the API that google provides for searching, but I fail to see why telnetting would be illegal. After all, both firefox and telnet use sockets to do their job.

radu_floricica · on Feb 24, 2009

Again, the comment above wasn't meant to be taken literally, but just because they both use sockets doesn't mean they're just as legal.

Imagine a web site which terms of service state you cannot use software to circumvent ads. Or where part of the security is done client-side (stupid, yes, but not impossible). Skipping the browser breaches at least the terms of service, and may be constructed as hacking. I think even Google discourages automated searching and prefers you use its api, which (at least some years ago) wasn't free for commercial use. I may be wrong in this particular case, but the important point is you may want to check the specific TOS before skipping the browser.

radu_floricica · on Feb 24, 2009

DMCA, of course. The reply was meant to be exaggerated, I hope telnet itself isn't illegal.

philh · on Feb 24, 2009

Some sites will expire the cookie after a period of inactivity, in which case this will stop working. But it can easily be refreshed by getting a new set of cookies, which as other people have pointed out is a process that can be automated.

cliffy · on Feb 24, 2009

You could make it very onerous to move around the site even if you're logged in. For example, after every few links you follow you have to pass another Turing test. That would be completely unreasonable though.

I think once a user is logged into your site, you'd have an extremely hard time defeating this sort of behavior without degrading the quality of user interaction or treading on legitimate use of your site. You may be able to defeat egregious abuses, such as scraping entire photo galleries in seconds, but even that can be defeated by a script that randomizes requests/times between requests.

geoscripting · on Feb 24, 2009

More and more websites are adding Turing tests after relatively short intervals of time. I consider that to be a good thing, and there are some good ways to stop/identify spiders. You can check out this site: http://stackoverflow.com/questions/450835/how-do-you-stop-sc... . It shows some pretty good techniques that one can use to defend from spiders.