Uncover Hidden Gems with Screen Scrapers
Several people in the forums have asked how to compile real estate data from public sources. One wanted a quick way to move data from foreclosure.com into a spreadsheet. Another needed an easier way to search results from Redfin.
"Screen scraping" is a programming technique used to extract data from websites. Both of the problems above can be solved with screen scraping.
Here in Lubbock, Texas, the county publishes property tax records on a public website. The site lets you search by address or by owner, but I wanted to search by taxes owed, out-of-town owners, and other criteria. So I wrote a screen scraper using Ruby and MySQL. It is a small program that runs only on my computer. I tell it which parts of Lubbock to scan, and it crawls through the tax website page by page, extracting data. It stores every property, along with the data about that property, in a database that is also on my computer. Once the data is in my own database, I can search it much more easily. This little program has uncovered many gems.
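To illustrate why having the data in your own database helps, here is a minimal sketch. It is not my actual program: it uses Python and SQLite instead of Ruby and MySQL so that it runs without any setup, and the table name, column names, and sample rows are all made up for the example.

```python
import sqlite3

# Hypothetical rows standing in for what a scraper might extract.
# These are made-up placeholders, not real tax records.
SAMPLE_ROWS = [
    ("101 Example St", "Owner A", "Lubbock", 0.0),
    ("202 Example Ave", "Owner B", "Dallas", 3200.50),
    ("303 Example Rd", "Owner C", "Lubbock", 1500.00),
    ("404 Example Ln", "Owner D", "Houston", 0.0),
]


def build_database(rows):
    """Create an in-memory database and load the scraped rows into it."""
    conn = sqlite3.connect(":memory:")
    # Assumed schema: one row per property.
    conn.execute(
        "CREATE TABLE properties ("
        "address TEXT, owner_name TEXT, owner_city TEXT, taxes_owed REAL)"
    )
    conn.executemany("INSERT INTO properties VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def find_out_of_town_delinquent(conn, local_city, min_owed):
    """Return properties whose owner lives elsewhere and owes at least min_owed."""
    cursor = conn.execute(
        "SELECT address, owner_name, taxes_owed FROM properties "
        "WHERE owner_city <> ? AND taxes_owed >= ? "
        "ORDER BY taxes_owed DESC",
        (local_city, min_owed),
    )
    return cursor.fetchall()


conn = build_database(SAMPLE_ROWS)
matches = find_out_of_town_delinquent(conn, "Lubbock", 1000)
for address, owner, owed in matches:
    print(address, owner, owed)
```

The county site cannot answer a question like "out-of-town owners who owe at least $1,000", but once the rows are in a local table it is a single query.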
Every web page is coded in HTML. In HTML, a page's content is wrapped in little pieces of code, called tags, that describe that content. A screen scraper is a program that has been "taught" what the HTML means for a particular page or site, so it knows where to find each piece of data. Every site is different, so most screen scrapers need to be customized per site.
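As a rough sketch of what "teaching" a program the HTML looks like, the Python example below pulls the cells out of a table using only the standard library. The HTML snippet, the class name TableScraper, and the field names are invented for illustration. A real site will use different tags and layout, which is why each scraper has to be customized.

```python
from html.parser import HTMLParser

# Made-up HTML standing in for a search results page.
SAMPLE_HTML = """
<table id="results">
  <tr><th>Address</th><th>Owner</th><th>Taxes Owed</th></tr>
  <tr><td>101 Example St</td><td>Owner A</td><td>$0.00</td></tr>
  <tr><td>202 Example Ave</td><td>Owner B</td><td>$3,200.50</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._current_row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._current_row = []
        elif tag == "td" and self._current_row is not None:
            self._in_cell = True
            self._current_row.append("")

    def handle_data(self, data):
        # Only keep text that appears inside a data cell.
        if self._in_cell:
            self._current_row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            # Header rows contain only <th> cells, so they stay empty and are skipped.
            if self._current_row:
                self.rows.append(self._current_row)
            self._current_row = None


def parse_dollars(text):
    """Turn a string like '$3,200.50' into the number 3200.5."""
    return float(text.replace("$", "").replace(",", ""))


scraper = TableScraper()
scraper.feed(SAMPLE_HTML)
records = [
    {"address": a.strip(), "owner": o.strip(), "taxes_owed": parse_dollars(t)}
    for a, o, t in scraper.rows
]
for record in records:
    print(record)
```

The rule "a row is a tr and each value is a td" is the part that is specific to this made-up page. Pointing the scraper at a different site means rewriting that rule to match the new site's HTML.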
It took me about 2 hours to write my basic screen scraper, but getting it just right took some testing and about 8 hours in total.
Writing a screen scraper is a simple task for a seasoned programmer. (If the site requires a login, it can be a little more difficult.) You can find programmers on Upwork.com. Or, find a relative who is learning to program and pay them $100 to build one for you. They'll have fun writing the program, have an opportunity to practice, and earn a little cash.
Warning: Many sites have copyright restrictions or terms of use that might prohibit screen scraping or limit what you can do with the scraped data. This article is purely technical and is not legal advice.
Comments (2)
Thanks for sharing this with me today.
I'm going to check out Upwork.com.
Another great source is young programmers at Texas Tech :) haha.
I'd like to play around with different ideas and see what they can build.
Austin Hughes, over 9 years ago
Actually it is a form of send-expect used to extract data from screens. It's been around much longer than the web - or even graphical interfaces.
... but that's just an old UNIX geek being pedantic ;-)
Roy N., over 9 years ago