In my recent nightly adventures, I’ve been working with my buddy’s start-up called PupBox. We’ve been spending a significant amount of time on outreach and other strategic ways to help build their customer base, acquire more backlinks, and spread the word about their amazing, fun product. This exercise has led me on a magical quest into the world of web scraping. Sit tight, and get ready for some eye-opening insights into how to be more efficient with your Internet research using web scraping techniques.
An Explanation Of Web Scraping
Web scraping, or web harvesting, is the technique of gathering specific data from websites using bots or web crawlers that mimic the actions of a human browsing a website. Using bots or crawlers, you can automate processes that would take significantly more time to perform manually. Web scraping also involves taking unstructured data and formatting it so that it can easily be transferred into Excel, CSV, or another desired file format.
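To make the "unstructured to structured" idea concrete, here is a minimal sketch in Python. The sample snippets, the "name | email" layout, and the function names are all assumptions made up for illustration; real pages will be messier.

```python
import csv
import io

# Hypothetical snippets, as they might be copied off a page (assumed "name | email" layout).
raw_snippets = [
    "Jane Doe | jane@example.com",
    "  John Smith|john@example.com  ",
]

def to_rows(snippets):
    """Split each 'name | email' snippet into a clean (name, email) pair."""
    rows = []
    for snippet in snippets:
        name, email = (part.strip() for part in snippet.split("|"))
        rows.append((name, email))
    return rows

def to_csv(rows):
    """Write the rows, with a header line, to a CSV-formatted string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "email"])
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = to_csv(to_rows(raw_snippets))
print(csv_text)
```

The resulting text can be saved to a .csv file and opened directly in Excel.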
Uses Of Web Scraping
Web scraping has many uses, driven by what the individual or company running the web crawler wants to achieve. Individuals and companies alike benefit from it. As an Internet marketer, I rely heavily on data that is both numeric and personal. When doing web research to identify key social influencers for content promotion, I need scrapers to gather the names and email addresses of specific individuals to make my strategy more effective. Companies like Padmapper rely on web scraping to deliver an experience to their users. Padmapper scrapes Craigslist for rental properties that match the criteria you specify. Please note: This led to a lawsuit between Craigslist and Padmapper.
Techniques For Web Scraping
The techniques for web scraping vary widely in effort and complexity. Some of the main web harvesting techniques are as follows:
Copy and pasting
This one is exactly what you’d expect. It means literally going to a website, copying the information you need, and pasting it into the document of your choice. If you really want to get crazy, you can go ahead and right-click to select all.
Text grepping and regular expression matching
Text grepping means using grep, a command-line utility that searches plain text for lines that match a regular expression. It was originally developed for Unix but is now available on other operating systems (OS). An example of a regular expression can be found here.
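As a rough illustration, here is a small grep-like function written in Python. It is only a sketch: the function name grep, the sample page text, and the phone-number pattern are assumptions chosen for the example, not part of any real tool.

```python
import re

def grep(pattern, text):
    """Return the lines of text that contain a match for the regular expression."""
    regex = re.compile(pattern)
    return [line for line in text.splitlines() if regex.search(line)]

# Hypothetical plain text pulled from a web page.
page_text = """Welcome to our store
Call us at 555-0100
Open Monday to Friday
Fax: 555-0199"""

# Three digits, a dash, four digits: a simple phone-number pattern.
phone_lines = grep(r"\d{3}-\d{4}", page_text)
print(phone_lines)
```

Like the command-line tool, this keeps whole lines that match rather than just the matching fragment.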
HTTP programming
HTTP programming is when you retrieve static or dynamic web pages by sending HTTP requests to a remote web server using socket programming. There appear to be many different approaches to this form of programming, so rather than fumble through an explanation, click here.
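For the curious, here is a minimal sketch of what sending an HTTP request over a socket can look like in Python. The function names are my own, and the sketch assumes plain HTTP on port 80 with no redirects, HTTPS, or chunked responses, so treat it as an illustration rather than a production fetcher.

```python
import socket

def build_get_request(host, path="/"):
    """Build the raw bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",  # ask the server to close the socket when done
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")

def split_response(raw):
    """Split a raw HTTP response into (status_line, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line = head.split(b"\r\n", 1)[0].decode("iso-8859-1")
    return status_line, body

def fetch(host, path="/", port=80, timeout=10):
    """Send the request over a TCP socket and return (status_line, body)."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_get_request(host, path))
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:  # empty read means the server closed the connection
                break
            chunks.append(chunk)
    return split_response(b"".join(chunks))
```

Calling fetch("example.com") needs a network connection. In day-to-day scraping, a higher-level library such as Python's built-in urllib.request handles these details for you.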
HTML parsers
An HTML parser allows you to mine data by detecting a common script, template, and/or code on a specific website or web page. This is generally carried out with one of the main programming or query languages, such as XQuery, HTQL, Python, Java, or PHP. The mined data is then extracted and translated into the desired structured format (or file type).
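Here is a minimal sketch of the idea using the HTML parser that ships with Python. The LinkExtractor class and the sample markup are assumptions for illustration; the "common template" in this case is simply that every item of interest is an a tag.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (href, link text) pairs from every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag we are currently inside, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

# Hypothetical page fragment that follows a repeating template.
sample_html = """
<ul>
  <li><a href="/about">About us</a></li>
  <li><a href="/contact">Contact</a></li>
</ul>
"""

parser = LinkExtractor()
parser.feed(sample_html)
print(parser.links)
```

Each pair in parser.links is already a structured row, ready to be written to a spreadsheet or CSV file.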
DOM parsing
DOM parsing is the practice of retrieving dynamic content generated by client-side scripts that execute in a web browser such as Internet Explorer (barf), Mozilla Firefox, or Google Chrome. Client-side scripts are usually embedded within an HTML or XHTML document. The browser parses the rendered page into a Document Object Model (DOM) tree, and a program can then pull the parts it needs out of that tree and transfer them into your specified format.
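The sketch below shows only the "walk the DOM tree" half of this technique, using Python's built-in xml.dom.minidom. It does not run any scripts. It assumes a browser has already rendered the page and that the result was saved as well-formed XHTML. The markup, class names, and function name are hypothetical.

```python
from xml.dom import minidom

# Hypothetical markup, assumed to be saved from a browser after scripts have run.
rendered_markup = """<html>
  <body>
    <div id="listings">
      <p class="listing">Studio apartment</p>
      <p class="listing">Two-bedroom house</p>
      <p class="ad">Sponsored</p>
    </div>
  </body>
</html>"""

def listing_texts(markup):
    """Parse the markup into a DOM tree and return the text of each listing node."""
    dom = minidom.parseString(markup)
    texts = []
    for node in dom.getElementsByTagName("p"):
        # Keep only paragraphs tagged as listings; skip ads and anything else.
        if node.getAttribute("class") == "listing":
            texts.append(node.firstChild.data.strip())
    return texts

print(listing_texts(rendered_markup))
```

For pages that build their content with scripts, you would first need a real browser (or a tool that drives one) to produce the rendered markup.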
Web scraping software
Ahh.. finally something that is easier to explain! There is a seemingly endless number of web scraping software options available on the Internet. All you need to do is determine what information you are looking to scrape and then hit Google to find what you need. Some tools make my life as an SEO much easier, such as ScrapeBox, ScreamingFrog, and URLProfiler. I’ve downloaded these to my desktop and run them almost daily. There are other, more recent web scraping tools, such as Mozenda, Kimono Labs, and Import.io, which allow you to easily select the web page elements you would like to extract. These elements are dumped into structured columns and rows in an automated fashion and exported into an Excel file or even a custom API.
Personal Example Of Web Scraping
As briefly mentioned in the article intro, I’ve been spending my evenings experimenting with ways to extract data from websites. I’ve tried almost every tool imaginable and I’m still demoing new ones daily. Most recently, I created my own social media scraper for scraping social profiles from websites and also an email address scraper for gathering email addresses to do content promotion.
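To give a flavor of how scrapers like these can work, here is a minimal sketch in Python. It is not my actual scraper: the regular expressions are deliberately simple, the list of social networks is an assumption, and the sample HTML is made up.

```python
import re

# Simple patterns for illustration; real-world addresses and URLs vary more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SOCIAL_RE = re.compile(
    r"https?://(?:www\.)?(?:twitter|facebook|linkedin)\.com/[A-Za-z0-9_./-]+"
)

def extract_contacts(html):
    """Return (emails, social profile URLs) found in the HTML, de-duplicated and sorted."""
    emails = sorted(set(EMAIL_RE.findall(html)))
    profiles = sorted(set(SOCIAL_RE.findall(html)))
    return emails, profiles

# Hypothetical page source.
sample_html = """
<p>Write to <a href="mailto:hello@example.com">hello@example.com</a></p>
<a href="https://twitter.com/example">Twitter</a>
<a href="https://www.facebook.com/example.page">Facebook</a>
"""

emails, profiles = extract_contacts(sample_html)
print(emails)
print(profiles)
```

Using a set removes the duplicate that appears when an address shows up both in a mailto link and in the visible text.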
Aside from creating these web scrapers that can scrape any URL looking for specific information, I’ve been creating custom one-off scripts that run only on the website for which they are designed. This helped me tremendously when working on a client project. Our initial effort involved paying a freelancer to gather contact details (e.g., first name, last name, and email address) from a specific online database. The woman was able to gather the requested information for about 30 people per hour. So we paid her for 10 hours of work and she got information for about 300 individuals. This wasn’t bad considering her hourly rate was $5. I later went on to create a custom script that blew her efforts out of the water! The script cost $60 and can gather contact details at a much more rapid pace. Using the new script, I was able to gather contact details for 4,000 people in 10 minutes.
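The numbers above can be worked through in a few lines of Python. The figures come straight from this example; the variable names are my own, and the per-contact cost for the script only counts that first 4,000-contact run.

```python
# Freelancer: $5 per hour, about 30 contacts per hour, 10 hours of work.
freelancer_rate_per_hour = 5.00
freelancer_contacts_per_hour = 30
freelancer_hours = 10

freelancer_cost = freelancer_rate_per_hour * freelancer_hours          # total paid
freelancer_contacts = freelancer_contacts_per_hour * freelancer_hours  # total gathered
freelancer_cost_per_contact = freelancer_cost / freelancer_contacts

# Script: $60 one-time cost, 4,000 contacts in 10 minutes.
script_cost = 60.00
script_contacts = 4000
script_minutes = 10

script_cost_per_contact = script_cost / script_contacts
script_contacts_per_hour = script_contacts * 60 / script_minutes

# How many times faster the script gathers contacts than the freelancer.
speedup = script_contacts_per_hour / freelancer_contacts_per_hour

print(f"Freelancer: ${freelancer_cost:.2f} total, ${freelancer_cost_per_contact:.3f} per contact")
print(f"Script:     ${script_cost:.2f} total, ${script_cost_per_contact:.3f} per contact")
print(f"Script is {speedup:.0f}x faster per hour")
```

Because the script is a one-time purchase, its cost per contact keeps dropping every time it runs again.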