
Introducing the Custom Web Scraper

Posted on: July 23rd, 2014 by Gareth Brown in How To

The custom scraper is URL Profiler’s web scraping tool, which allows you to quickly extract data from thousands of URLs. Unlike many of the other solutions available, it extracts information from the raw page source, so it can pick up anything that is not rendered in the browser.

Some stuff you can scrape:

  • Text
  • URLs
  • Tracking codes
  • HTML
  • Structured Markup
  • Inline JavaScript and CSS
  • And more

The scraper extracts data from each of the URLs in your list. It can be configured to scrape up to 10 different data values, using a mixture of regex patterns, XPath expressions and CSS/jQuery selectors. The data is then neatly returned in your results spreadsheet. This function opens up an array of possibilities, from scraping user information to collecting data for forensic audits.
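To make the idea concrete, here is a minimal Python sketch (stdlib only) of what one scrape pass does for each URL in your list: apply each configured selector to the page source and record the extracted values as one row of the results spreadsheet. The HTML snippet, field names and patterns are invented for illustration; this is not URL Profiler's actual implementation.

```python
import re

# Invented sample page source; in URL Profiler this is fetched per URL.
html = """
<html><body>
  <h1 class="title">Patrick Coombe</h1>
  <a class="site" href="https://example.com">Website</a>
</body></html>
"""

# Each "data value" pairs a label with a pattern, like the scraper's
# ten configurable slots (regex shown here; CSS/XPath work the same way).
selectors = {
    "name": r'<h1 class="title">(.*?)</h1>',
    "website": r'<a class="site" href="(.*?)">',
}

row = {}
for label, pattern in selectors.items():
    match = re.search(pattern, html)
    row[label] = match.group(1) if match else ""

print(row)  # one spreadsheet row per URL
```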

How it works

The custom scraper function can be turned on by clicking on the Custom Scraper option within the Content Analysis section.

url-profiler-custom-scraper-dashboard

This will open the configuration window, where you define the data you want to scrape. Each data selector is configured using four options.

url-profiler-custom-scraper-panel

1. Selector Type

There are three selector types:

  1. CSS Selector – also accepts jQuery selector patterns
  2. XPath – uses XPath 1.0 syntax
  3. Regex – slow, and for advanced use only

TIP: The CSS/jQuery selector is the most efficient way to scrape data. It is quicker and less memory-intensive than XPath and Regex.
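All three selector types can target the same element; they just use different syntax. The sketch below, using only Python's standard library and an invented snippet, shows a CSS selector (as a string only, since stdlib Python has no CSS engine), the XPath equivalent, and the regex equivalent:

```python
import re
import xml.etree.ElementTree as ET

# Invented, well-formed snippet for illustration.
snippet = '<div><span class="karma">42</span></div>'

# CSS/jQuery selector, as you would type it into URL Profiler:
css_selector = "span.karma"

# XPath 1.0 equivalent (ElementTree supports this subset):
via_xpath = ET.fromstring(snippet).find(".//span[@class='karma']").text

# Regex equivalent -- works, but brittle and slower on large pages:
via_regex = re.search(r'<span class="karma">(.*?)</span>', snippet).group(1)

print(via_xpath, via_regex)  # 42 42
```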

2. Data Type

The data type tells URL Profiler which data you want returned:

  1. Inner Text – the text within the selected element
  2. Inner HTML – the HTML within the selected element
  3. Attribute – the attribute value within the selected HTML element
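
The difference between the three data types can be shown with a short stdlib-Python sketch. The snippet is invented and well-formed so that `xml.etree.ElementTree` can parse it; this only illustrates the distinction, not URL Profiler's internals:

```python
import xml.etree.ElementTree as ET

# Invented, well-formed snippet for illustration.
snippet = '<div class="tagline">CEO at <a href="https://example.com">Example</a></div>'
div = ET.fromstring(snippet)
a = div.find("a")

inner_text = "".join(div.itertext())            # Inner Text: all text, tags stripped
inner_html = (div.text or "") + "".join(
    ET.tostring(child, encoding="unicode") for child in div)  # Inner HTML: markup kept
attribute = a.get("href")                       # Attribute: value of a named attribute

print(inner_text)   # CEO at Example
print(attribute)    # https://example.com
```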

3. Attribute Name

The Attribute Name field is only available if you have selected the Attribute data type. For example, if you want the URL from an anchor tag, you would enter href.

4. Selector

This is where you enter the CSS selector, XPath expression or regex pattern, depending on which selector type you chose.

TIP: The CSS Selector Type will also work with jQuery selectors.

Note: This is the first version of the data scraper; it does not currently scrape content generated by JavaScript or Ajax.

A Working Example

Matt Barby has a great tutorial on how to use SEO Tools for Excel to scrape communities. The Custom Scraper in URL Profiler makes a great alternative, as it can process and scrape the data far faster than Excel, and it works on Apple’s OS X!

In this example I’m going to show you how to scrape user information from the Inbound.org community. I’m not going to cover how to collect the users’ URLs, but if you’d like to follow along you can download this CSV to import.

I’m also not advocating you scrape Inbound; that would be rude 😉

inbound-profile-animation

You’re going to collect:

  1. User’s Name
  2. Job title
  3. Place of work
  4. User’s website URL
  5. Social network links
  6. Karma score
  7. Share count
  8. Number of followers
  9. Number of users being followed

To complete this task you’ll need Google Chrome, or to know a little about CSS or jQuery selectors. I’m sure you can do this with Firefox, but I’d rather use a superior browser!

1. Open URL Profiler and Import or paste in your URLs

scraper-import-users

2. To retrieve social network links select Social Accounts within URL Level Data

scraper-social-accounts

3. Let’s configure the scraper

scraper-custom-scraper

Select the Scrape Data option in the Content Analysis section. This will open the Custom Scraper settings panel, where you will enter the selectors for each piece of data to collect.

Data 1, the User’s Name – choose CSS Selector for the Selector Type and Inner Text for the Data Type (the Attribute field will be disabled, as it’s not used).

scrape-data-1

Next, select the element’s CSS Path using Google Chrome. You will use this same procedure for each of the data items:

  1. Open Google Chrome and navigate to one of the user profiles
  2. Right click on the text you want to scrape, in this case the Patrick Coombe header, then click Inspect element, which opens the Developer Tools panel
  3. Right click on the highlighted element
  4. Select Copy CSS Path
  5. Move back to URL Profiler and paste the path into the Selector Path field

body > div.banner > div > h1

scrape-panel-1
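The copied CSS path simply mirrors the element's position in the document tree. As a sanity check, here is a stdlib-Python sketch that walks the same path with ElementTree's XPath subset, against an invented, simplified version of the profile markup:

```python
import xml.etree.ElementTree as ET

# Invented, simplified profile page matching the path body > div.banner > div > h1.
page = """
<html><body>
  <div class="banner"><div><h1>Patrick Coombe</h1></div></div>
</body></html>
"""
root = ET.fromstring(page)

# CSS "body > div.banner > div > h1" expressed as an ElementTree XPath:
h1 = root.find("./body/div[@class='banner']/div/h1")
print(h1.text)  # Patrick Coombe
```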

Data 2, the Job Title – select the job title element’s CSS Path using Google Chrome, following the same procedure as above.

scrape-data-2

body > div.banner > div > div.tagline

scrape-panel-2

Data 3, the company name – again we’re going to use the CSS Path.

scrape-data-3

body > div.banner > div > div.tagline > a

scrape-panel-3

Data 4, the website URL – copy the CSS Path again, but this time, to get the URL, we’re going to grab the HREF attribute.

scrape-data-4

body > div.banner > div > div.social > a:last

scrape-panel-4

Data 5, Karma – we’re now going to switch to XPath. You will notice the parent tag contains a span containing the word Karma. All we want is the number, so an XPath selector is the best choice.

scrape-data-5

/html/body/div[1]/div/ul/li[1]/text()

scrape-panel-5
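The `text()` step in that XPath returns just the text node of the list item, rather than the whole element. If you do end up scraping the full element text instead, a short regex can still isolate the number afterwards; the example value below is invented for illustration:

```python
import re

# Invented example of what the li's text might look like if scraped whole.
li_text = "1,234 Karma"

# Keep only the leading number (digits and thousands separators):
karma = re.search(r"[\d,]+", li_text).group(0)
print(karma)  # 1,234
```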

Data 6 Shares, Data 7 Followers and Data 8 Following – Use Copy XPath to get the rest of the counts.

scrape-panel-8

That’s it. Click the Apply button and run the profiler. And, if you’re interested, here are the results.

Gareth Brown

By Gareth Brown

I do all the work and don't have time to write posts. Even if I did, Patrick would sign them off as his. He's even got a bigger profile photo – what does that tell you? Follow me on Twitter or circle me on

If You Like the Sound of URL Profiler,
Download a Free Trial Today

(You'll be amazed by how much time it saves you, every day!)

  • Free 14 day trial (full feature)
  • No credit card required
  • License from only £12.95 a month

Comments

  • Tom

    Cool :). Although I hate XPath for its lack of flexibility (can’t scrape a specific part of an html element or look for similar patterns). I prefer regular expressions: http://blog.tkacprow.pl/excel-scrape-html-add/

  • Jonathan Jones

    Cool. Is there a way to grab anchor text from links? I’ve got the grabbing of the URL all down, but would need the anchor text as well.

    • Jonathan Jones

      Ah, never mind. The ‘Link Analysis’ feature does the job well! 🙂

      • HathawayP

        Yep, you figured it out – link analysis will do the anchor text for you, and all sorts of other stuff.

  • neting

    Hi Gareth
    can we use URL profiler to scrape a list of expired domains from https://www.expireddomains.net?
    I’m trying to get a list of expired domains from a page such as
    https://www.expireddomains.net/deleted-com-domains/

    Each expired domains is in a list of tags which can be selected as > a.namelinks
    I’ve tried but it seems not working…
    Thanks

    Luca

    • HathawayP

      Hi Luca,
      The Custom Scraper can only scrape 10 items from a single page, so this sort of page is not really suitable.

      However, with this specific example, why don’t you just register for an account and export the results?

      Thanks,
      Patrick

  • Rekham Khan

    i can not extract location from twitter profiles…its not working for me :/
