2017-04-25

In this article, I will cover what I believe are the most common values you would want to scrape from a website. Think along the lines of:

  • the content of h1, h2, h3 tags etc.
  • anchor tags
  • the links to all of the images
  • the alt tags of images

Tools

In these code examples I will use C# with HtmlAgilityPack. In the past few years, I have used HtmlAgilityPack to scrape content from well over 100 million websites. It does have some key shortcomings, most notably its inability to deal with pages that load dynamic content with JavaScript. But by and large, it is likely to be perfectly adequate in many situations.

There is another actively maintained library called AngleSharp, which will be easier to use if you are comfortable with jQuery-style CSS selectors. AngleSharp will probably be the better choice for many coders. I have found it to be generally faster than HtmlAgilityPack at scraping the same content. However, I have managed to make it hit 100% CPU usage when using it aggressively, for example when running 20 concurrent scrapes. I have successfully used HtmlAgilityPack in up to 40 concurrent processes, and it has really proven its worth. Whenever the CPU maxed out or a program hung, it was always the result of terrible HTML on a website. This is rare, so we won't get too bogged down on that!
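To give a flavour of the difference, here is a rough sketch of the same kind of lookup using AngleSharp's CSS selectors. This is not taken from my demo repo, and the configuration calls may differ a little between AngleSharp versions:

using AngleSharp;

// inside an async method
var context = BrowsingContext.New(Configuration.Default.WithDefaultLoader());
var angleSharpDocument = await context.OpenAsync(url);

// QuerySelectorAll takes a CSS selector rather than an XPath expression
foreach (var heading in angleSharpDocument.QuerySelectorAll("h1"))
    Debug.WriteLine(heading.TextContent);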

Source Code

The source code here is over on GitHub - https://github.com/ahernc/scraping-demo

Loading the content

Think of using a web browser: you will type a URL, wait for the page to load, and look at the content. To do this in code, you will use an HtmlWeb object and read the content from the webpage into an HtmlDocument. The HtmlDocument is the object we then use to parse the content we need.

var web = new HtmlWeb();
var document = web.Load(url);

In my demo code, I have a few extra things like PreRequestHandler and PostResponseHandler. The code in the PreRequestHandler gives you a bit of extra control over the HTTP request, and may help you avoid bot detection. The PostResponseHandler gives you some extra information from the headers of the response.
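As a rough illustration of how these hang together (not the exact code from the repo, and the delegate signatures can vary slightly between HtmlAgilityPack versions), the handlers are assigned to the HtmlWeb object's PreRequest and PostResponse properties before calling Load:

var web = new HtmlWeb();

web.PreRequest = request =>
{
    // Tweak the outgoing request, e.g. present a browser-like user agent
    request.UserAgent = "Mozilla/5.0 (compatible; my-scraper)";
    return true; // returning true allows the request to proceed
};

web.PostResponse = (request, response) =>
{
    // Inspect the status code and headers of the response
    Debug.WriteLine(response.StatusCode);
    Debug.WriteLine(response.Headers.ToString());
};

var document = web.Load(url);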

Finding some values

Values are found using XPath syntax. Here, I will walk through some of the examples I have written in the sample code.

h1

To find the h1 tags, the query is simply //h1. Running this query with HtmlAgilityPack returns an HtmlNodeCollection, which is a collection of HtmlNode objects.

var nodes = document.DocumentNode.SelectNodes("//h1");

Before you iterate through the HtmlNodeCollection, you should always check that the above query did not return null. This applies to every XPath query you run against any given website. The HTML varies hugely between websites, so you can never be certain that an XPath query will return what you expect. After the null check, step through each item to see its content.

if (nodes != null)
    foreach (var node in nodes)
    {
        // InnerHtml is the whole html... 
        Debug.WriteLine(TidyValue(node.InnerHtml));

        // InnerText is what's between the opening and closing of the h1 tag
        Debug.WriteLine(TidyValue(node.InnerText));
    }

As stated in the comments, InnerHtml returns the full HTML of the node, including attributes such as CSS classes and data attributes. For example, it might return something like <h1 class="blue-heading" data-id="12345">The Headlines </h1>. Realistically, you are probably more interested in InnerText, which in this case would return only what sits between the tags.

In the sample source code, the above code is in a function I called FindHTags. There is a variable called nodeType which you can set to pull h3, h4, h5 tags etc.
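The function itself is roughly along these lines. This is a reconstruction rather than a copy-paste from the repo, and it reuses the TidyValue helper from above:

private static void FindHTags(HtmlDocument document, string nodeType)
{
    // nodeType is the tag name to look for, e.g. "h1", "h2", "h3"...
    var nodes = document.DocumentNode.SelectNodes($"//{nodeType}");
    if (nodes != null)
        foreach (var node in nodes)
        {
            // InnerText strips the markup and leaves just the text content
            Debug.WriteLine(TidyValue(node.InnerText));
        }
}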

meta tags

Websites will typically have at least two meta tags: one whose name attribute is description and another whose name attribute is keywords. The XPath syntax to find the meta tag with the description name is //meta[@name='description']. Now we can look at how to select a single HtmlNode to find the meta description:

var metaDescription = document.DocumentNode.SelectSingleNode("//meta[@name='description']");
if (metaDescription != null)
{
    if (metaDescription.Attributes["content"] != null)
        Debug.WriteLine($"meta description: {metaDescription.Attributes["content"].Value}");
}

To pull the keywords, the syntax is very similar. Just change the value in the @name part of the query to keywords:

var metaKeywords = document.DocumentNode.SelectSingleNode("//meta[@name='keywords']");
if (metaKeywords != null)
{
    if (metaKeywords.Attributes["content"] != null)
        Debug.WriteLine($"meta keywords: {metaKeywords.Attributes["content"].Value}");
}

JavaScript content

In the demo source code, you will see a function called PrintContentOfScriptTags. This pulls the content of all JavaScript on the page. The syntax is very similar to that of the h1 tags: just place the type of element you want to retrieve from the HTML after the two forward slashes.

private static void PrintContentOfScriptTags(HtmlDocument document)
{
    Debug.WriteLine(line); // "line" is just a separator string defined elsewhere in the demo code

    var nodes = document.DocumentNode.SelectNodes("//script");
    if (nodes != null)
        foreach (var node in nodes)
        {
            Debug.WriteLine(node.InnerText);
        }
}

Combining a "contains" clause in an attribute query

Suppose you want to pull all of the script tags which reference a specific JavaScript file. For example, the HTML contains something like <script src="/somefile.js"></script>. We saw earlier how to select tags with an exact attribute value. Here we can apply a contains clause against the @src attribute.

The example here is somewhat basic in that all we are looking for is the collection of scripts whose src attribute contains /js, but you get the idea.

private static void GetScriptTagsWithMatchingSrcAttribute(HtmlDocument document)
{
    Debug.WriteLine(line);
    // This should yield some results: referenced scripts within a /js folder.
    // Change for the site you are checking... 
    var nodes = document.DocumentNode.SelectNodes("//script[contains(@src,'/js')]");

    // And just output the full html
    if (nodes != null)
        foreach (var node in nodes)
        {
            // The full html of the node is in the OuterHtml... 
            Debug.WriteLine(node.OuterHtml);
        }
}

Finding all of the anchors, and displaying the complete URLs

The query here is simple: //a.

var anchors = document.DocumentNode.SelectNodes("//a");

But not so fast! I have thrown in some extra functionality in the demo code to help you list all of the URLs in a more user-friendly fashion. There are a few "gotchas" that will trip you up if you pull the href attribute of an anchor tag and blindly assume that the link will still be usable if you save it for later. The possible variations are along the lines of:

  • href="/some_page"
  • href="../../some_folder/another_folder/index.html"
  • href="//some_folder/somepage.php?id=123"
  • href="http://www.thesite.com/some_page.php?id=123"

The best case scenario is always the last one, i.e. where the protocol is included and the link starts with http:// or https://. These are guaranteed to be correct (or at least should be!). The trickier ones are the relative links, where you need to traverse back up towards the site's root.

Take a look in the function FindAllTheAnchors in the source code if you want to see all of these scenarios covered.
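If you just want the gist of it, a minimal sketch using the built-in Uri class looks something like this. It is not the exact code from the repo, and url is assumed to be the same address we passed to web.Load earlier:

var anchors = document.DocumentNode.SelectNodes("//a[@href]");
if (anchors != null)
    foreach (var anchor in anchors)
    {
        var href = anchor.Attributes["href"].Value;

        // Uri resolves "/some_page", "../folder/index.html" and "//host/page"
        // against the base URL; absolute http:// and https:// links pass through unchanged.
        Uri absolute;
        if (Uri.TryCreate(new Uri(url), href, out absolute))
            Debug.WriteLine(absolute.AbsoluteUri);
    }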

And if you are stuck trying to pull out certain tags, email me. I might even add the answer here!
