Excel VBA videos | Excel VBA Part 47 - Browsing to Websites and Scraping Web Page

Posted by Andrew Gould on 14 November 2016

Scraping a web page in VBA involves getting a reference to an HTML document and then processing its elements to extract the useful parts and write them to another location in a readable format. This video shows you how to do that using both Internet Explorer and basic XML HTTP requests. You'll learn a bit about the Document Object Model, as well as how to identify HTML elements by name and by tag type. You'll also see how to loop through various collections of HTML elements and their children using an example involving exchange rate tables.

You can download any files that you need to follow the video here.

You can increase the size of the video:

Full screen mode for YouTube

You can view the video in full screen mode as shown on the left, using the icon at the bottom right of the frame.

You can also increase the quality of the video:

Changing resolution

You can improve the resolution of the video using another icon at the bottom right of the frame. This will slow down the connection speed, but increase the display and sound quality. This icon only becomes visible when you start playing the video.

Finally, if nothing happens when you play the video, check that you're not using IE in compatibility view.

This page has 5 threads Add post
28 Jun 17 at 00:20

Hi Andy, I was following your lecture 47 on web scraping.  I am pretty certain the first time i ran it everthing worked fine however now i keep getting a runtime error 13 type mismatch  on the following line :

 Set HTMLDoc = IE.Document

I have tried both excel 2010 & 2016, and get the same problem. Here is the code:

Option Explicit
Sub BrowseTo()
Dim IE As New SHDocVw.InternetExplorer
Dim HTMLDoc As New MSHTML.HTMLDocument
IE.Visible = True
IE.Navigate ("http://x-rates.com")
Do While IE.ReadyState <> READYSTATE_COMPLETE
Loop
Debug.Print IE.Document, HTMLDoc     'produces output:  [object HTMLDocument]       [object]
Set HTMLDoc = IE.Document
End Sub

In debug mode if I hover the mouse over IE.Document it displays:

<The remote server machine does not exist or is unavailable> usually in excel 2016 it displays <Automation error The interface is unknown> or the first message.

This problem is driving me nuts as the code is so simple, but just will not work. Any help would be greatly appreciated.

Thanks Axel

28 Jun 17 at 06:56

Hi Axel,

I'm unable to replicate the problem after copy-pasting your code in Excel 2010 and 2016. Microsoft has this to say about the error message - is this possibly a case of bad timing? Could either your network or the XRates website have been experiencing issues at the point you were testing your code?

25 May 17 at 15:15

I'm trying to get information on all pages of a website, but I can only get it from the first page.  The pagination uses JavaScript links like this:

<a href="javascript:__doPostBack('ctl00$cphCenter$gvCompany','Page$1')"> 1 </a>
<a href="javascript:__doPostBack('ctl00$cphCenter$gvCompany','Page$1')"> 2 </a>


Could you please teach me how to click and follow these links to scape the information on the next page (57 pages in total)?  Thank you!

25 May 17 at 20:33

I'm not sure if you'll be able to do this.  The website was built using ASP.NET.  If you click on one of the paging buttons, it will submit another request to the server, which will then build a new page of HTML and send this back to you.  Have a look at the URL shown in the address bar of your browser.  If this changes, you'll be ableo to run your VBA scraping code on this page also, but if it doesn't everything is being done at the client side, and your VBA code won't be able to get at the information on the different pages.

 

26 May 17 at 15:25

Thanks Andy for your reply, I found the solution for my example, I have to execute javascript code from VBA by calling execScript command:

        i = 2
 
        For i = 2 To 57
            Call objIE.Document.parentWindow.execScript("javascript:__doPostBack('ctl00$cphCenter$gvCompany','Page$" & i & "');", "JavaScript")
        Next i

I discovered that this code can only run without error if I do not use any comment. If I set any comment in this code, the execution will give me error 80020101: Automation error. I really don't understand why the code is affected by comments. :)

Andy B  
26 May 17 at 15:47

Thanks for taking the time to post the reply - a really good solution, which I'm sure will help others in the same situation.

29 Apr 17 at 02:09

Hello Andrew! Greetings from Rio, Brazil!

I perfectly rounded your macro and was very happy with the results. But I found a wall that I could not overcome.

How do I pull data from a table that returns fields surrounded by {{field name}}?

Do you have a solution for this case? Ever seen something like that?

02 May 17 at 08:50

Hi! I'm sorry but I'm not sure - which web page are you attempting to get your data from?

12 Apr 17 at 12:58

Hi Andrew,

Many thanks for your reply, I have implemented the changes

and the routine works perfectly.

Best regards

Tony95

13 Apr 17 at 06:40

You're welcome, Tony!

11 Apr 17 at 14:46

Hi Andrew

I have been watching your videos for some time and must say thanks for the very

informative content. I am trying to adapt the code from part 47 but I only want to loop

over 1 table. I have tried adding (0) to the end of line "Set HTMLTables = HTMLPage.getElementsByTagName("table")

but it produces an error. Any ideas please?

Best regards

Tony95

11 Apr 17 at 21:44

Hi Tony,

When you add (0) to the end of the line you're changing the type returned by the instruction. The (0) indicates that you're returning a single item but the HTMLTables variable has a type of IHTMLElementCollection. Just change the type of variable to IHTMLElement (or use the HTMLTable variable instead), something like this:

Sub ProcessHTMLPage(HTMLPage As MSHTML.HTMLDocument)

    Dim HTMLTable As MSHTML.IHTMLElement
    Dim HTMLRow As MSHTML.IHTMLElement
    Dim HTMLCell As MSHTML.IHTMLElement
    Dim RowNum As Long, ColNum As Integer
    
    Set HTMLTable = HTMLPage.getElementsByTagName("table")(0)
        
    Worksheets.Add
    Range("A1").Value = HTMLTable.className
    Range("B1").Value = Now
    
    RowNum = 2
    For Each HTMLRow In HTMLTable.getElementsByTagName("tr")
        
        ColNum = 1
        For Each HTMLCell In HTMLRow.Children
            Cells(RowNum, ColNum) = HTMLCell.innerText
            ColNum = ColNum + 1
        Next HTMLCell
        
        RowNum = RowNum + 1
    Next HTMLRow
            
End Sub