Excel VBA videos | Excel VBA Part 47 - Browsing to Websites and Scraping Web Page

Posted by Andrew Gould on 14 November 2016

Scraping a web page in VBA involves getting a reference to an HTML document and then processing its elements to extract the useful parts and write them to another location in a readable format. This video shows you how to do that using both Internet Explorer and basic XML HTTP requests. You'll learn a bit about the Document Object Model, as well as how to identify HTML elements by name and by tag type. You'll also see how to loop through various collections of HTML elements and their children using an example involving exchange rate tables.

You can download any files that you need to follow the video here.

You can increase the size of the video:

Full screen mode for YouTube

You can view the video in full screen mode as shown on the left, using the icon at the bottom right of the frame.

You can also increase the quality of the video:

Changing resolution

You can improve the resolution of the video using another icon at the bottom right of the frame. This will slow down the connection speed, but increase the display and sound quality. This icon only becomes visible when you start playing the video.

Finally, if nothing happens when you play the video, check that you're not using IE in compatibility view.

This page has 7 threads Add post
15 Aug 17 at 21:56

Hi Wise Owls!

Thank you for your very interesting videos regarding VBA. Learnt a lot, but much to learn remains. I´m working with a quite large workbook with football results and after I´ve watched your video (#47) I progressed quite a bit. By trial and error, and some help I solved the problem with results like 1-1 turning into a date in my workbook. Have couple of remaining issues, and if any of this issues has been adressed in previous videos, my apologies.

First, when running the code VBA creates a number of sheets, based on website I´m scraping I get three new sheets. The information I want is in the first sheet (Sheet1). Sheet2 and 3 contains some information, but of no use to me. I tried:

Dim Sheet1 as Worksheet 

Set Sheet1 = ActiveWorksheet

What then happens is that the values earlier placed in the new created Sheet1, 2 and 3 ends up as a mess in Sheeet1.I think I´m on to something but I´ve come to a stop.

Since it´s football I´m dealing with, it´s nice with livescores. how can I loop the VBA/scrape the website every minute and overwrite previous values?

The last one is probaly easy, but my patience is tempted to the limit. How can I assign a table to start in a specific cell?

Keep up the good work!

Freddie

 

16 Aug 17 at 11:45

if all you're doing is importing a table of data, I think there's a much, much easier way to do this.  Our network is being worked on at the moment, but I'll blog on this in the next day or two, and update this post when I've published the blog with a link.

21 Aug 17 at 13:40

As promised, here's the blog!

30 Jun 17 at 19:54

Hi

Taken from this web page:http://www.skysports.com/racing/racecards/beverley/27-06-2017/781785/watch-live-racing-and-bet-with-racing-uk-handicap

I have the following :

<p class="print-hide">

<span class="v5-txt-high">Dutch Artist (IRE) </span>

5-1 (9-7) Held up and always towards rear, never on terms, last of 11, 22 1/4l behind Dark Crystal (8-12) at Ayr 7f hcp (5) sft in Jun.

</p>

and can get either the name ("Dutch Artist") or the name & text in one cell , but cannot cannot split the two, ie, name in column A & text in column B

Can you help?

03 Jul 17 at 10:17

You could always read the entire contents into a cell and use some string functions to separate the two parts. Everything to the left of the first ( is the name and everything to the right is the text.  Something along these lines:

    Dim s As String, sName As String, sText As String
    
    s = Range("A1").Value
    
    sName = Left(s, InStr(s, "(") - 1)
    sText = Mid(s, InStr(s, "("))
    
    Debug.Print sName, sText

I hope that helps!

28 Jun 17 at 00:20

Hi Andy, I was following your lecture 47 on web scraping.  I am pretty certain the first time i ran it everthing worked fine however now i keep getting a runtime error 13 type mismatch  on the following line :

 Set HTMLDoc = IE.Document

I have tried both excel 2010 & 2016, and get the same problem. Here is the code:

Option Explicit
Sub BrowseTo()
Dim IE As New SHDocVw.InternetExplorer
Dim HTMLDoc As New MSHTML.HTMLDocument
IE.Visible = True
IE.Navigate ("http://x-rates.com")
Do While IE.ReadyState <> READYSTATE_COMPLETE
Loop
Debug.Print IE.Document, HTMLDoc     'produces output:  [object HTMLDocument]       [object]
Set HTMLDoc = IE.Document
End Sub

In debug mode if I hover the mouse over IE.Document it displays:

<The remote server machine does not exist or is unavailable> usually in excel 2016 it displays <Automation error The interface is unknown> or the first message.

This problem is driving me nuts as the code is so simple, but just will not work. Any help would be greatly appreciated.

Thanks Axel

28 Jun 17 at 06:56

Hi Axel,

I'm unable to replicate the problem after copy-pasting your code in Excel 2010 and 2016. Microsoft has this to say about the error message - is this possibly a case of bad timing? Could either your network or the XRates website have been experiencing issues at the point you were testing your code?

28 Jun 17 at 12:30

Hi Andrew, I apprecaite your prompt reply. I don't think that there is a problem with the website as i have tried a few different sites, and retired many times. Also the page appears in the exploere window that opens up, so i presume it should just be referencing something that has already been loaded. The example following this using XML HTTP requests seems to work fine. I thought it might be antivirus(kaspersky), but again the page is loaded by the internet explorer. Is it possible that it is something do with the enviroment.

Andrew G  
29 Jun 17 at 08:07

Hi Axel, it sounds like you've done all the sensible things to eliminate the potential simple causes. I'm afraid that I'm at a loss - I simply can't replicate the issue that you're having even with a straight copy-paste of your code. Other than Googling the exact text of the message and trawling the various forums to find people who've had the same issue I'm not sure what to suggest.

25 May 17 at 15:15

I'm trying to get information on all pages of a website, but I can only get it from the first page.  The pagination uses JavaScript links like this:

<a href="javascript:__doPostBack('ctl00$cphCenter$gvCompany','Page$1')"> 1 </a>
<a href="javascript:__doPostBack('ctl00$cphCenter$gvCompany','Page$1')"> 2 </a>


Could you please teach me how to click and follow these links to scape the information on the next page (57 pages in total)?  Thank you!

25 May 17 at 20:33

I'm not sure if you'll be able to do this.  The website was built using ASP.NET.  If you click on one of the paging buttons, it will submit another request to the server, which will then build a new page of HTML and send this back to you.  Have a look at the URL shown in the address bar of your browser.  If this changes, you'll be ableo to run your VBA scraping code on this page also, but if it doesn't everything is being done at the client side, and your VBA code won't be able to get at the information on the different pages.

 

26 May 17 at 15:25

Thanks Andy for your reply, I found the solution for my example, I have to execute javascript code from VBA by calling execScript command:

        i = 2
 
        For i = 2 To 57
            Call objIE.Document.parentWindow.execScript("javascript:__doPostBack('ctl00$cphCenter$gvCompany','Page$" & i & "');", "JavaScript")
        Next i

I discovered that this code can only run without error if I do not use any comment. If I set any comment in this code, the execution will give me error 80020101: Automation error. I really don't understand why the code is affected by comments. :)

Andy B  
26 May 17 at 15:47

Thanks for taking the time to post the reply - a really good solution, which I'm sure will help others in the same situation.

29 Apr 17 at 02:09

Hello Andrew! Greetings from Rio, Brazil!

I perfectly rounded your macro and was very happy with the results. But I found a wall that I could not overcome.

How do I pull data from a table that returns fields surrounded by {{field name}}?

Do you have a solution for this case? Ever seen something like that?

02 May 17 at 08:50

Hi! I'm sorry but I'm not sure - which web page are you attempting to get your data from?

12 Apr 17 at 12:58

Hi Andrew,

Many thanks for your reply, I have implemented the changes

and the routine works perfectly.

Best regards

Tony95

13 Apr 17 at 06:40

You're welcome, Tony!

11 Apr 17 at 14:46

Hi Andrew

I have been watching your videos for some time and must say thanks for the very

informative content. I am trying to adapt the code from part 47 but I only want to loop

over 1 table. I have tried adding (0) to the end of line "Set HTMLTables = HTMLPage.getElementsByTagName("table")

but it produces an error. Any ideas please?

Best regards

Tony95

11 Apr 17 at 21:44

Hi Tony,

When you add (0) to the end of the line you're changing the type returned by the instruction. The (0) indicates that you're returning a single item but the HTMLTables variable has a type of IHTMLElementCollection. Just change the type of variable to IHTMLElement (or use the HTMLTable variable instead), something like this:

Sub ProcessHTMLPage(HTMLPage As MSHTML.HTMLDocument)

    Dim HTMLTable As MSHTML.IHTMLElement
    Dim HTMLRow As MSHTML.IHTMLElement
    Dim HTMLCell As MSHTML.IHTMLElement
    Dim RowNum As Long, ColNum As Integer
    
    Set HTMLTable = HTMLPage.getElementsByTagName("table")(0)
        
    Worksheets.Add
    Range("A1").Value = HTMLTable.className
    Range("B1").Value = Now
    
    RowNum = 2
    For Each HTMLRow In HTMLTable.getElementsByTagName("tr")
        
        ColNum = 1
        For Each HTMLCell In HTMLRow.Children
            Cells(RowNum, ColNum) = HTMLCell.innerText
            ColNum = ColNum + 1
        Next HTMLCell
        
        RowNum = RowNum + 1
    Next HTMLRow
            
End Sub