This page has 9 threads | Add post

Blogs

Finding our blogs useful? Please consider helping us to share more!

Advice on how to scrape tables of data or HTML from webpages using VBA Part three of a three-part series of blogs
You can use VBA to extract data from web pages, either as whole tables or by parsing the underlying HTML elements. This blog shows you how to code both methods (the technique is often called "web-scraping"). Two ways to get data from websites using Excel VBA Extracting a table of data from a website using a VBA query Scraping a website HTML in VBA (this blog)

Advice on how to scrape tables of data or HTML from webpages using VBA

Part three of a three-part series of blogs

You can use VBA to extract data from web pages, either as whole tables or by parsing the underlying HTML elements. This blog shows you how to code both methods (the technique is often called "web-scraping").

Two ways to get data from websites using Excel VBA
Extracting a table of data from a website using a VBA query
Scraping a website HTML in VBA (this blog)

Posted by Andy Brown on 13 January 2014

In this blog

Step 1 - Defining the problem
Step 2 - Analysing the target website
Step 3 - Referencing the required applications
Step 4 - Getting at the underlying HTML
Step 5 - Parsing the HTML

Scraping a website HTML in VBA

This blog shows how to go through a website, making sense of its HTML within VBA. We'll break the problem down into several chunks - to whit:

Defining what we want to achieve.
Analysing the target URL (the target website).
Referencing the required applications.
Getting at the underlying HTML.
Parsing the data into Excel.

Don't forget that websites change all the time, so this code may no longer work when you try it out as the format of the StackOverflow website may have changed. The following code also assumes that you have Internet Explorer on your computer (something which will be true of nearly all Windows computers).

Step 1 - Defining the problem

At the time of writing, here is what the above-mentioned StackOverflow website's home page looks like:

The home page lists out the questions which have been asked most recently.

From this we want to extract the raw questions, with just the votes, views and author information:

What the answer should look like. The list of questions changes by the second, so the data is different!

To do this we need to learn the structure of the HTML behind the page.

To scrape websites you need to know a little HTML, and knowing a lot will help you enormously.

Step 2 - Analysing the target website

In any browser you can right-click and choose to show the underlying HTML for the page:

How to show the HTML for a webpage in FireFox (the Internet Explorer, Chrome, Safari and other browser options will be similar).

The web page begins with Top Questions, so let's find that:

Press CTRL + F to find the given text.

Analysing the HTML which follows this shows that the questions are all encased in a div tag called question-mini-list:

We'll loop over all of the HTML elements within this div tag.

Here's the HTML for a single question:

The question contains all of the data we want - we just have to get at it!

Here's how we'll get at the four bits of data we want:

Data	Method
Id	We'll find the div tag with class question-summary narrow, and extract the question number from its id.
Votes	We'll find the div tag with class name votes, and look at the inner text for this (ie the contents of the div tag, ignoring any HTML tags). By stripping out any blanks and the word vote or votes, we'll end up with the data we want.
Views	An identical process, but using views instead of votes.
Author	We'll find the tag with class name started, and look at the inner text of the second tag within this (since there are two hyperlinks, and it's the second one which contains the author's name).

Step 3 - Referencing the required applications

To get this macro to work, we'll need to:

Create a new copy of Internet Explorer in memory; then
Work with the elements on the HTML page we find.

To do this, you'll need to reference two object libraries:

Library	Used for
Microsoft Internet Controls	Getting at Internet Explorer in VBA
Microsoft HTML Object Library	Getting at parts of an HTML page

To do this, in VBA choose from the menu Tools --> References, then tick the two options shown:

You'll need to scroll down quite a way to find each of these libraries to reference.

Now we can begin writing the VBA to get at our data!

Step 4 - Getting at the underlying HTML

Let's now show some code for loading up the HTML at a given web page. The main problem is that we have to wait until the web browser has responded, so we keep "doing any events" until it returns the correct state out of the following choices:

Enum READYSTATE

READYSTATE_UNINITIALIZED = 0

READYSTATE_LOADING = 1

READYSTATE_LOADED = 2

READYSTATE_INTERACTIVE = 3

READYSTATE_COMPLETE = 4

End Enum

Here a subroutine to get at the text behind a web page:

Sub ImportStackOverflowData()

'to refer to the running copy of Internet Explorer

Dim ie As InternetExplorer

'to refer to the HTML document returned

Dim html As HTMLDocument

'open Internet Explorer in memory, and go to website

Set ie = New InternetExplorer

ie.Visible = False

ie.navigate "http://stackoverflow.com/"

'Wait until IE is done loading page

Do While ie.readyState <> READYSTATE_COMPLETE

Application.StatusBar = "Trying to go to StackOverflow ..."

DoEvents

Loop

'show text of HTML document returned

Set html = ie.document

MsgBox html.DocumentElement.innerHTML

'close down IE and reset status bar

Set ie = Nothing

Application.StatusBar = ""

End Sub

What this does is:

Creates a new copy of Internet Explorer to run invisibly in memory.
Navigates to the StackOverflow home page.
Waits until the home page has loaded.
Loads up an HTML document, and shows its text.
Closes Internet Explorer.

You could now parse the HTML using the Document Object Model (for those who know this), but we're going to do it the slightly harder way, by finding tags and then looping over their contents.

Step 5 - Parsing the HTML

Here's the entire subroutine, in parts, with comments for the HTML bits. Start by getting a handle on the HTML document, as above:

Sub ImportStackOverflowData()

'to refer to the running copy of Internet Explorer

Dim ie As InternetExplorer

'to refer to the HTML document returned

Dim html As HTMLDocument

'open Internet Explorer in memory, and go to website

Set ie = New InternetExplorer

ie.Visible = False

ie.navigate "http://stackoverflow.com/"

'Wait until IE is done loading page

Do While ie.readyState <> READYSTATE_COMPLETE

Application.StatusBar = "Trying to go to StackOverflow ..."

DoEvents

Loop

'show text of HTML document returned

Set html = ie.document

'close down IE and reset status bar

Set ie = Nothing

Application.StatusBar = ""

Now put titles in row 3 of the spreadsheet:

'clear old data out and put titles in

Cells.Clear

'put heading across the top of row 3

Range("A3").Value = "Question id"

Range("B3").Value = "Votes"

Range("C3").Value = "Views"

Range("D3").Value = "Person"

We're going to need a fair few variables (I don't guarantee that this is the most efficient solution!):

Dim QuestionList As IHTMLElement

Dim Questions As IHTMLElementCollection

Dim Question As IHTMLElement

Dim RowNumber As Long

Dim QuestionId As String

Dim QuestionFields As IHTMLElementCollection

Dim QuestionField As IHTMLElement

Dim votes As String

Dim views As String

Dim QuestionFieldLinks As IHTMLElementCollection

Start by getting a reference to the HTML element which contains all of the questions (this also initialises the row number in the spreadsheet to 4, the one after the titles):

Set QuestionList = html.getElementById("question-mini-list")

Set Questions = QuestionList.Children

RowNumber = 4

Now we'll loop over all of the child elements within this tag, finding each question in turn:

For Each Question In Questions

'if this is the tag containing the question details, process it

If Question.className = "question-summary narrow" Then

Each question has a tag giving its id, which we can extract:

'first get and store the question id in first column

QuestionId = Replace(Question.ID, "question-summary-", "")

Cells(RowNumber, 1).Value = CLng(QuestionId)

Now we'll loop over all of the child elements within each question's containing div tag:

'get a list of all of the parts of this question,

'and loop over them

Set QuestionFields = Question.all

For Each QuestionField In QuestionFields

For each element, extract its details (either the integer number of votes cast, the integer number of views or the name of the author):

'if this is the question's votes, store it (get rid of any surrounding text)

If QuestionField.className = "votes" Then

votes = Replace(QuestionField.innerText, "votes", "")

votes = Replace(votes, "vote", "")

Cells(RowNumber, 2).Value = Trim(votes)

End If

'likewise for views (getting rid of any text)

If QuestionField.className = "views" Then

views = QuestionField.innerText

views = Replace(views, "views", "")

views = Replace(views, "view", "")

Cells(RowNumber, 3).Value = Trim(views)

End If

'if this is the bit where author's name is ...

If QuestionField.className = "started" Then

'get a list of all elements within, and store the

'text in the second one

Set QuestionFieldLinks = QuestionField.all

Cells(RowNumber, 4).Value = QuestionFieldLinks(2).innerHTML

End If

Next QuestionField

Time now to finish this question, increase the spreadsheet row count by one and go on to the next question:

'go on to next row of worksheet

RowNumber = RowNumber + 1

End If

Set html = Nothing

Finally, we'll tidy up the results and put a title in row one:

'do some final formatting

Range("A3").CurrentRegion.WrapText = False

Range("A3").CurrentRegion.EntireColumn.AutoFit

Range("A1:C1").EntireColumn.HorizontalAlignment = xlCenter

Range("A1:D1").Merge

Range("A1").Value = "StackOverflow home page questions"

Range("A1").Font.Bold = True

Application.StatusBar = ""

MsgBox "Done!"

End Sub

And that's the complete macro!

As the above shows, website scraping can get quite messy. If you're going to be doing much of this, I recommend learning about the HTML DOM (Document Object Model), and taking advantage of this in your code.

If you've learnt something from this blog and wonder how much more we could teach you, have a look at our online and classroom VBA courses.

Parts of this blog
Two ways to get data from websites using Excel VBA Extracting a table of data from a website using a VBA query Scraping a website HTML in VBA (this blog)

Some other pages relevant to the above blogs include:

This blog has 9 threads

Add a new post

From:	tanny
When:	12 Jun 18 at 18:13

I have been trying to follow your videos and extract data from an HTML file and I was able to do so on my system. BUT when I try to use the same code/module to extract data from the html file from my Friend's system it shows me an "error 91: object variable or with block not set" at line 2 and sometime even at line 1. Basically it is not reading the html document on a different system. The HTMLButton1 and HTMLButton2 when tried in debug.print show nothing

Public Type MyType

date As String
protocol As String

End Type

sub main()

Dim file As String
Dim IE As New SHDocVw.InternetExplorerMedium
Dim HTMLDoc As MSHTML.HTMLDocument
Dim HTMLButton1
Dim HTMLButton2

file = "C:\temp\HTML\page1.html"

IE.Visible = False
IE.navigate file 'line 1
Do While IE.Busy = True
Loop

Set HTMLDoc = IE.Document
Set HTMLButton1 = HTMLDoc.elementFromPoint(338, 462)
Set HTMLButton2 = HTMLDoc.elementFromPoint(314, 490)

  array_objects(index).date = HTMLButton1.innerText 'line2
array_objects(index).protocol = HTMLButton2.innerText

IE.Quit
Set IE = Nothing

end sub

From:	tanny
When:	13 Jun 18 at 17:02

Hi Andrew,

Thank you so much for your quick response !!

We both are using the same IE versions and same screen resolution, the destination system is not able to parse the html document due to some reason.As a result I have decided to try reading the html document as a text file without using IE, hopefully that should work !!

Really appreciate your help.

From:	Andrew G
When:	14 Jun 18 at 07:10

No problem! Sorry that the suggestions didn't work - it's frustrating to not know why!

I hope you have more success with your workaround.

From:	Andrew G
When:	13 Jun 18 at 07:18

Hi, I'm not familiar the ElementFromPoint method but, after reading the documentation here, I wonder if it's due to a difference in the configuration of your friend's display. Are they using the same screen resolution as your machine?

There is also some suggestion here that the version of IE being used can affect how the method works.

I hope that helps!

From:	vdiniz
When:	11 Jan 18 at 16:55

Hello Andy,

Thank you so much for this great and well explained post.

I have one question: When I tried to implement your code, I only managed to make it work by changing the following line, as bellow:

Set Questions = QuestionList.all (instead of Set Questions = QuestionList.Children)

Do you know why can't I get access to the elements using the Children property?

Thanks in advance,

Diniz

From:	Andy B
When:	12 Jan 18 at 08:30

I don't, sadly! Anyone else have any light to throw on the subject?

From:	A Tong
When:	23 May 19 at 12:10

First, thanks for the code. It is very useful.

Second, I experienced the same issue. The cause is that i found there is one extra <div> tag after "question-mini-list" from my browser result. Maybe the stackoverflow website is updated or due to different version of browser.

I solved it by getting the children of the children (something like QuestionList.Children.Item(0).Children).

QuestionList.All is also working but the "for" loop will have much more cycles as "all" will drill down the children/grandchildren of element and treats them as independent.

From:	KishorePalli
When:	17 Nov 17 at 10:23

I want to fetch a detail from a webpage using Excel macros. When I click a button in the web page, It opens a new tab, there I will have the detail to be fetched. But my macro is clicking the button in the current web page - it is not going to the next newly opened tab. I've published the full code on StackOverflow.

From:	BYIB
When:	26 Jun 17 at 04:32

Many thanks for a great post which I've now successfully modified to scrape 2 sites I need data from. I can get the innertext but am struggling to get some additional data like an href giving the URL.

In the past, I have copied the HTML into a spreadsheet and parsed the html lines there to get at what I want. I now want to automate this hence my interest in your post. Two questions:

Downloading the HTML of one line so I can manually parse it

Using the element.innerText, I can extract the innerText from the following but what I want to do is to get the URL and each piece of innerText individually to put in fields in Excel:

<h3 class="title">
<a title="I need a freelance data entry &amp; admin " class="job js-paragraph-crop" data-height="65" href="https://www.peopleperhour.com/job/i-need-a-freelance-data-entry-admin-1619501">I need a freelance data entry & admin </a>
<span class="job-etiquettes">
<span class="etiquette orange">Urgent</span>
/span>
</h3>
<ul class="clearfix member-info horizontal crop">
<li class="hidden-phone">
<i class="fpph fpph-clock-wall"></i>
<span class="hidden-xs">Posted</span>
<time class="crop value" title="26 June 2017">3 hours ago</time>
</li>
<li class="hidden-xs job-location crop js-tooltip" title="The Job can be done remotely from any location">
<i class="fpph fpph-location"></i>
<span class="">Remote</span>
</li>
<li class="hidden-phone">
<i class="fa fa-dot-circle-o"></i>Proposals<span class="value proposal-count">19</span>
</li>

The innerText shows:

Title, Urgent, Time ago posted, Remote, # proposals, etc.

Is there a command I can use to get all the source html text from the "<a" line so I can parse it myself or can you tell me how to access the individual fields including the URL?

Downloading the Full HTML into Excel

A I mentioned, I can manipulate HTML once I paste it into Excel. If I cannot easily do the above, is there a way of pasting all the HTML from a webpage into Excel so I can use my manual methods?

Many thanks in advance for your help; I have spent many hours trawling the Internet for answers but couldn't find anything!

Nick

From:	Andy B
When:	26 Jun 17 at 09:29

Parsing HTML in VBA is a bit tedious. To start you off, here is some code to read through HTML to pick out the hyperlinks:

Sub ReadHtml()

    'part of the HTML containing <a> tag
    Const Html As String = "<h3 class=""title""><a title=""I need a freelance data entry &amp; admin "" class=""job js-paragraph-crop"" data-height=""65"" href=""https://www.peopleperhour.com/job/i-need-a-freelance-data-entry-admin-1619501"">I need a freelance data entry & admin </a>"

    Const AngleStart As String = "<"
    Const AngleEnd As String = ">"

    Dim StartAngle As Integer
    Dim EndAngle As Integer

    StartAngle = InStr(1, Html, AngleStart)

    'having found first angle bracket, keep going till find last one
    Do Until StartAngle <= 0

        If LCase(Mid(Html, StartAngle + 1, 1)) = "a" Then

            'this is a hyperlink
            EndAngle = InStr(StartAngle + 1, Html, AngleEnd)

            Debug.Print Mid(Html, StartAngle, EndAngle - StartAngle + 1)

        End If

        'find next tag
        StartAngle = InStr(StartAngle + 1, Html, AngleStart)

    Loop

End Sub

What you then need to do is to take the hyperlink returned, and do a similar thing on that to look through for the double-quote symbols and find the attributes of the <a> tag. So I haven't given the answer, just a start.

With regard to the second question, you could copy the HTML to the clipboard, go to a cell in a worksheet and run:

ActiveSheet.Paste

However, you're probably better off creating a single string variable, joining all of the lines returned from the HTML together into a single string of text, then storing this in the variable. You can then parse the text that this variable contains.

From:	demha
When:	09 Jun 17 at 13:18

Hi! First of all, thanks so much for your post. I have next to no experience with VBA or any coding really and I've found this very useful as I try to build up an excel sheet. I haven't gotten my version to work yet, mainly I'm trying to loop through a list with a ul tag with li containers but cant seem to get it to work. Im trying to pull the tips for each horse racing meeting from this site but am unable to. Any guidance would be very appreciated!

From:	Andy B
When:	09 Jun 17 at 16:01

Thanks for your kind comments. No particular tips I'm afraid! For you (or anyone else) who does this sort of thing for a living, I would consider writing the code in Visual Basic in .NET, using Windows Forms. You'd have to download Visual Studio Community edition (which is free), and there is a LONG learning curve, but you can then add in a reference to something called the HTML Agility Pack, which makes it easier to parse HTML tags.

For anyone who hasn't got a spare month or two to learn how to program in .NET, just keep working at it and everything will gradually become easier!

From:	Rondiman
When:	27 Apr 17 at 18:57

Is there a way to scrape pages when you have to click something to get access? It's beyond web scraping, but i would like for example to enter a user name and a password, display a menu, select an item, and finally get access to the page with the information i want to scrape. (Even though i don't know how to code it, i heard about Selenium and Beautiful soup for Python). So, could this be done from vba excel?

Also, where can i find a reference to the library Internet Object Control and Internet HTML, methods, functions, etc please?

From:	Andy B
When:	27 Apr 17 at 21:31

It would be difficult - I think impossible - to initiate a chain of actions like this using VBA. The problem is that interaction with a web page has two parts: server and client. Typically when you request a web page, the web server creates a page of HTML and sends this to your client computer, where the browser (Firefox, Chrome, IE, etc) presents it as a web page.

At this point there is no remaining link with the web server, and the only way you can interact with the web page is by running client script (typically using JavaScript or a derivative language called JQuery). However, there's no way to get VBA to run JavaScript on a web page to mimic user interaction.

With regard to your other post, I don't know of any library of functions, but I'm sure Googling it would throw one up.

From:	LamarW
When:	24 Feb 17 at 00:26

I'm using VBA for Office. Mostly from MSAccess 2016.

I've been trying to scrape some info from a screen. The info is/are all the items in a dropdown list box. When I 'Inspect Element' I can see this:

<select name="selectfilter" size="1">
<option style="color: fuchsia; font-weight: bold;" value="0">All Calls

(lines omitted)

<option style="color: black; font-weight: normal;" value="30">CE 0503 Open
</option></select>

However, I still haven't figured out how to retrieve the options between the <select> and </select> tags.

I think I know how to use the info after I get it. Any help would be very much appreciated!!

A newbie on this site

From:	Andy B
When:	24 Feb 17 at 09:05

I don't know the answer to your question, but here's a thought. The dropdown list has no id and no class name, so it's hard to identify. However, it will be contained within an outer item which WILL have an id. You could always locate this (adapting the code in the blog), then loop over the elements within it to find the SELECT tag you want.

I realise that's only the outline of a solution; can anyone else improve on this?

From:	misiek44
When:	12 Feb 17 at 22:15

I am trying to pull just one row of data from an HTML table. Using Data- > Get External Data -> From Web option is very tedious, there are over 30 individual tables. These tables reside on my pc. I can open them in a web browser, but I do not how to parse them since they are not located on the web. I am trying to automate the process to bring the data into Excel. So far, I have not found any info relating to this specifc case. I would appreciate any info on this subject.

From:	Andy B
When:	13 Feb 17 at 14:23

I think you're probably reading the wrong post! If your data are in HTML format, the best thing to do would be to write a macro to open each file, and save it again, but this time in Excel format. You'd then be able to write a macro to loop over the Excel files, getting the first row of each.

If, on the other hand, your data's already in CSV or even XLS format, you can just write a VBA macro to loop over all of the files opening each in turn. This is beyond the scope of this reply, although you might find this video useful.

From:	misiek44
When:	14 Feb 17 at 09:42

Thanks Andy. Ithink I have the solution. I have recorded macro when I used the Data->Get External Data->From Web and looked at the code I have noticed that

it uses QueryTables.Add(Connection:="URL;file:///L:/T420/F/01%20(2).htm". Since the local files I have are in the HTM format (very basic structure) the query recognizes that. So, all I have to do is put those tables in an array, use loop and I,m in business. Once again thank you very much for Your help.

Cheers.

From:	jsimpson523
When:	03 Nov 16 at 01:27

"You could now parse the HTML using the Document Object Model (for those who know this),..."

Anyone know where I can find more information about this?

From:	Andrew G
When:	03 Nov 16 at 07:46

Hi, here's the link to MIcrosoft's documentation for the Document Object Model

https://msdn.microsoft.com/en-us/library/hh772384(v=vs.85).aspx

Hope that helps!

Add a new post

ALL BLOGS

VBA BLOGS

VBA MACROS BLOGS

Scraping a website HTML in VBA

Step 1 - Defining the problem

Step 2 - Analysing the target website

Step 3 - Referencing the required applications

Step 4 - Getting at the underlying HTML

Step 5 - Parsing the HTML

Head office

London

Manchester