WiseOwl Training - Established 1992 Wise Owl Training

Established May 1992
30 years in business
Wise Owl Training
30 years in business
See 519 reviews for our classroom and online training
Extracting external links from websites using Power BI
Did you know that you could use Power BI as a tool to scrape websites, using the Html.Table M function? We didn't either until now!

Posted by Andy Brown on 20 April 2022

You need a minimum screen resolution of about 700 pixels width to see our blogs. This is because they contain diagrams and tables which would not be viewable easily on a mobile phone or small laptop. Please use a larger tablet, notebook or desktop computer, or change your screen resolution settings.

Extracting external links from websites using Power BI

Did you know that you can scrape websites using Power BI Desktop? 

List of links

The start of a list of all of the external links in a couple of websites.

It's amazing what we learnt while writing our new Advanced Power BI (Data) course!

The example below uses some pages from a publicly available demo website, but it would be much more fun to get a list of the external sites linked to by websites controlled by your employer, your competitors or your friends!

Step 1 - Get a list of websites

The first thing to do is to create a list of websites to visit in some form.  Here's the one we'll use:

List of websites

If you're short of time, you could always download the above Excel workbook, allowing you to follow along.

Load this into Power BI:

Query in Power BI

The table as it appears in Power Query, after promoting the header row.

Note that you may need to confirm your credentials and privacy settings to load this data.

Step 2 - Getting the HTML

Now add a custom column in Power Query to get the HTML data for each website:

Adding a custom column

The tool to add a custom column.

 

You can use the Web.BrowserContents M function to return the HTML from each web page:

Web.BrowserContents function

This will give the HTML for each website visited.

This should return something like this:

HTML for each page

The HTML for each page - it may take a short while for this to appear, as Power Query is having to visit each site.

Step 3 - Getting the Links

You can now create a new column to use the Html.Table function to parse this HTML to extract just the HTML a tags (ie the hyperlinks):

Getting the links

See below for more on the syntax of the HTML.Table function.

You can then expand the results:

Expanding a table

Click on the icon as shown above to expand the table of results for each page.

For our example, here's the start of what this would give:

The links

The start of the links for each of our 3 web pages.

 

Step 4 - Getting only external links

The Html.Table function takes two main arguments:

Argument What it contains For our example
1 The name for the column you'll create "Page links"
2 A JQuery-style selector for what to show "a"

If you don't know JQuery you'll still be able to use and make sense of the following, but if you don't know HTML at all you may struggle!

Change the second argument to the following to show only those links whose href attribute begins with the letters http (ie the external links):

"a[href^=""http""]"

This should give the following for our example:

List of external links

You'd normally expect to see a host of different external links against each web page, but these are demo sites and so not typical (the first page doesn't have any external links at all!).

You can download the Power BI report containing this here (I've replaced the Excel data source for this to be a table of data, so the report should work as it stands). 

This blog has 0 threads Add post