You’ve probably used one of the popular tools like Ahrefs or Semrush to analyze your site’s backlinks.
These tools crawl the web to compile a list of sites that link to your website, along with domain ratings and other data describing the quality of your backlinks.
It’s no secret that backlinks play a big role in Google’s algorithms. So it makes sense to at least get to know your site before comparing it to your competitors.
While tools can give you insight into specific metrics, learning how to analyze your backlinks yourself gives you more flexibility in what you measure and how you view it.
Most analysis can be done in a spreadsheet, but Python has some advantages.
Besides the huge number of rows it can handle, it also makes it easier to examine statistical aspects such as distributions.
This column provides step-by-step instructions on how to visualize basic backlink analysis and customize reports using Python to account for different link attributes.
I’ve chosen a small UK furniture sector website as an example to illustrate some basic analysis using Python.
So what value do site backlinks have for SEO?
Simply put, quality and quantity.
To experts, quality is subjective, but to Google, it is determined by metrics such as authority and content relevance.
Start by assessing link quality using available data before assessing quantity.
Time to code.
```python
import re
import time
import random
import pandas as pd
import numpy as np
import datetime
from datetime import timedelta
from plotnine import *
import matplotlib.pyplot as plt
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
import uritools

pd.set_option('display.max_colwidth', None)
%matplotlib inline

root_domain = 'johnsankey.co.uk'
hostdomain = 'www.johnsankey.co.uk'
hostname = 'johnsankey'
full_domain = 'https://www.johnsankey.co.uk'
target_name = 'John Sankey'
```
Start by importing the data and cleaning up the column names to make them easier to work with and quicker to type in later stages.
```python
target_ahrefs_raw = pd.read_csv(
    'data/johnsankey.co.uk-refdomains-subdomains__2022-03-18_15-15-47.csv')
```
List comprehensions are a powerful, concise way to clean up column names.
```python
target_ahrefs_raw.columns = [col.lower() for col in target_ahrefs_raw.columns]
```
The list comprehension tells Python to convert the column names for each column (‘col’) in the columns of the dataframe to lowercase.
```python
target_ahrefs_raw.columns = [col.replace(' ', '_') for col in target_ahrefs_raw.columns]
target_ahrefs_raw.columns = [col.replace('.', '_') for col in target_ahrefs_raw.columns]
target_ahrefs_raw.columns = [col.replace('__', '_') for col in target_ahrefs_raw.columns]
target_ahrefs_raw.columns = [col.replace('(', '') for col in target_ahrefs_raw.columns]
target_ahrefs_raw.columns = [col.replace(')', '') for col in target_ahrefs_raw.columns]
target_ahrefs_raw.columns = [col.replace('%', '') for col in target_ahrefs_raw.columns]
```
It’s not strictly necessary, but I like to have a count column as standard for aggregations, and a single-value "project" column in case I need to group the entire table.
```python
target_ahrefs_raw['rd_count'] = 1
target_ahrefs_raw['project'] = target_name
target_ahrefs_raw
```
Screenshot from Pandas, March 2022
Now you have a dataframe with clean column names.
The next step is to clean up the actual table values to make them useful for analysis.
Make a copy of the previous dataframe and give it a new name.
```python
# .copy() creates a true copy; plain assignment would only alias the dataframe
target_ahrefs_clean_dtypes = target_ahrefs_raw.copy()
```
Clean the dofollow_ref_domains column, which tells you how many referring domains each linking site has.
In this case, convert the dashes to zeros and then cast the entire column as an integer.
```python
# referring_domains
target_ahrefs_clean_dtypes['dofollow_ref_domains'] = np.where(
    target_ahrefs_clean_dtypes['dofollow_ref_domains'] == '-',
    0, target_ahrefs_clean_dtypes['dofollow_ref_domains'])
target_ahrefs_clean_dtypes['dofollow_ref_domains'] = (
    target_ahrefs_clean_dtypes['dofollow_ref_domains'].astype(int))

# linked_domains
target_ahrefs_clean_dtypes['dofollow_linked_domains'] = np.where(
    target_ahrefs_clean_dtypes['dofollow_linked_domains'] == '-',
    0, target_ahrefs_clean_dtypes['dofollow_linked_domains'])
target_ahrefs_clean_dtypes['dofollow_linked_domains'] = (
    target_ahrefs_clean_dtypes['dofollow_linked_domains'].astype(int))
```
First_seen indicates the date the link was first found.
Convert the string to a date format that Python can handle, and use this later to derive the age of the link.
```python
# first_seen
target_ahrefs_clean_dtypes['first_seen'] = pd.to_datetime(
    target_ahrefs_clean_dtypes['first_seen'], format='%d/%m/%Y %H:%M')
```
Converting first_seen to a date also means that you can do time aggregation by month and year.
This is useful because links to your site may not be acquired every day.
```python
target_ahrefs_clean_dtypes['month_year'] = target_ahrefs_clean_dtypes['first_seen'].dt.to_period('M')
```
Link age is calculated as today’s date minus first_seen date.
Then convert it to numeric format and divide by the number of nanoseconds in a day (3600 · 24 · 10⁹) to get the number of days.
```python
# link age
target_ahrefs_clean_dtypes['link_age'] = (
    datetime.datetime.now() - target_ahrefs_clean_dtypes['first_seen'])
# cast the timedelta to integer nanoseconds, then divide down to days
target_ahrefs_clean_dtypes['link_age'] = target_ahrefs_clean_dtypes['link_age'].astype(int)
target_ahrefs_clean_dtypes['link_age'] = (
    target_ahrefs_clean_dtypes['link_age'] / (3600 * 24 * 1000000000)).round(0)
target_ahrefs_clean_dtypes
```

Now that the data types have been cleaned up and some new data features created, the fun begins!
Link Quality
The first part of the analysis evaluates link quality, using the describe() function to summarize the entire dataframe and get descriptive statistics for every column.
```python
target_ahrefs_analysis = target_ahrefs_clean_dtypes
target_ahrefs_analysis.describe()
```

From the table above, we can see the mean values, the number of referring domains (107), and the spread (e.g., the 25th percentile).
The average domain rating (corresponding to Moz’s domain authority) for referring domains is 27.
Is that a good thing?
It is difficult to know without competitor data for this market sector to compare against. This is where your experience as an SEO practitioner comes in.
But I’m sure we can all agree that it could be higher.
How much higher is another matter.
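If the default quartiles from describe() aren’t enough, you can request custom percentiles. A minimal sketch, using a toy dr column standing in for the real data:

```python
import pandas as pd

# Toy domain-rating column standing in for target_ahrefs_analysis['dr']
dr = pd.Series([0, 0, 5, 12, 27, 33, 48, 60, 74, 81])

# describe() accepts a custom percentiles list
summary = dr.describe(percentiles=[0.25, 0.5, 0.75, 0.9])
print(summary['mean'])  # 34.0
print(summary['90%'])   # 90th percentile of domain rating
```

This is handy when you care about the upper tail of your link profile, not just the median.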

The table above can be a bit dry and difficult to visualize, so plotting a histogram gives an intuitive view of the referring domains’ authority.
```python
dr_dist_plt = (
    ggplot(target_ahrefs_analysis, aes(x = 'dr')) +
    geom_histogram(alpha = 0.6, fill = 'blue', bins = 100) +
    scale_y_continuous() +
    theme(legend_position = 'right'))
dr_dist_plt
```

The distribution is heavily skewed, showing that most referring domains have an authority rating of zero.
Above zero, the distribution looks fairly even, with the same amount of domains across different levels of authority.
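The skew visible in the histogram can also be quantified. A quick sketch with toy data (many zero-rated domains plus a long right tail, mimicking the shape above):

```python
import pandas as pd

# Toy domain ratings: many zeros plus a long right tail, as in the histogram
dr = pd.Series([0] * 8 + [10, 20, 35, 70])

# A clearly positive skewness confirms the long right tail seen in the plot
skewness = dr.skew()
print(skewness)
```

A number well above zero backs up the visual impression without relying on eyeballing the chart.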
Link age is another important factor in SEO.
Take a look at the distribution below.
```python
linkage_dist_plt = (
    ggplot(target_ahrefs_analysis, aes(x = 'link_age')) +
    geom_histogram(alpha = 0.6, fill = 'blue', bins = 100) +
    scale_y_continuous() +
    theme(legend_position = 'right'))
linkage_dist_plt
```

Since most of the links are new, the distribution looks more normal, even if it is still skewed.
The most common link lifespan is around 200 days, which is less than a year, suggesting that most links were acquired recently.
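That reading can be checked numerically. A sketch with toy link ages (hypothetical values, not the site’s real data):

```python
import pandas as pd

# Toy link ages in days, standing in for target_ahrefs_analysis['link_age']
link_age = pd.Series([30, 90, 150, 200, 210, 220, 400, 800, 1200])

# The median age and the share of links under a year old
print(link_age.median())        # 210.0
share_recent = (link_age < 365).mean()
print(share_recent)             # fraction of links younger than a year
```

A median of a few hundred days and a majority share under 365 would support the "mostly recent links" interpretation.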
Out of interest, let’s see how link age correlates with domain authority.
```python
dr_linkage_plt = (
    ggplot(target_ahrefs_analysis, aes(x = 'dr', y = 'link_age')) +
    geom_point(alpha = 0.4, colour = 'blue', size = 2) +
    geom_smooth(method = 'lm', se = False, colour = 'red', size = 3, alpha = 0.4))
print(target_ahrefs_analysis['dr'].corr(target_ahrefs_analysis['link_age']))
# 0.1941101232345909
dr_linkage_plt
```

The plot (along with the 0.19 figure printed above) shows virtually no correlation between the two.
And why should there be?
A correlation would only imply that the higher-authority links were acquired early in the site’s history.
The reason for this lack of correlation will become apparent later.
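One caveat: .corr() defaults to Pearson correlation, which the skewed dr distribution can distort. A rank-based Spearman correlation is a more robust cross-check. A sketch with toy data (hypothetical values, not the site’s real numbers):

```python
import pandas as pd

df = pd.DataFrame({
    'dr':       [0, 0, 5, 12, 27, 48, 74],          # toy domain ratings
    'link_age': [900, 30, 200, 450, 120, 60, 1400], # toy ages in days
})

# Spearman correlates ranks rather than raw values, so outliers and
# skew in dr matter less than with the default Pearson method
rho = df['dr'].corr(df['link_age'], method='spearman')
print(rho)
```

If Spearman also comes out near zero on the real data, the "no correlation" conclusion is on firmer ground.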
Next, we will look at link quality over time.
Plotting each link’s domain rating against its literal date would make for a pretty cluttered and not very useful time series.
Instead, we calculate a running average of the domain rating by month.
Note the expanding() function, which tells Pandas to include all previous rows with each new row.
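To see concretely what expanding() does, here is a toy series where each output value is a statistic over everything seen so far:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])

# expanding().mean() is the running (cumulative) mean:
# 10, (10+20)/2, (10+20+30)/3, (10+20+30+40)/4
running = s.expanding().mean().tolist()
print(running)  # [10.0, 15.0, 20.0, 25.0]
```

This differs from rolling(), which would use a fixed-size window instead of an ever-growing one.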
```python
target_rd_cummean_df = target_ahrefs_analysis
target_rd_mean_df = target_rd_cummean_df.groupby(['month_year'])['dr'].sum().reset_index()
target_rd_mean_df['dr_runavg'] = target_rd_mean_df['dr'].expanding().mean()
target_rd_mean_df
```

Now we have a table that we can use to feed and visualize our graphs.
```python
dr_cummean_smooth_plt = (
    ggplot(target_rd_mean_df, aes(x = 'month_year', y = 'dr_runavg', group = 1)) +
    geom_line(alpha = 0.6, colour = 'blue', size = 2) +
    scale_y_continuous() +
    scale_x_date() +
    theme(legend_position = 'right',
          axis_text_x = element_text(rotation = 90, hjust = 1)))
dr_cummean_smooth_plt
```

This is quite interesting: the site appears to have started out by acquiring high-authority links (probably via a PR campaign launching the business).
The average then declined over four years before the site began gaining new high-authority links again.
Link Volume
It felt good just writing that headline.
Who wouldn’t want a ton of (good) links to their site?
Quality is one thing. The next thing to analyze is volume.
As with the previous operation, we use the expanding() function to compute the cumulative sum of the links acquired to date.
```python
target_count_cumsum_df = target_ahrefs_analysis
target_count_cumsum_df = target_count_cumsum_df.groupby(['month_year'])['rd_count'].sum().reset_index()
target_count_cumsum_df['count_runsum'] = target_count_cumsum_df['rd_count'].expanding().sum()
target_count_cumsum_df
```

That’s the data, now the graph.
```python
target_count_cumsum_plt = (
    ggplot(target_count_cumsum_df, aes(x = 'month_year', y = 'count_runsum', group = 1)) +
    geom_line(alpha = 0.6, colour = 'blue', size = 2) +
    scale_y_continuous() +
    scale_x_date() +
    theme(legend_position = 'right',
          axis_text_x = element_text(rotation = 90, hjust = 1)))
target_count_cumsum_plt
```

Link acquisition slowed at the beginning of 2017 but continued steadily over the next four years, before accelerating again around March 2021.
Again, it would be good to correlate this with search performance.
Going Further
Of course, the above is just the tip of the iceberg: a simple exploration of a single site. In a competitive search space, it is hard to know what will improve rankings without comparing against competitors.
Below are some areas for further data exploration and analysis.
- Adding social media share data for the destination URLs.
- Correlating overall site visibility with the running average DR over time.
- Plotting the distribution of DR over time.
- Adding search volume data for the hostnames, to see how many branded searches each referring domain receives as a measure of true authority.
- Joining with crawl data of the destination URLs to test for content relevance.
- Link velocity: the rate at which new links from new sites are acquired.
- Integrating all of the above ideas into your analysis to compare against competitors.
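As one example, link velocity from the list above can be sketched as new referring domains per month, reusing the month_year grouping pattern from earlier (the dates here are hypothetical stand-ins for the real first_seen column):

```python
import pandas as pd

# Toy first_seen dates standing in for target_ahrefs_analysis['first_seen']
first_seen = pd.to_datetime([
    '2021-01-05', '2021-01-20', '2021-02-03',
    '2021-02-10', '2021-02-25', '2021-04-01',
])
df = pd.DataFrame({'first_seen': first_seen, 'rd_count': 1})

# New referring domains gained per month = link velocity
velocity = df.groupby(df['first_seen'].dt.to_period('M'))['rd_count'].sum()
print(velocity)
```

Plotting this series over time (or its month-on-month change) would show whether link acquisition is speeding up or slowing down.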
I’m sure there are many more ideas that I haven’t listed above. Feel free to share below.