In our last article, we analyzed backlinks using data from Ahrefs.
This time, using the same Ahrefs data source for comparison, we included competitor backlinks in our analysis.
As before, we defined the SEO value of a site’s backlinks as a product of quality and quantity.
Quality is the domain authority (or Ahrefs equivalent domain rating) and quantity is the number of referring domains.
Again, we use the available data to assess link quality before assessing quantity.
Time to code.
```python
import os  # needed for listing the export files below
import re
import time
import random
import datetime as dt  # aliased for the link age calculation later
from datetime import timedelta

import pandas as pd
import numpy as np
import uritools
import matplotlib.pyplot as plt
from plotnine import *
from pandas.api.types import is_string_dtype, is_numeric_dtype

pd.set_option('display.max_colwidth', None)
%matplotlib inline
```
```python
root_domain = 'johnsankey.co.uk'
hostdomain = 'www.johnsankey.co.uk'
hostname = 'johnsankey'
full_domain = 'https://www.johnsankey.co.uk'
target_name = 'John Sankey'
```
Data import and cleaning
I have set up a file directory to read multiple Ahrefs exported data files in one folder. This is much faster, less tedious, and more efficient than reading each file individually.
Especially when there are 10 or more!
```python
ahrefs_path = 'data/'
```
You can use the OS module's listdir() function to list all the files in the subdirectory.
```python
ahrefs_filenames = os.listdir(ahrefs_path)
ahrefs_filenames.remove('.DS_Store')
ahrefs_filenames
```

The file names are now listed below:

```
['www.davidsonlondon.com--refdomains-subdomain__2022-03-13_23-37-29.csv',
 'www.stephenclasper.co.uk--refdomains-subdoma__2022-03-13_23-47-28.csv',
 'www.touchedinteriors.co.uk--refdomains-subdo__2022-03-13_23-42-05.csv',
 'www.lushinteriors.co--refdomains-subdomains__2022-03-13_23-44-34.csv',
 'www.kassavello.com--refdomains-subdomains__2022-03-13_23-43-19.csv',
 'www.tulipinterior.co.uk--refdomains-subdomai__2022-03-13_23-41-04.csv',
 'www.tgosling.com--refdomains-subdomains__2022-03-13_23-38-44.csv',
 'www.onlybespoke.com--refdomains-subdomains__2022-03-13_23-45-28.csv',
 'www.williamgarvey.co.uk--refdomains-subdomai__2022-03-13_23-43-45.csv',
 'www.hadleyrose.co.uk--refdomains-subdomains__2022-03-13_23-39-31.csv',
 'www.davidlinley.com--refdomains-subdomains__2022-03-13_23-40-25.csv',
 'johnsankey.co.uk-refdomains-subdomains__2022-03-18_15-15-47.csv']
```
Once the files are listed, use a for loop to read each file individually and append them to the dataframe.
While reading the file, use string manipulation to create a new column with the site name of the data you want to import.
```python
ahrefs_df_lst = list()
ahrefs_colnames = list()

for filename in ahrefs_filenames:
    df = pd.read_csv(ahrefs_path + filename)
    df['site'] = filename
    df['site'] = df['site'].str.replace('www.', '', regex=False)
    df['site'] = df['site'].str.replace('.csv', '', regex=False)
    df['site'] = df['site'].str.replace('-.+', '', regex=True)
    ahrefs_colnames.append(df.columns)
    ahrefs_df_lst.append(df)

ahrefs_df_raw = pd.concat(ahrefs_df_lst)
ahrefs_df_raw
```
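As an aside, the three chained str.replace calls could be collapsed into a single regular expression. This is just a sketch (site_from_filename is a hypothetical helper name, not part of the original script):

```python
import re

def site_from_filename(filename: str) -> str:
    # Drop an optional 'www.' prefix, then keep everything up to the
    # first hyphen (where the Ahrefs export suffix begins).
    return re.sub(r'^(?:www\.)?([^-]+).*$', r'\1', filename)

print(site_from_filename(
    'www.davidsonlondon.com--refdomains-subdomain__2022-03-13_23-37-29.csv'))
# davidsonlondon.com
```

Either approach yields the bare site name used as the grouping key for the rest of the analysis.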
Images by Ahrefs, May 2022
Now the raw data for each site has been combined into a single dataframe. The next step is to organize the column names to make them easier to work with.
You could use custom functions or list comprehensions to eliminate the repetition, but for novice SEO Pythonistas, it’s a good idea to see what’s going on step by step. As they say, “repetition is the mother of mastery” so practice!
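For instance, a small helper function plus a list comprehension could do all of the cleanup in one pass. A sketch (clean_column_name is my own hypothetical name, and the column names shown are illustrative, not the exact Ahrefs headers):

```python
import re

def clean_column_name(col: str) -> str:
    # Lowercase, turn spaces and dots into underscores, strip brackets
    # and percent signs, then collapse any doubled underscores.
    col = re.sub(r'[ .]', '_', col.lower())
    col = re.sub(r'[()%]', '', col)
    return re.sub(r'_+', '_', col)

print([clean_column_name(c) for c in ['Domain Rating', 'Dofollow Ref. Domains']])
# ['domain_rating', 'dofollow_ref_domains']
```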
```python
competitor_ahrefs_cleancols = ahrefs_df_raw
competitor_ahrefs_cleancols.columns = [col.lower() for col in competitor_ahrefs_cleancols.columns]
competitor_ahrefs_cleancols.columns = [col.replace(' ', '_') for col in competitor_ahrefs_cleancols.columns]
competitor_ahrefs_cleancols.columns = [col.replace('.', '_') for col in competitor_ahrefs_cleancols.columns]
competitor_ahrefs_cleancols.columns = [col.replace('__', '_') for col in competitor_ahrefs_cleancols.columns]
competitor_ahrefs_cleancols.columns = [col.replace('(', '') for col in competitor_ahrefs_cleancols.columns]
competitor_ahrefs_cleancols.columns = [col.replace(')', '') for col in competitor_ahrefs_cleancols.columns]
competitor_ahrefs_cleancols.columns = [col.replace('%', '') for col in competitor_ahrefs_cleancols.columns]
```
Having a count column and a single value column (“project”) is useful for grouping and aggregation operations.
```python
competitor_ahrefs_cleancols['rd_count'] = 1
competitor_ahrefs_cleancols['project'] = target_name
competitor_ahrefs_cleancols
```
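To see why these two columns help, here is a toy aggregation with made-up site names and rows:

```python
import pandas as pd

toy = pd.DataFrame({
    'site': ['a.com', 'a.com', 'b.com'],
    'project': ['John Sankey'] * 3,
    'rd_count': [1, 1, 1],
})

# Summing the constant rd_count column counts referring domains per site
per_site = toy.groupby('site')['rd_count'].sum()
print(per_site.to_dict())  # {'a.com': 2, 'b.com': 1}

# The single-value 'project' column rolls everything up to one row
# per project, handy when comparing several clients side by side
per_project = toy.groupby('project')['rd_count'].sum()
print(per_project.to_dict())  # {'John Sankey': 3}
```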

Now that the columns have been cleaned up, it’s time to clean up the row data.
```python
# Copy so the type-cleaned frame can be modified independently
competitor_ahrefs_clean_dtypes = competitor_ahrefs_cleancols.copy()
```
For the dofollow referring domains column, we replace the hyphens (Ahrefs' placeholder for missing data) with zeros and cast the data type to integer. This is repeated for linked domains as well.
```python
# dofollow referring domains
competitor_ahrefs_clean_dtypes['dofollow_ref_domains'] = np.where(
    competitor_ahrefs_clean_dtypes['dofollow_ref_domains'] == '-',
    0, competitor_ahrefs_clean_dtypes['dofollow_ref_domains'])
competitor_ahrefs_clean_dtypes['dofollow_ref_domains'] = competitor_ahrefs_clean_dtypes['dofollow_ref_domains'].astype(int)

# dofollow linked domains
competitor_ahrefs_clean_dtypes['dofollow_linked_domains'] = np.where(
    competitor_ahrefs_clean_dtypes['dofollow_linked_domains'] == '-',
    0, competitor_ahrefs_clean_dtypes['dofollow_linked_domains'])
competitor_ahrefs_clean_dtypes['dofollow_linked_domains'] = competitor_ahrefs_clean_dtypes['dofollow_linked_domains'].astype(int)
```
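A more defensive variant (a sketch, not from the original script) uses pd.to_numeric with errors='coerce', which handles any non-numeric placeholder rather than just the hyphen:

```python
import pandas as pd

# '-' stands in for Ahrefs' "no data" placeholder
s = pd.Series(['12', '-', '7'])
cleaned = pd.to_numeric(s, errors='coerce').fillna(0).astype(int)
print(cleaned.tolist())  # [12, 0, 7]
```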
The first_seen column indicates the date the link was first found. It can be used for plotting time series and deriving link ages. Convert it to date format using pandas' to_datetime function.
```python
# first_seen
competitor_ahrefs_clean_dtypes['first_seen'] = pd.to_datetime(
    competitor_ahrefs_clean_dtypes['first_seen'], format='%d/%m/%Y %H:%M')
competitor_ahrefs_clean_dtypes['first_seen'] = competitor_ahrefs_clean_dtypes['first_seen'].dt.normalize()
competitor_ahrefs_clean_dtypes['month_year'] = competitor_ahrefs_clean_dtypes['first_seen'].dt.to_period('M')
```
To calculate link_age, subtract the first seen date from today's date and convert the difference to a number.
```python
# link age in days: the integer cast yields nanoseconds,
# so divide by 3600 * 24 * 1e9 to get days
competitor_ahrefs_clean_dtypes['link_age'] = dt.datetime.now() - competitor_ahrefs_clean_dtypes['first_seen']
competitor_ahrefs_clean_dtypes['link_age'] = competitor_ahrefs_clean_dtypes['link_age'].astype(int)
competitor_ahrefs_clean_dtypes['link_age'] = (competitor_ahrefs_clean_dtypes['link_age'] / (3600 * 24 * 1000000000)).round(0)
```
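As an aside, pandas timedeltas expose a .dt.days accessor, which sidesteps the nanosecond arithmetic entirely. A sketch with made-up dates:

```python
import pandas as pd

first_seen = pd.Series(pd.to_datetime(['2021-01-01', '2022-01-01']))

# Timedelta between today (at midnight) and first_seen, directly in days
link_age_days = (pd.Timestamp.now().normalize() - first_seen).dt.days
print(link_age_days.tolist())  # values depend on today's date; the gap is always 365
```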
The target column helps distinguish the “client” site from the competitors, which will help with visualizing them later.
```python
competitor_ahrefs_clean_dtypes['target'] = np.where(
    competitor_ahrefs_clean_dtypes['site'].str.contains('johns'), 1, 0)
competitor_ahrefs_clean_dtypes['target'] = competitor_ahrefs_clean_dtypes['target'].astype('category')
competitor_ahrefs_clean_dtypes
```

Now that your data has been cleaned up in terms of both column titles and row values, you’re ready to set it up and start analyzing.
Link quality
We start with link quality, using Ahrefs' Domain Rating (DR) as the basis.
Let's examine the distributional properties of DR by plotting the distribution with the geom_boxplot function.
```python
# The analysis dataframe is the cleaned data from the previous step
competitor_ahrefs_analysis = competitor_ahrefs_clean_dtypes

comp_dr_dist_box_plt = (
    ggplot(competitor_ahrefs_analysis.loc[competitor_ahrefs_analysis['dr'] > 0],
           aes(x='reorder(site, dr)', y='dr', colour='target')) +
    geom_boxplot(alpha=0.6) +
    scale_y_continuous() +
    theme(legend_position='none',
          axis_text_x=element_text(rotation=90, hjust=1))
)
comp_dr_dist_box_plt.save(filename='images/4_comp_dr_dist_box_plt.png',
                          height=5, width=10, units='in', dpi=1000)
comp_dr_dist_box_plt
```

The plot compares the sites' statistical properties side by side, most notably the interquartile range, which shows where most of each site's referring domains fall in terms of domain rating.
John Sankey also has the fourth-highest median domain rating, which compares well with the quality of its links against those of the other sites.
William Garvey has the most diverse range of DRs compared with the other domains, suggesting slightly more relaxed criteria for link acquisition. Who knows?
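If you would rather read those medians as numbers than off the boxplot, a groupby does it. A sketch with made-up sites and DR values standing in for competitor_ahrefs_analysis:

```python
import pandas as pd

# Toy data standing in for the cleaned analysis dataframe
toy = pd.DataFrame({
    'site': ['a.com', 'a.com', 'b.com', 'b.com'],
    'dr': [10, 30, 40, 60],
})

# Median DR per site, highest first -- the same ordering the
# boxplot medians display
medians = (toy[toy['dr'] > 0]
           .groupby('site')['dr'].median()
           .sort_values(ascending=False))
print(medians.to_dict())  # {'b.com': 50.0, 'a.com': 20.0}
```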
Link volume
That’s quality covered. What about the volume of links from referring domains?
To tackle this, we use the groupby function to compute a running total of referring domains per month.
```python
competitor_count_cumsum_df = competitor_ahrefs_analysis
competitor_count_cumsum_df = competitor_count_cumsum_df.groupby(['site', 'month_year'])['rd_count'].sum().reset_index()
```
The expanding function grows the calculation window row by row, which is how the running total is achieved.
```python
# Group by site so each competitor accumulates its own running total
competitor_count_cumsum_df['count_runsum'] = (
    competitor_count_cumsum_df
    .groupby('site')['rd_count']
    .expanding().sum()
    .reset_index(level=0, drop=True)
)
competitor_count_cumsum_df
```
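As a quick sanity check on running totals, here is a toy example with made-up data. Note that the accumulation needs to happen within each site; an ungrouped expanding sum would let one site's total bleed into the next:

```python
import pandas as pd

toy = pd.DataFrame({
    'site': ['a.com', 'a.com', 'b.com', 'b.com'],
    'month_year': ['2021-01', '2021-02', '2021-01', '2021-02'],
    'rd_count': [3, 2, 5, 1],
})

# cumsum within each site group is equivalent to a per-site
# expanding-window sum: b.com restarts at 5, not 10
toy['count_runsum'] = toy.groupby('site')['rd_count'].cumsum()
print(toy['count_runsum'].tolist())  # [3, 5, 5, 6]
```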

The result is a dataframe with site, month_year, and count_runsum (the running sum), which is the perfect format for feeding the chart.
```python
competitor_count_cumsum_plt = (
    ggplot(competitor_count_cumsum_df,
           aes(x='month_year', y='count_runsum', group='site', colour='site')) +
    geom_line(alpha=0.6, size=2) +
    labs(y='Running Sum of Referring Domains', x='Month Year') +
    scale_y_continuous() +
    scale_x_date() +
    theme(legend_position='right',
          axis_text_x=element_text(rotation=90, hjust=1))
)
```
```python
competitor_count_cumsum_plt.save(filename='images/5_count_cumsum_smooth_plt.png',
                                 height=5, width=10, units='in', dpi=1000)
competitor_count_cumsum_plt
```

The plot shows the number of referring domains for each site since 2014.
I find it very interesting how differently the sites started acquiring links.
For example, William Garvey started with over 5,000 domains. I'd love to know who their PR company is!
You can also see the rate of growth. Hadley Rose, for example, started earning links in 2018 but really took off in mid-2021.
More, more, more
You can always do more scientific analysis.
For example, one direct and natural extension of the above is to combine both quality (DR) and quantity (volume) for a more holistic view of how the sites compare in terms of offsite SEO.
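As a sketch of what that could look like (the scoring formula and all names here are hypothetical, not from the original analysis), one could multiply each site's median DR by a log-damped referring-domain count so that neither factor dominates:

```python
import numpy as np
import pandas as pd

# Made-up per-link rows: one row per referring domain
toy = pd.DataFrame({
    'site': ['a.com', 'a.com', 'b.com'],
    'dr': [20, 40, 70],
    'rd_count': [1, 1, 1],
})

summary = toy.groupby('site').agg(median_dr=('dr', 'median'),
                                  ref_domains=('rd_count', 'sum'))

# Quality x quantity, with log1p damping the volume term
summary['offsite_score'] = summary['median_dr'] * np.log1p(summary['ref_domains'])
print(summary['offsite_score'].round(2).to_dict())
```

Here b.com's single high-DR link outscores a.com's two mid-DR links, which matches the intuition that quality and quantity should trade off rather than simply add up.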
Another extension would be to model the quality of the referring domains for both your site and your competitors' sites, to check whether link features (such as word count or the relevance of the linking content) can explain the difference in visibility between you and your competitors.
That model extension would be a great application for machine learning techniques.
Featured Image: F8 Studio/Shutterstock