Using machine learning and a random forest classifier as a step toward more efficient SEO management.
Four years ago, with the algorithm codenamed Penguin, Google introduced a targeted method of identifying websites that have intentionally engaged in unnatural link building tactics or unintentionally fallen victim to inclusion in webspam indices. When this algorithm (distinct from Google’s core search) is run periodically, such websites are hit with a ranking penalty that lowers their position in SERPs. From an SEO standpoint, Penguin has motivated webmasters to conform to Google’s quality guidelines by analyzing their sites’ inbound links and disavowing any disreputable domains.
The SEO industry is now preparing for another significant update: with the imminent rollout of Penguin 4.0, Google is expected to apply search ranking penalties for spammy inbound links in real time. Whereas periodically auditing inbound links and disavowing disreputable domains used to be merely a good practice, it is about to become something webmasters must stay on top of continuously.
Here at Adpearance, we provide SEO management as part of our full array of digital marketing solutions. Knowing firsthand how inefficient and time-consuming manual inbound link audits can be, we asked the question:
Can we automatically detect when a website experiences an influx of spammy links pointing to it?
If so, we would know exactly when and what sites are at risk of an undesirable dip in Google search rankings due to real-time Penguin penalties, and could react accordingly by immediately updating those sites’ domain disavowal files in Google Search Console.
Google is understandably tight-lipped about the particulars of its webspam classification algorithm. However, as SEO professionals and internet marketing experts, we have a good idea of what a disreputable domain looks like. We can often tell at a glance whether a page contains relevant, reputable content or, for example, a long list of outbound links designed to manipulate search rankings. The visual check is quick and easy; it’s just that these “glances” add up when there are thousands upon thousands of pages to review.
In order to speed up the process of inbound link monitoring and analysis, we are teaching a computer to do the work for us. Based on the results of our past inbound link audits, we have compiled a database of URLs that are categorized as either “spam” or “not spam.” A rudimentary first step in computer-assisted spammy link detection would be to check if a new link comes from a domain that is already in our database; if so, this means we already know its reputability, so we can appropriately tag it as “spam” or “not spam” and move on to the next new link. This can certainly save some time, as there are numerous well-known spam domains that show up repeatedly in inbound link analyses. But what about all the links/domains that we’ve never encountered before? This is where machine learning comes into play.
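That first lookup step is straightforward to sketch in code. Here is a minimal illustration in Python, assuming a hypothetical `known_domains` mapping stands in for the audit database described above (the domain names and labels are made up for the example):

```python
from urllib.parse import urlparse

# Hypothetical stand-in for the audit database of previously
# categorized domains; real entries would be loaded from storage.
known_domains = {
    "reputable-news-site.com": "not spam",
    "cheap-links-4u.net": "spam",
}

def tag_known_links(urls):
    """Split new inbound links into already-categorized and never-seen."""
    tagged, unknown = {}, []
    for url in urls:
        # Normalize to the bare domain before the lookup.
        domain = urlparse(url).netloc.lower().removeprefix("www.")
        if domain in known_domains:
            tagged[url] = known_domains[domain]  # reuse the prior verdict
        else:
            unknown.append(url)  # never encountered; needs further review
    return tagged, unknown
```

Links that fall into the `unknown` bucket are the ones that still require human (or machine-learned) judgment.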
Rather than simply comparing a new list of links against the existing list in our database, we use that existing list as a training set for a random forest classifier that makes a logical prediction about whether the links in the new list are spammy or not. Forcing a computer to rely on visual cues in judging the reputability of a given URL would involve prohibitive computational overhead, so instead we provide it with a list of features or metrics corresponding to each URL (such as Moz Domain Authority) that we believe to be correlated with reputability. It’s not necessary to know in advance how strong such a correlation is – one of the beauties of machine learning is that the computer figures this out on its own!
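To make the idea concrete, here is a minimal sketch of that workflow using scikit-learn’s `RandomForestClassifier`. The feature values are invented for illustration (e.g. a domain authority score, a spam score, and an outbound link count); the real training set would be built from past audit results as described above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative training data: one row of metrics per audited URL, e.g.
# [domain_authority, spam_score, outbound_link_count]. Values are made up.
X_train = np.array([
    [55, 2, 40],      # reputable domains: higher authority, modest link counts
    [70, 1, 25],
    [60, 3, 30],
    [5, 80, 900],     # spammy domains: low authority, huge outbound link lists
    [8, 65, 1200],
    [3, 90, 1500],
])
y_train = np.array(["not spam", "not spam", "not spam",
                    "spam", "spam", "spam"])

# The forest learns which features correlate with reputability on its own.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Predict labels for previously unseen domains.
X_new = np.array([[65, 2, 35], [4, 75, 1100]])
predictions = clf.predict(X_new)
```

On this toy data the classifier flags the low-authority, link-heavy domain as spam and the high-authority domain as not spam, which is exactly the kind of automated first pass that replaces the manual “glance.”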
Are you ready for the real-time Penguin update? Contact us to learn more about Adpearance’s innovative approach to SEO solutions.