Google’s Leak, Our Agency Review


Kyle Roof
June 18, 2024


The SEO world is buzzing. We've just been handed a rare, eye-opening peek into the systems supporting Google's two-trillion-dollar empire.

The leak that was heard around the world is easily the biggest yet from an SEO standpoint, and everyone who is anyone in the industry is obliged to share their opinions and conclusions.

In this post, we're diving into our interpretations, but let's be clear: this is speculative territory. We're making educated guesses based on incomplete data. This leak is more about guiding future SEO experiments than revolutionizing current strategies.

The May 2024 Google Leak, In Brief

To set the stage, here are the basics of “the leak”.

What: A massive trove of Google’s internal API documentation was leaked. This included over 2,500 pages and over 14,000 attributes in Google’s systems. It includes API reference documents and data structures.

Who: The leak was initially provided by an anonymous source (later revealed to be Erfan Azimi) and publicized most prominently by Rand Fishkin and Mike King.

When: The leak occurred between March and May 2024, with the documents briefly made public before being widely shared and analyzed.

Where: The documents were inadvertently made public on GitHub, and from there, they spread to the HexDocs platform and eventually into SEO communities.

Why: The best guess is that it was an accidental leak during the code review process. However, that hasn’t stopped speculation about whether it was even a leak at all or an intentional misinformation device.

What The Google Leak Is Not

As expected, this leak has attracted the attention of just about every active member of the SEO industry. And in many places it’s being described as a leak of Google’s search algorithm.

Contrary to sensationalist headlines, this isn't a leak of Google's search algorithm. Think of it as a recipe card listing "apple, sugar, flour, and spices" without the actual recipe for the apple pie.

It’s useful information. But it’s not enough to make a pie and certainly not as game-changing (yet) as some big names in the industry are making it out to be.

Second, the information in this leak is not only about search. The API modules in the documentation span many of Google’s business units, although most of them do seem to relate to search.

Third, the documentation is not a static snapshot of Google’s systems. In fact, it’s reasonable to assume that some things might be changed as a result of this leak.

Fourth, and lastly, the leak is not a nuclear bomb that will completely change the way SEO is done. There are interesting things in the documentation, but upon closer inspection, few of them are completely new. SEO testing groups, and we at High Voltage, have been working with most of this information for years now.

All that said, there is a lot of information in the documentation. Enough to make this leak a Rorschach test of sorts: it shows every person interpreting it what they want to see.

SEOs who specialize in local will see local signals, link-builders will make a ton of assumptions about links, and so on. It’s important to remember that the fields we see in this document are removed from the context of their respective functions, and we don’t (except in a handful of cases) have examples of what they are meant to contain. Referring to these fields as ranking factors isn’t accurate, but we can do so in the interest of having a high-level conversation.

Unpacking The Leak’s Potential

The documentation in this leak is organized into modules. Each module is a unit of related functions that have a specific task. The scope of each module depends on what that task is and all of the modules together make up the API.

There’s no guarantee these are all the modules, but there’s plenty here to keep the SEO industry busy for years to come.

Given the breadth of the information, the modules give us a good way to analyze the data. 

The PerDocData Module

To make this analysis manageable, we're zooming in on a single module: PerDocData. Why this one? The choice is partly arbitrary and partly because we find it especially interesting from an on-page SEO perspective.

Let’s start with some assumptions we’re making.

Given the name, we assume this module processes data about HTML documents (web pages) on a case-by-case basis. That is to say, this module tells the rest of the system important information about a single page. We’re also assuming that these attributes affect ranking — otherwise, there’s no reason to talk about them.

And we’re assuming that these attributes are dynamic and adaptive, so it makes sense to change pages, and those changes will affect the attributes’ values. In other words, we assume on-page SEO works!

Let’s start with a few attributes related to content quality. We’ll refer to these as factors going forward with all the introductory caveats in place.

Content Quality Factors

GibberishScore: There’s something refreshing about seeing a variable name that is so straightforward. 

It’s hard to deduce anything other than that Google’s systems have a way to score how close a text is to nonsensical language. But why would someone want to publish gibberish in the first place?

In all likelihood, this factor is put in place to combat spam. But if you’re using automatically translated content, there’s a chance the translation is low-quality enough to be considered nonsense.

OriginalContentScore: One of the most interesting factors in the entire leak. Maybe it refers to content that isn’t plagiarized; maybe it’s something more. If it goes beyond simply checking for duplicated words, what else would constitute originality? There are so many ways to interpret this, and it will probably lead to a lot of SEO tests.

To add to that, we have a short description of the field stating that it “only applies to pages with little content”. 

One way to interpret this is that if you want to give any page an advantage, make sure it’s not too similar to other pages. This could redefine how we approach unique content.

KeywordStuffingScore: An age-old “factor”, and seeing it spelled out like this makes a case for why SEO testing is so important. There’s a lot of nuance to keyword stuffing, not least of which is that it doesn’t happen in a vacuum.

Put another way, whatever percentage of keywords we think constitutes keyword stuffing, we’re probably wrong at any point in time. The real stuffing score is probably dynamic and adapts to documents that are available in the index. Finding out where the limit is means testing, and adapting strategies constantly.

PremiumData: This factor suggests that there is a “premium” category of documents in Google’s index. What makes a document premium is unclear at this point — paywalled? high PageRank?
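To make the discussion concrete, here’s a rough sketch of how these content-quality attributes could be represented. The field names come from the leaked PerDocData documentation, but the types, defaults, and comments are purely our assumptions — the leak doesn’t include real example values.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch only: field names are from the leak, but the types,
# ranges, and semantics here are our guesses, not Google's definitions.
@dataclass
class ContentQualitySignals:
    gibberish_score: Optional[int] = None         # how close the text is to nonsense
    original_content_score: Optional[int] = None  # "only applies to pages with little content"
    keyword_stuffing_score: Optional[int] = None  # likely dynamic, relative to the index
    premium_data: bool = False                    # membership in a "premium" document category

# A document with a low gibberish score and high originality (invented values)
page = ContentQualitySignals(gibberish_score=3, original_content_score=87)
```

Thinking of the module this way — a bundle of per-page scores handed to the rest of the system — is a useful mental model, even if the real data structures are certainly more complex.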

Spam Detection Factors

It shouldn’t surprise us that there are a lot of fields that store spam-related information. But it’s uncanny to see just how much effort is dedicated to this, and to know that spam content is still such a problem for Google.

DocLevelSpamScore: This overall spam score for the document is represented on a scale of 0 to 127, helping to identify whether the entire document is considered spammy. A high score would presumably indicate a higher likelihood of the document being spam.

It’s unclear exactly what constitutes spam for Google, but this dimension suggests that it’s a pretty broad spectrum, so any effort you make on the page level is probably time well spent, even if it only moves your spam score a few points.

Spamrank: We get two excellent insights in this factor’s description:

The spamrank measures the likelihood that this document links to known spammers. Its value is between 0 and 65535.

First is the obvious: we should be careful about what we link to from our content. The other insight is possibly much more valuable: there is such a thing as “known spammers”. As the leak is scrutinized more, we may figure out how these known spammers are identified.
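To illustrate the description above, here’s a toy version of a spamrank-style signal. The 0–65535 range and the “links to known spammers” definition come from the leak; everything else — the formula, the blocklist, the domains — is invented for illustration. Google’s actual computation is unknown.

```python
# Hypothetical illustration: scale the share of outbound links that hit a
# (made-up) known-spammer list into the 0-65535 range the leak describes.
def spamrank(outbound_links: list[str], known_spammers: set[str]) -> int:
    if not outbound_links:
        return 0
    spammy = sum(1 for url in outbound_links if url in known_spammers)
    return round(65535 * spammy / len(outbound_links))

links = ["site-a.example", "spam-farm.example", "site-b.example", "site-c.example"]
score = spamrank(links, known_spammers={"spam-farm.example"})  # 1 of 4 links is spammy
```

Even in this toy form, the practical takeaway holds: every outbound link to a flagged destination moves the score, so auditing what you link to is cheap insurance.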

IsAnchorBayesSpam: Another insight-heavy factor. It’s a true or false factor that classifies the page into the spam category with something called an “anchor Bayes classifier”.

From other parts of the leak, we can speculate this means that too many “spammy” anchor texts in links pointing to the page will qualify it as an anchor-spammy(?) document.

For one thing, that tells us that variety in anchor texts might have a positive impact. The other, maybe more important, takeaway is that Google is using Bayes classifiers (at least here). Bayes classifiers are effective at combating spam because they’re great at making predictions based on what has already happened. They’re not that great at classifying things that haven’t been seen previously.

Takeaway: if you don’t want to appear spammy, mix things up. The more you deviate from things that happened before, the less likely it is that you’ll end up in the spam bucket.

spamCookbookAction: Easily the most confusing factor we’ve found so far. Does Google really care about recipe content? Doubtful. It’s much more likely that ‘cookbook’ is the name of some internal system that’s used to identify spam.

Although, what if it really is about recipes and cookbooks? 🤔

Freshness Factors

We’ve long known in SEO that there is a “freshness” boost of some kind. Now, we’re closer to figuring out what that might mean in a practical sense.

lastSignificantUpdate and lastSignificantUpdateInfo: The big takeaway from these is that there’s probably a cutoff before a page is considered significantly updated. In time, that cutoff will be revealed in testing.

Until then, err on the side of caution when updating your content. You might want to do a bit more than you planned to make sure it’s worthy of the “significant” label.

semanticDate and semanticDateInfo: Very interesting when taken in context with other fields from Navboost-related modules in the leak. From the documentation:

estimated date of the content of a document based on the contents of the document (via parsing), anchors and related documents.

One way to interpret the existence of these fields is that including date information on your page in plain text could impact how fresh Google considers it. For instance, mentioning events, years, seasons, or days.

Maybe more important than the actual fields is the mention of freshness twiddlers in this module. Twiddlers are present in many parts of the documentation. Their exact purpose is unclear, but the best guess so far is that twiddlers are functions that adjust (twiddle?) and fine-tune search results after, or in unison with, big-impact factors like PageRank.

The presence of freshness twiddlers means that freshness does play a role in how a page ranks.

Miscellaneous Interesting Factors

The PerDocData module is full of factors that no one is really sure what to do about. As with the Yandex leak, we’ll still be discovering new things about this one for years. But in the meantime, there are some factors that are just interesting to think about.

commercialScore: The docs state that this is a measure of the “commerciality” of a page. That could mean selling something on a page puts it in a whole different category from other pages.

scienceDoctype: Scholarly and/or scientific pages seem to be put into a special bucket. It’s hard to say whether this is good or bad, but it’s not too great of a leap to speculate that having a high proportion of science docs might positively impact EEAT scores on a site.

ymylHealthScore and ymylNewsScore: YMYL pages are scored by ML classifiers somehow. It’s worth spending a lot of time figuring out how.

domainAge: Interesting because for the longest time, mainstream SEO has denied that domain age makes any difference at all. If it doesn’t matter (and maybe it doesn’t) why is the application wasting memory keeping track of it?

What Comes Next

With information of this magnitude, I believe we can expect SEO professionals to split into one of three categories going forward.

The first I’ll call the collective shruggers, and I think most SEOs will fall into it: they’ll look at the reporting on the leak, shrug their shoulders, and carry on like nothing much happened.

Frankly, that’s not a bad place to be. If what you’re doing is working, why waste energy changing it?

The second category is the overconfident optimists. These SEOs will dig into the documentation, take a lot of it at face value, and confidently start devising “leak-based” strategies for their clients and themselves. Some of these might actually work, but my guess is that most will not perform any better than traditional SEO.

The third category will pragmatically start testing insights from the leak. This is the category that we will fall into at High Voltage, and ultimately the one I think will stand to gain the most from the leak.

This article just scratched the surface of the surface of one module of the leak. We’ll keep exploring interesting parts of the leak documentation in future posts, so sign up for our newsletter to get notified when that happens.

About the author 

Kyle Roof

Kyle is responsible for the development and implementation of all SEO techniques used by the SEO agency High Voltage SEO and the SEO tool PageOptimizer Pro. Kyle is also the co-founder of Internet Marketing Gold, a global community of 3000+ SEO professionals who test and prove cutting-edge SEO techniques, and co-host of SEO Fight Club, a weekly YouTube show that covers a multitude of SEO topics. Kyle’s SEO techniques and discoveries are followed by many SEO professionals and business leaders; he has been featured in many respected publications and is a regular speaker on SEO and SEO testing at conferences throughout the world.
