It’s out in the open. Internal documentation for Google’s Content Warehouse API has leaked online. We don’t know whether everything here is still relevant, because there are some modules and attributes marked as deprecated. But what we do know is that this was part of Google’s internal documents at one point or another.
This will be my analysis of various modules and attributes that caught my eye. I won’t cover everything, but I will cover many of the ones I find interesting and potentially impactful.
If you’ve been following best practices with your SEO, you may find yourself saying “great, cool, I’m doing that” and maybe a “well, I didn’t learn anything new.” That’s a good thing! It means much of what you’re already doing is focused on serving up high-quality content to your users.
I will continue to update the article with newer insights as I go down the rabbit hole. There are many NLP-related modules and attributes that I’ve barely touched.
The version I’m looking at is the Google API Content Warehouse version 0.4.0. It looks like it was actually leaked onto GitHub on March 13, 2024 by yoshi-code-bot, and an anonymous ex-Googler discovered the documents.
The first two to publicly report on this were Mike King at iPullRank and Rand Fishkin at SparkToro. They offer a lot of history and insights, especially Rand’s history of user signal tests that Google staff would criticize and say were “made up crap.” We also now know who gave the leaked documents to Rand Fishkin – Erfan Azimi put out a public statement.
Disclaimer + AI Warning When Reading Other Analyses
There are 2,596 modules and 14,014 attributes, but that doesn’t mean they’re all related to ranking. The ones that do reference ranking and indexing systems are the more relevant ones.
The exact weight and current usage of these ranking features are still a mystery. But what we do know is that they were relevant enough to be included in the documentation at some point.
I also want to give a fair warning about reading anything on this documentation, whether from me or anyone else. Use your own brain and analysis to determine what you think is and isn’t relevant.
Now that we have ChatGPT, Claude, and other chatbots, many people will try to put this document into a GPT and expect to get everything they need.
Not so fast.
I like to think I’m pretty good at prompt engineering, having built out many internal tools for me and my team. It takes a lot of trial and error to get responses the way you want them, including being factual and free of AI hallucinations.
One of the first things I did, of course, was try to use AI to summarize everything for me. It did do that, but it made many assumptions that were just plain wrong, especially when it came to variables and acronyms. I know they were wrong because they were referenced in other sections, and the exact definitions or the context didn’t match what the AI gave me back.
So be very careful if you’re using AI to analyze the modules and attributes, especially if you give it a simple, generic command like “summarize this documentation for me.”
That’s one of the reasons I laid out my information the way I did below. I have bullet points for the attributes and the description is what was in the documentation. Not verbatim, but cleaned up to be more easily understandable.
Any thoughts and inferences that I make are in italics in the bullet points or in a paragraph right after the bullets.
I don’t make broad conclusions without having gone through all of the documentation, although I’ll lean one way or another. So take it for what it is: a living document.
1. CompressedQualitySignals Module – Page-Level
The CompressedQualitySignals module is designed to house a set of compressed signals for documents (aka webpages), primarily used within two Google systems: Mustang and TeraGoogle.
- Mustang – This is the primary system that scores, ranks, and serves search results. The underlying ranking algorithm is Ascorer.
- TeraGoogle – This is an indexing system that stores data for a large number of documents on flash storage for efficient retrieval.
These signals are critical for preliminary scoring of documents and are stored in a limited memory environment to optimize for data storage and retrieval.
Key Attributes and Their Purposes
There are many attributes here that are part of the compressed quality signals for documents. Here are the more relevant attributes organized into quality signal groups:
Quality Enhancement and Demotion Signals
- babyPandaV2Demotion: Applies penalties (demotions) on top of the Google Panda algorithm, which assesses content quality. This demotion targets pages with poor-quality content to lower their search ranking.
I couldn’t see any other clear indications in the documentation on how quality is determined for Baby Panda.
- authorityPromotion: No explanation other than “converted from QualityBoost.authority.boost.”
Best guess is it enhances search visibility for pages that are recognized as having high authority, thereby boosting their ranking. I haven’t found how ‘authority’ is calculated, but it wouldn’t be surprising if it’s links and topical authority through coverage.
All the others with QualityBoost or other boosts are Twiddler functions: the core ranking algorithm gives a ranking (say, position 5), and these Twiddlers re-rank it (possibly to position 2 or 10). There’s a lot more on Twiddlers and Ascorer here from Julian Redlich.
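To make the re-ranking idea concrete, here’s a minimal sketch of how a twiddler-style pass might work. This is purely illustrative – the function names and multipliers are my own inventions, not Google’s implementation:

```python
# Illustrative sketch only: twiddlers adjust scores from the core ranker
# (Ascorer) before results are served. All names and values are invented.

def apply_twiddlers(ranked_docs, twiddlers):
    """ranked_docs: list of (doc, base_score) pairs from the core ranker.
    twiddlers: callables returning a score multiplier for a doc."""
    rescored = []
    for doc, base_score in ranked_docs:
        score = base_score
        for twiddle in twiddlers:
            score *= twiddle(doc)
        rescored.append((doc, score))
    # Re-rank by adjusted score: a page at position 5 can move to 2 or 10.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

# Hypothetical boost/demotion twiddlers:
authority_promotion = lambda doc: 1.2 if doc.get("high_authority") else 1.0
baby_panda_demotion = lambda doc: 0.7 if doc.get("low_quality") else 1.0
```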
- exactMatchDomainDemotion: Reduces the search ranking of domains that precisely match search terms but may lack relevant or quality content.
This is funny because in all my tests up until May 2024, EMDs still work very well. But it could be that EMDs worked even better before this demotion was in place. It could also be that last bit “but may lack relevant or quality content.” I’ve never tried an EMD with low-quality content, like pure 100% AI. That’s just a waste of time to me because we know it’s not going to last.
I wonder how much a brand name works into this too. If there’s a brand name (e.g., Yoyao) and the EMD domain (yoyao.com) isn’t showing up, my guess is it’s a manual penalty. We saw many of these after the March 2024 Core Update. HouseFresh.com wasn’t showing up for “House Fresh,” and it took the Google team intervening for their site to show back up.
- lowQuality: A low quality score that’s converted from quality_nsr.NsrData.
Unclear what NSR is exactly, but Mike King suggests “Neural Semantic Retrieval,” which makes sense. This would be a semantic score, which is exactly what we’d expect in a semantic search algorithm.
Product Review Signals
There are a lot of product review attributes here, but not much description. Makes sense there are many attributes since there was a Product Reviews Update.
- productReviewPPromotePage: No explanation
- productReviewPDemoteSite: Product review demotion/promotion confidences are multiplied by 1000 and floored, which means rounded down. If the result is 20.123, it’s rounded down to 20.
Looks like specific updates like Product Reviews and Helpful Content Update (HCU) used multipliers to promote or demote sites. What’s interesting is the 1000x. One plausible reading (see the sketch after this list) is that it isn’t a weight at all, but fixed-point encoding: packing a 0–1 confidence into a small integer, which fits a module literally named CompressedQualitySignals.
- productReviewPUhqPage: Identifies the possibility of pages being a high-quality product review page.
- productReviewPReviewPage: Marks a page as potentially being a product review, which can influence its ranking based on the quality of content.
- productReviewPPromoteSite: No explanation.
- productReviewPDemotePage: No explanation.
- Previously used for urgent scoring adjustments in search rankings, now deprecated.
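On the 1000x question: multiplying a 0–1 confidence by 1000 and flooring is a standard fixed-point trick for storing a float as a small integer. That’s my assumption, not something the docs confirm, but here’s a minimal sketch of the arithmetic:

```python
import math

def encode_confidence(confidence: float) -> int:
    """Pack a confidence in [0.0, 1.0] into an integer in [0, 1000].
    E.g. 0.020123 * 1000 = 20.123, floored to 20."""
    return math.floor(confidence * 1000)

def decode_confidence(encoded: int) -> float:
    """Recover the confidence to three decimal places."""
    return encoded / 1000.0

assert encode_confidence(0.020123) == 20
assert decode_confidence(20) == 0.02
```

If that reading is right, the 1000x is about saving memory, not about weighting a signal 1000 times more heavily.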
Scam Signals
- scamness: Scam model score. Probably the likelihood of a webpage being a scam. One of the web page quality qstar signals (not clear what the qstar is).
- unauthoritativeScore: Score that’s also one of the web page quality qstar signals.
Reddit and Quora Score
- ugcDiscussionEffortScore: Quantifies the effort and quality of discussion on user-generated content pages, scaled up by 1000.
Another 1000x multiplier. As with the product review confidences above, this may be fixed-point encoding rather than extra weighting – there are hundreds of ranking factors, and weighting one of them 1000 times more would be odd, while scaling a float by 1000 to store it as an integer would not.
One of the most used low-competition keyword strategies a couple of years ago was looking for Reddit, Quora, and other forums ranking on page 1. That strategy went out the window when forums started ranking higher, as early as the May 2022 Core Update from what I saw.
Craps Signals
It’s not clear what “Craps” is exactly, but looking through all the different references, it’s at least related to Navboost and click signals. These are all attributes in the CompressedQualitySignals module, but there’s no clarity on the exact meanings yet:
- crapsNewUrlSignals
- crapsAbsoluteHostSignals
- crapsNewHostSignals
- crapsNewPatternSignals
Topical Relevance Data
Google is clearly using vectors and embeddings for page and site scoring. More on this below in the QualityAuthorityTopicEmbeddingsVersionedItem Module section.
- topicEmbeddingsVersionedData: Employs versioned topic embeddings to enhance the relevance and quality assessment of documents.
Page Quality Signals
I’m guessing that “pq” stands for page quality, based on context and another QualityNsrPQData module (next section) that refers to page-level quality signals.
- pqData: These are page-level signals representing page quality.
- pqDataProto: These are stripped page-level signals that aren’t in pqData.
2. QualityNsrPQData Module – Page-Level
The QualityNsrPQData module is interesting. One reason is the number of attributes that are food-related: chard, keto score, rhubarb signals, and tofu. The other reason is the possible page quality signals that are in non-food vocabulary.
- contentEffort – This says “LLM-based effort estimation for article pages.”
- deltaAutopilotScore – No description.
- urlAutopilotScore – No description.
So many possibilities to guess the meaning here. Is it referring to a check of whether the page content was possibly created by an LLM, which then assigns a contentEffort score? This is the only reference I could find in the documentation, so this is up in the air.
There’s a siteAutopilotScore (defined as “Aggregated value of url autopilot scores for this sitechunk”) in the QualityNsrNsrData module, so the Autopilot means something. Automated sites?
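Back on contentEffort: if it really is “LLM-based effort estimation,” one way to picture it is an LLM judging an article against an effort rubric. This is pure speculation – the rubric and the llm_score callable below are hypothetical stand-ins, not anything from the docs:

```python
# Speculative sketch of "LLM-based effort estimation for article pages".
# llm_score is a hypothetical stand-in for whatever model is actually used.

EFFORT_RUBRIC = (
    "Rate from 0 to 1 how much human effort this article reflects: "
    "original research, firsthand detail, and depth beyond what a "
    "template or boilerplate generator would produce."
)

def content_effort(article_text: str, llm_score) -> float:
    """llm_score: callable(rubric: str, text: str) -> float in [0, 1]."""
    return llm_score(EFFORT_RUBRIC, article_text)
```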
- deltaLinkIncoming – No description.
- deltaLinkOutgoing – No description.
- linkIncoming – No description.
- linkOutgoing – No description.
- numOffdomainAnchors – This stores the “total number of offdomain anchors seen by the NSR pipeline for this page.” Not clear if this is incoming or outgoing, but my best guess is incoming since there are a number of incoming Offdomain Anchor references with NSR.
There are a number of internal and external link attributes here that show link ratios are a possible consideration, as well as changes in the number of incoming and outgoing links. None of those attributes have clear descriptions though.
There’s a possibility that these are only referring to the internal links that are incoming and outgoing to other pages on the same domain.
With all the delta, contentEffort, AutopilotScore, and link attributes, I wouldn’t be surprised if the QualityNsrPQData module is part of how spam, link farms, and PBNs get flagged on the content-quality side.
3. QualityAuthorityTopicEmbeddingsVersionedItem Module – Page and Site-Level
There’s limited information on this module, but it’s clear that page embeddings are compared to site embeddings to see how on-topic or off-topic a page is relative to the site. (A sketch of one plausible computation follows the attribute list.)
- pageEmbedding: No description of its own, but the siteEmbedding description below covers both site and page embeddings.
- siteEmbedding: Represents compressed site/page embeddings. This likely means that the attribute stores a condensed representation of the site or page’s content characteristics, which are mapped into an embedding space for various applications such as similarity measurement or clustering.
- siteFocusScore: A numerical value indicating the degree to which a site concentrates on a single topic. A higher score could signify a greater focus, making the site potentially more authoritative on that topic.
- siteRadius: Measures how far page embeddings deviate from the site embedding. This could be used to determine the variability or diversity of content topics within a single site.
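The leak gives no formulas, but a centroid-and-distance approach is one plausible way these values relate. A minimal sketch, assuming unit-normalized page embeddings (all of this is my assumption, not from the docs):

```python
import numpy as np

def site_signals(page_embeddings: np.ndarray):
    """page_embeddings: (num_pages, dims) array of unit-normalized vectors."""
    # Site embedding as the normalized centroid of its page embeddings.
    site_embedding = page_embeddings.mean(axis=0)
    site_embedding /= np.linalg.norm(site_embedding)

    # siteFocusScore: average cosine similarity of pages to the centroid.
    # A tighter topical cluster gives a higher focus score (assumed).
    site_focus_score = float((page_embeddings @ site_embedding).mean())

    # siteRadius: how far page embeddings deviate from the site embedding.
    site_radius = float(
        np.linalg.norm(page_embeddings - site_embedding, axis=1).max()
    )
    return site_embedding, site_focus_score, site_radius
```

Under this reading, a page whose embedding sits far outside the site’s radius looks off-topic, which is exactly the pruning argument in the takeaways below.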
There are a number of things I take away from this:
- Use topical maps to stay on-topic and organized with content, so you continue building topical authority – or siteFocusScore.
- Pruning content on random topics would be a good idea. You know those random low-competition 10+ word keywords?
- Tie it all together. When writing content, you want to tie things back to the site and why this is content that should be written.
4. PerDocData Module – Page and Site-Level
PerDocData is a protocol buffer used both in indexing and serving within Google systems, specifically in the search context. The documentation details various fields contained in the protocol, each used for different aspects of document processing and retrieval.
There are many attributes here, like a lot. I don’t know how much time I spent going through these and following the trail to other modules and attributes referenced. I may jump a little here with each, but it’s worth it.
I only added ones here that I thought are relevant and interesting.
Content, Content Classification and Metadata
- OriginalContentScore: The original content score is stored on a scale from 0 to 127 for pages with little content, even though the docs say actual original content scores range from 0 to 512. So the stored value appears to be a compressed version of the raw score.
Be original. The information gain concept has gained traction the last couple of years along with topical authority. For good reason too. It’s how we stand out with our content.
- datesInfo: Stores information related to the dates of a document, useful for determining document freshness.
- freshboxArticleScores: Contains scores from various freshness-related classifiers, including freshbox article score, live blog score, and host-level article score. These scores assess the timeliness and relevance of content.
- freshnessEncodedSignals: (Deprecated) This attribute stores encoded data concerning the freshness and aging of content, using time-related quality metrics derived from patterns observed in URLs.
- lastSignificantUpdate: Records the timestamp of the last significant content update.
- semanticDate: Estimates the content’s relevant date using document parsing, anchors, and related documents.
- shingleInfo: There’s no description here, but a little digging suggests shingle might be a content-age-related system. Worth noting that ‘shingling’ is also a classic near-duplicate-detection technique in information retrieval, which may be the more likely reading.
Content freshness is important, so regular updates are probably a good idea with that many attributes related to freshness.
- smartphoneData: Stores metadata for documents optimized for smartphone display.
This leads to SmartphonePerDocData, which contains the isSmartphoneOptimized attribute that checks to see if a page is rendering on a smartphone. It does say to also see “go/modena-ranking,” so I would say there’s a ranking factor there.
There’s also a violatesMobileInterstitialPolicy that suggests a page demotion if the page is violating the mobile interstitial policy. This one says to see “go/interstitials-ranking-dd” for more information. Ironic, given Google AdSense is always suggesting interstitial ads.
One more, adsDensityInterstitialViolationStrength indicates if a page is violating mobile ads density interstitial policy and the strength of the violation.
- fringeQueryPrior: The documentation says it’s mainly for internal use by the fringe-ranking team, but they make a special note: “do NOT use directly without consulting them.”
An interesting one because of that special note. It leads to the QualityFringeFringeQueryPriorPerDocData module, which contains a number of attributes referring to low-quality fringe content, including scores for potential hoaxes and poor translations.
These are other content-related attributes.
- contentAttributions: References a module that stores an attribution when one page credits another page. These are the external links you add when crediting other sites – your sources.
- eventsDate: This states that it will only capture one date (start date) for an event. Even if there are multiple dates for an event, only the start date will be taken.
- homepagePagerankNs: Contains the PageRank of the site’s homepage.
- inNewsstand: This indicates whether a document should be displayed or prioritized in news-related search queries.
- localizedCluster: This attribute manages how documents are grouped based on translated or localized pages to better serve different geographical or linguistic audiences.
- mediaOrPeopleEntities: Identifies the 5 most topical entities for media or individuals related to the content. Info is used on image search.
- MobileData: Stores metadata for lowend mobile documents in the Google index. This leads to MobilePerDocData, but nothing of note with 2 out of 3 attributes being deprecated.
- onsiteProminence: Measures the relative importance of a document within its own site, calculated by analyzing simulated traffic flows from significant site pages like the homepage and high craps click pages. Not sure why ‘simulated’ traffic. Either way, it’s clear Google scores a site’s pages.
- originalTitleHardTokenCount: The number of hard tokens in the title. Not 100% clear on what the ‘hard’ means, but in combination with the attribute below, it could refer to the unique words in the title.
- titleHardTokenCountWithoutStopwords: Counts the hard tokens in a title after removing common stop words (‘the’, ‘a’, ‘an’, and so on), which gives a clearer measure of the content-specific words used in the title. A small sketch of this reading follows the list.
- PremiumData: Contains additional metadata for indexed Premium documents. This attribute refers to the PremiumPerDocData type that contains information on the price of the paywall and the publication name.
- scienceDoctype: Indicates the visibility of a science document. A value less than 0 means it is not a science document, 0 means it is a fully visible science document, and a value greater than 0 indicates a science document with limited visibility.
- topPetacatTaxId: Used to identify the top-level category of the site, which aids in query/result matching in SiteboostTwiddler.
- ymylHealthScore: Assigns scores to documents based on the quality and reliability of health information.
- ymylNewsScore: Holds scores from a YMYL news classifier for evaluation of the trustworthiness and quality of news content.
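Here’s the promised sketch of the hard-token reading above. The stop-word list and the ‘hard tokens = unique words’ interpretation are both my guesses:

```python
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}  # assumed small list

def hard_token_counts(title: str):
    """Returns (unique token count, count after stop-word removal)."""
    tokens = title.lower().split()
    unique_tokens = set(tokens)  # reading "hard tokens" as unique words
    without_stopwords = unique_tokens - STOPWORDS
    return len(unique_tokens), len(without_stopwords)

# "The Best of the Best Guide to SEO" -> (6, 3)
print(hard_token_counts("The Best of the Best Guide to SEO"))
```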
These attributes below don’t have any descriptions, but they’re rather interesting nonetheless. Like four pagerank attributes.
- bodyWordsToTokensRatioTotal: No description.
- pagerank: Deprecated and no description.
- pagerank0: No description.
- pagerank1: No description.
- pagerank2: No description.
Spam and Quality Scores
There are references to Google’s SpamBrain, their AI-based spam-prevention system that’s updated regularly.
- DocLevelSpamScore: Document spam score, also ranging from 0 to 127.
- GibberishScore: Measures the likelihood that the content of a document is nonsensical or “gibberish.” It’s scored on a scale from 0 to 127.
- KeywordStuffingScore: This score quantifies the extent of keyword stuffing within a document on a scale from 0 to 127.
- spambrainData: This attribute contains data from SpamBrain, Google’s AI-powered spam detection system. It includes host-v1 sitechunk level scores, which assess the likelihood of spam at a specific segment or chunk of a hosting domain. These scores help identify and filter out spammy content more effectively by analyzing it at the structural level of website hosting.
- spambrainTotalDocSpamScore: Represents the overall spam score for a document as determined by SpamBrain. The score ranges from 0 to 1.
- spamCookbookAction: Actions that are based on Cookbook recipes.
The spamCookbookAction doesn’t look to be about actual food recipes. It looks to be a collection of Actions taken when it matches something on the page. These ‘recipes’ are probably rules/triggers and when one is found on the page, an action is executed. The only instance of a Cookbook recipe I could find is CATfish tags that are attached to links.
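Reading ‘recipes’ as rule/action pairs, a toy version might look like the following. Everything here – the trigger, the action names – is invented to illustrate the pattern, not taken from the docs:

```python
# Toy illustration of a "cookbook" of rules: each recipe pairs a trigger
# (a predicate over the page) with an action to execute on a match.
COOKBOOK = [
    (lambda page: page.get("hidden_links", 0) > 10, "demote"),  # invented rule
    (lambda page: "catfish_tag" in page.get("link_tags", []), "flag_links"),
]

def spam_cookbook_actions(page: dict) -> list[str]:
    """Return the actions whose recipe triggers match this page."""
    return [action for trigger, action in COOKBOOK if trigger(page)]
```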
- spamMuppetSignals: Contains signals related to hacked site detections. I do like the ‘Muppet’ naming convention.
- spamrank: This score assesses the likelihood that a document is associated with known spammers, using a broad scale from 0 to 65535. A higher spamrank indicates a stronger association with spam activities, which might influence the document’s visibility or handling in search results.
- spamtokensContentScore: Used in SiteBoostTwiddler to determine whether a page is considered UGC spam based on content scores.
- ScaledSpamScoreYoram: A spam score from 0 to 127. No other details.
- SpamWordScore: It’s a spamword score on a scale from 0 to 127. No other details, but it could be referring to commonly used words and phrases, and their frequency.
- trendspamScore: Represents the count of matching trendspam queries. Not clear on what “trendspam queries” are.
- uacSpamScore: Represents a spam score within a 0 to 127 range. A score of 64 or above is considered indicative of spam.
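A pattern worth noting in these spam attributes: most scores fit a 0–127 (7-bit, single-byte) range, spamrank uses 0–65535 (16-bit), and spambrainTotalDocSpamScore stays in 0–1. A small sketch of reading such quantized scores (the 64 threshold for uacSpamScore is from the docs; the byte-size reasoning is my assumption):

```python
def is_uac_spam(uac_spam_score: int) -> bool:
    """Per the docs, a uacSpamScore of 64+ (on the 0-127 scale) indicates spam."""
    return uac_spam_score >= 64

def normalize_7bit(score: int) -> float:
    """Map a 0-127 stored score back to [0, 1]. Keeping each signal to a
    single byte would explain why 0-127 recurs (my assumption)."""
    return score / 127.0

def normalize_16bit(score: int) -> float:
    """spamrank uses the wider 0-65535 (16-bit) scale."""
    return score / 65535.0
```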
Authorship Information
- authorObfuscatedGaiaStr: List of obfuscated Google profile IDs of the authors.
Even though there’s only this one attribute for authors, there are many other modules that cover authors – just not here explicitly that I see. I’ve seen a number of patents that cover authorship and connecting authors to documents.
One thing I’ve been doing for some time now is to ensure that the authors and the articles tagged to them have a similar style. This is something I feel is important because of ghost writers and AI. There’s a consistency in writing style, grammar usage, spelling, etc.
Authors are entities, so getting citations and references for them is also important. With clients, I will usually ensure they have their social media profiles built, PR, articles, etc. spread out across the web where possible.
I’ll cover more about Authors as I come across references to them.
Site and URL Information
- blogData: There’s no description, but I followed it to a BlogPerDocData module designed to handle additional data specifically for blog posts or microblogs (like tweets). There are many attributes there, so will update in future.
- commercialScore: Scores how commercial a document is, essentially whether its intent is to sell something. Probably part of the Product Reviews Update work to distinguish informational content from commercial content.
- hostAge: Refers to the earliest date a host or domain was first seen. It states “These data are used in twiddler to sandbox fresh spam in serving time” and “If this url’s host_age == domain_age, then omit domain_age.”
Let’s talk about the sandbox first. There’s a sandbox. Done.
The intriguing thing here for me is also the “host_age == domain_age.” If you change hosts, then the host age resets. It’s not clear whether or not you would “serve time,” even if for a short period while the systems crawl through your site to see if there are other significant changes.
And of course, it looks to be an additional data point for aged domains vs. expired domains, because the age resets completely if the domain expires. With an aged domain that never dropped, the domain_age doesn’t change.
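A minimal sketch of the two documented behaviors – the redundancy rule and the fresh-spam sandbox. The structure and the 90-day threshold are invented for illustration; only the host_age/domain_age rule itself is from the docs:

```python
def host_age_payload(host_age: int, domain_age: int) -> dict:
    """Per the docs: 'If this url's host_age == domain_age, then omit
    domain_age' -- presumably to skip storing a redundant value."""
    payload = {"host_age": host_age}
    if domain_age != host_age:
        payload["domain_age"] = domain_age
    return payload

def sandbox_fresh_spam(host_age_days: int, threshold_days: int = 90) -> bool:
    """Sketch of 'sandbox fresh spam in serving time': very new hosts get
    extra scrutiny. The 90-day threshold is purely illustrative."""
    return host_age_days < threshold_days
```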
- fireflySiteSignal – Contains site information for Firefly ranking changes.
I followed Firefly to QualityCopiaFireflySiteSignal and noticed a number of site-level attributes there. I cover them below. There are attributes for good clicks, impressions during a Boosted Period, attributes for content velocity, latest byline dates, and more.
- nsrIsVideoFocusedSite – (Deprecated) This indicates whether a site is video-focused, but not focused on any major video hosting domains. It’s deprecated, but there is an equivalent field inside another nsr module. GSC also has features that will indicate if it determines pages are video-focused or not.
- queriesForWhichOfficial: This attribute identifies the “official page” for a search query. The example they give is www.britneyspears.com as the official page for its set of triples (query, country, language) – “britney spears”, “us”, “en” (for English).
- voltData: This contains user experience (UX) signals and Core Web Vitals signals for ranking changes. So it is/was a possible ranking factor?
Other Attributes
- crawlPagerank: This transfers the pagerank given during crawls from the original canonical documents to the ones Google selects during indexing. It’s interesting they say that canonicals are ones they choose and “outside sources should not set it.” It’s still up to Google to choose the canonical, which is why I sometimes see the “Google chose a different canonical” message in Google Search Console.
- scaledSelectionTierRank: Encodes the selection tier rank, used to normalize scores across different serving tiers (Base, Zeppelins, Landfills) for the document. So there are different indexing tiers?
- socialgraphNodeNameFp: This stores a fingerprint identifier for a Social Graph node name. Looks like it’s used as a key to retrieve social search data before processing the document in the Mustang backend.
- travelGoodSitesInfo: This stores evaluations or classifications of travel sites, identifying those considered reputable or high-quality sources (assuming “good” simply means good).
The travel sites attribute is an interesting one because it singles out the travel industry. I haven’t seen other industries singled out like this yet. Not surprised if the continued decline in search visibility of travel sites is due to that. There are other modules and attributes for travel sites that I’ve seen and will look more into.
5. QualityCopiaFireflySiteSignal Module – Site-Level
The Firefly module contains many site-level ranking signals that are used in the “search stack.” Most of the attributes have no descriptions, so these are mostly derived from the attribute name.
- dailyClicks: Number of daily clicks a site receives.
- dailyGoodClicks: Number of daily clicks classified as ‘good’, though the criteria for ‘good’ are not specified.
- dataTimeSec: Timestamp data likely representing when the data was recorded.
- firstBoostedTimeSec: The first timestamp when the site was boosted.
- impressionsInBoostedPeriod: Number of impressions the site received during the period it was boosted. Not clear what ‘impressions’ means here – it could be search result impressions or site views.
- latestBylineDateSec: The most recent timestamp for a byline, a content freshness indicator.
- latestFirstseenSec: Timestamp of when the site was first seen or indexed.
- numOfArticles8: Number of articles on the site. I don’t know for sure what the ‘8’ is for, though it may refer to the 0.8 lattice score threshold mentioned in numOfArticlesByPeriods below.
- numOfArticlesByPeriods: A list of article counts over 30-day periods, with the most recent period first. Only includes articles with a lattice score of 0.8 or higher. Not sure what the ‘lattice article score’ is.
- numOfGamblingPages: Number of pages related to the gambling industry. This is an interesting attribute that they’d single out gambling here. But it could also be using ‘gambling’ metaphorically and referring to high-risk, high-reward type of content. Potentially finance and investment articles.
- numOfUrls: Total number of URLs on the site.
- numOfUrlsByPeriods: Number of URLs over 30-day periods, with the most recent period first.
- totalImpressions: Total number of times the site was viewed or shown in search results.
There are a few attributes that aren’t completely clear here. Maybe they’re explained more elsewhere, but I haven’t found them yet. From what I can infer from this, the importance is on:
- Content freshness and velocity
- Clicks matter
- There may be a site-level content quality metric using the ratio of ‘good’ articles vs. ‘not so good’ articles. Consistent production would factor in too, given that publishing velocity is tracked over 30-day periods (sketched below).
- There might be something there with the ratio of URLs to Articles and the type of content that’s regularly published to the site. While every Article has a URL, not all URLs are Articles.
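If those inferences hold, the ratios could be computed straight from the leaked field names. A sketch – the attribute names are from the module, but the ratio logic is entirely my inference:

```python
def firefly_ratios(signal: dict) -> dict:
    """signal: QualityCopiaFireflySiteSignal-style fields as a dict."""
    good_click_ratio = signal["dailyGoodClicks"] / max(signal["dailyClicks"], 1)

    # Share of URLs that are articles; '8' read as the 0.8 lattice
    # threshold (my guess, see numOfArticles8 above).
    article_ratio = signal["numOfArticles8"] / max(signal["numOfUrls"], 1)

    # Publishing velocity: article counts per 30-day period, newest first.
    periods = signal["numOfArticlesByPeriods"]
    recent_velocity = periods[0] if periods else 0

    return {
        "good_click_ratio": good_click_ratio,
        "article_ratio": article_ratio,
        "recent_velocity": recent_velocity,
    }
```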
If you want to do your own exploration of the documents, head over to hexdocs.pm.