In Determining Truthfulness, a Google Team Would Like to Do Your Thinking for You


New Scientist reports on a paper by a Google team, "Knowledge-Based Trust: Estimating the Trustworthiness of Web Sources." Coming from a company with a sizable and voluble contingent of progressives in its work force, the proposal sounds like a nightmare for politically incorrect Web sources:

The Internet is stuffed with garbage. Anti-vaccination websites make the front page of Google, and fact-free "news" stories spread like wildfire. Google has devised a fix — rank websites according to their truthfulness.

Google’s search engine currently uses the number of incoming links to a web page as a proxy for quality, determining where it appears in search results. So pages that many other sites link to are ranked higher. This system has brought us the search engine as we know it today, but the downside is that websites full of misinformation can rise up the rankings, if enough people link to them.

A Google research team is adapting that model to measure the trustworthiness of a page, rather than its reputation across the web. Instead of counting incoming links, the system — which is not yet live — counts the number of incorrect facts within a page.

Ominous? Not really. You have to understand, the situation is complex. Google is a business, and a very successful one, where the people actually in charge want to make money. Rolling out an innovation that would turn off many users is not something they seem likely, at first glance, to do. If Google were getting ready to make such a change, is this how they'd go about it, heralded by a research paper published online? Hardly. Those in the know about tech would laugh. It's just research, at least for now. Major companies fund R&D to the tune of billions. It doesn't mean anything until the marketing department gets involved.

The New Scientist article is certainly overselling the significance of the paper. But what if this "fix" really were to be introduced?

Determining "facts" inherently introduces bias, even if unintended. For Google's part, of course, the company promotes its algorithms as unbiased; that is its business model. But if implemented, the approach would not magically compute a measure of trustworthiness for content on the Web. The cited paper uses a model based on information extraction to find simple statements (such as "Larry Page is CEO of Google") that are then tagged and compared against a database of known facts.
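To make the idea concrete, here is a toy sketch (mine, not Google's) of that extract-and-compare approach: pull simple subject-predicate-object statements out of a page and score the page by the fraction that match a reference database. The actual paper uses far more sophisticated probabilistic models over extractors and sources; the knowledge base and extraction rule below are invented purely for illustration.

```python
# Toy sketch of the extract-and-compare idea, not the paper's actual model.
# A page's "trust" here is simply the fraction of its extracted
# (subject, predicate, object) triples that agree with a reference knowledge base.

from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Invented reference knowledge base, standing in for a real fact store.
KNOWLEDGE_BASE: Set[Triple] = {
    ("Larry Page", "ceo_of", "Google"),
}

def extract_triples(page_text: str) -> Set[Triple]:
    """Stand-in for a real information-extraction system.

    A production extractor would use NLP; this toy version only recognizes
    sentences of the form '<Name> is CEO of <Company>.'
    """
    triples = set()
    for sentence in page_text.split("."):
        if " is CEO of " in sentence:
            subject, obj = sentence.strip().split(" is CEO of ")
            triples.add((subject.strip(), "ceo_of", obj.strip()))
    return triples

def trust_score(page_text: str) -> float:
    """Fraction of extracted facts that match the knowledge base (0.0 if none)."""
    triples = extract_triples(page_text)
    if not triples:
        return 0.0
    correct = sum(1 for t in triples if t in KNOWLEDGE_BASE)
    return correct / len(triples)

if __name__ == "__main__":
    page = "Larry Page is CEO of Google. Eric Schmidt is CEO of Google."
    print(trust_score(page))  # 0.5: only one of the two extracted facts matches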

Such a method would guarantee numerous errors, though, if only because "facts" are constantly changing. Is "Eric Schmidt is CEO of Google" a fact? Even worse, "facts" are embedded in a context. A site that praises ISIS may contain factual information about the number of journalists the group has beheaded to date. Is the site then more "trustworthy"? So as I said, the issues here are complex.

My guess is that Google would never use this approach as a stand-alone ranking system, because it wouldn't work well enough. Instead it would be incorporated as one signal in the existing search engine (which already uses hundreds of signals beyond link analysis). If it improved the perception of search results, and complaints about bias were confined to a small subset of users, all would be well, unless you happened to be among the minority who watched their websites suddenly disappear from the top rankings or, even worse, receive a low trustworthiness score for bogus or contestable reasons.
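If Google did fold such a score into ranking, it would presumably look less like a wholesale replacement and more like one weighted signal among many. The sketch below illustrates that idea with invented signal names and weights; it is not a description of Google's actual ranking function.

```python
# Hypothetical illustration of combining a trust estimate with other ranking
# signals. Signal names and weights are invented; real search ranking combines
# hundreds of signals in far more sophisticated ways.

def combined_rank_score(signals: dict, weights: dict) -> float:
    """Weighted sum of normalized ranking signals (all assumed to lie in [0, 1])."""
    return sum(weights[name] * signals.get(name, 0.0) for name in weights)

weights = {
    "link_popularity": 0.6,        # PageRank-style signal still dominates
    "query_relevance": 0.3,
    "knowledge_based_trust": 0.1,  # the new, fact-checking-derived signal
}

page_signals = {
    "link_popularity": 0.9,
    "query_relevance": 0.8,
    "knowledge_based_trust": 0.2,  # popular page that scores poorly on "facts"
}

print(combined_rank_score(page_signals, weights))  # 0.80 = 0.54 + 0.24 + 0.02
```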

Google, for its part, would do as it always has done and take refuge in the algorithm that creates the greatest perception of neutrality for the greatest number of its users. That's how it does business.

By the way, the question of bias is not confined to this research project at Google. It's pretty much Web-wide. Twitter, for instance, didn't "trend" certain political events considered quite major by the broader public. I believe Occupy Wall Street was one of them, ironically. The company simply claimed the event did not in fact trend, according to its algorithms. Problem solved? No, people were outraged, suspecting a hidden political bias, but the reality is that the bias was no one's "fault." It was inherent in Twitter's computation of trends. NPR put it this way:

Sometimes a topic that seems hot, like Occupy Wall Street, doesn’t trend, leading some activists to charge Twitter with censorship. But the complex algorithms that determine trending topics are intended to find what’s trending in the moment, and not what’s been around for a long time.

Returning to Google, the same issues would surface. Google could claim neutrality given its algorithm, and disgruntled users would yell into a vacuum. The issue would be particularly contentious because "facts" and "trustworthiness" are at stake, rather than mere popularity on a network.

So, the "just the facts" approach is, I think, a tenuous strategy for improving popularity-based ranking. It’s tempting to explore it, I’m sure, because it fits the narrative advanced by major companies like Google or Twitter that their algorithms are neutral and objective. Which is simply not true. With this specific proviso, seeking other signals for search other than popularity-based ones is necessary and laudable.

When Page and Brin developed PageRank in the late 1990s, the earlier Vector Space Model approach was delivering loads of irrelevant hits to Web search users. The results from sites like AltaVista were often very poor. PageRank ushered in search that worked, in the sense of returning results that were actually relevant to a query. We loved it.
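For readers who never looked under the hood, the core of PageRank is simple to sketch: a page's score depends on the scores of the pages that link to it, computed by repeatedly redistributing scores along links. The toy implementation below is the textbook power-iteration version with a damping factor, not Google's production system.

```python
# Toy PageRank via power iteration. "links" maps each page to the pages it
# links to; the damping factor models a surfer who occasionally jumps to a
# random page rather than following links.

def pagerank(links: dict, damping: float = 0.85, iterations: int = 50) -> dict:
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:  # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                for target in outlinks:
                    new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank

toy_web = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
print(pagerank(toy_web))  # C ends up ranked highest: the most "votes" point to it
```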

But today we can see the limitations. Popularity ranking squeezes out quality results whenever other content is more popular, even for the wrong reasons (say, because it's inflammatory, or short and mindless). Quality-based assessment of Web pages will be necessary for further improvement. To some extent Google already does this, using simpler methods like URL analysis, spell-checking, and so on.

So research aimed at improving existing search beyond a "mob rules" approach and toward a measure of internal quality is the right path forward. But "just the facts" is a questionable strategy. It’s fraught with technical and contextual challenges, and it further promotes the myth of objectivity on the Web.

Image: Googleplex between buildings 40 and 43, by Jijithecat (Own work) [CC BY-SA 4.0], via Wikimedia Commons.

Erik J. Larson

Fellow, Technology and Democracy Project
Erik J. Larson is a Fellow of the Technology & Democracy Project at Discovery Institute and author of The Myth of Artificial Intelligence (Harvard University Press, 2021). The book is a finalist for the Media Ecology Association Awards and has been nominated for the Robert K. Merton Book Award. He works on issues in computational technology and intelligence (AI). He is presently writing a book critiquing the overselling of AI. He earned his Ph.D. in Philosophy from The University of Texas at Austin in 2009. His dissertation was a hybrid that combined work in analytic philosophy, computer science, and linguistics and included faculty from all three departments. Larson writes for the Substack Colligo.


Tags

Computational Sciences, Continuing Series, Mind and Technology, Research, technology, Views