This is the first part of a series reviewing bias in knowledge graphs (KGs). We aim to describe methods for identifying bias, measuring its impact, and mitigating that impact. In this part, we give a broad overview of the topic.
image credit: Mediamodifier from Pixabay
Knowledge graphs, graphs with built-in ontologies, create unique opportunities for data analytics, machine learning, and data mining. They do this by enhancing data with the power of connections and human knowledge. Microsoft, Google, and Facebook actively use knowledge graphs in their products, and the interest from large and medium enterprises is accelerating. Andrew Reed gives a great overview of knowledge graphs in a previous article.
How are knowledge graphs used? Often they are deployed in the backend of an application, for example, supporting search results or responses from conversational AI. In other cases, knowledge graphs are used more directly to grow a knowledge base by finding or validating new information.
As the usage of this technology ramps up, bias in these systems becomes a problem that can contaminate results, degrading the user experience or driving bad decisions. In the last 1-2 years, interest has grown in identifying and removing bias.
Here are some hypothetical cases where bias in knowledge graphs could raise issues:
Conversational AI: Catherine, a college junior, interacts with a ‘career bot’, a conversational AI agent that offers job advice to graduating students. A knowledge graph based on the university’s record of successful alumni underpins the AI agent. Catherine is a pre-med major with aspirations to become a surgeon. In the school’s records, most successful surgeons are male. The conversational AI steers Catherine towards medical fields where there are historically more women.
image credit: bongkarn thanyakij from Pexels
Search: John is using a search engine to research vaccines. He is a layman with no deep knowledge of this area. The search results include hyperlinks and a sidebar of information and links generated from a large structured data source (based on “Wiki-Encyclopedia”). Wiki-Encyclopedia’s article has been curated and updated by many people who hold strong but false notions about the side effects and efficacy of vaccines. As a result, when John reviews the search results and sidebar, he comes away with flawed, poorly informed notions about vaccines.
Knowledge Base Building: A hospital is building and expanding a knowledge graph. Part of this process involves algorithmically accepting or rejecting new ‘facts’ to add to the knowledge graph. If the foundational data is itself biased, it could lead to the machine rejecting legitimate facts that go against the bias of the foundational data.
Types of Bias
In general, our work is focused on bias that results in “systematic errors of judgment and decision making” by the consumers of KG & ML applications*.
Bias is a broad topic with many context-dependent definitions. Data scientists and statisticians are concerned with bias that is more technical and measurable, while less technical stakeholders may have their own definitions and standards for identifying when bias occurs.
Within the machine learning community, several types of bias have been identified and studied. Mehrabi et al. define 23 types of bias relevant to machine learning in a recent paper.
Bias Along the ML/Analytical Pipeline
Aside from the types of bias, we can also ask where bias enters. It can be identified at several stages of an analytical or machine learning pipeline.
Data. Structured and unstructured data form the raw materials for building knowledge graphs. This data can be crowd-sourced, as with Wikipedia and Amazon’s Mechanical Turk, or it can be gathered and curated privately, as with a private corporation’s records and transactions.
If data was generated by people with a prevalent opinion (self-selection bias) or by a majority of people who share a certain cultural perspective (sometimes called representational or population bias), this can impact the downstream results. An example of self-selection bias is when customers who have strong motivations write service reviews. These may not reflect the majority of customers, but if a knowledge graph is built on top of such data, it may learn a distorted view of customer sentiment.
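To make the review example concrete, here is a minimal sketch in Python. The satisfaction scores and the rule for who writes a review are invented for illustration; the point is only that a dataset built from self-selected reviewers can overstate dissatisfaction relative to the full customer base.

```python
# Hypothetical satisfaction scores (1 = very unhappy, 5 = very happy)
# for every customer, including those who never write a review.
population = [1, 1, 3, 3, 4, 4, 4, 4, 4, 5]


def writes_review(score):
    # Assumption: only customers with strong feelings bother to review.
    return score in (1, 5)


def negative_share(scores):
    # Fraction of scores that count as dissatisfied (2 or lower).
    return sum(1 for s in scores if s <= 2) / len(scores)


# The reviews are the only data a downstream knowledge graph would see.
reviews = [s for s in population if writes_review(s)]

print("dissatisfied share, all customers:", negative_share(population))
print("dissatisfied share, reviewers only:", negative_share(reviews))
```

In this toy data, 20% of all customers are dissatisfied, but two of the three reviewers are. A graph built only from the reviews would inherit the inflated figure.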
Semantic/Ontology. Ontologies are frameworks of meaning that support the input data and their relationships. Such frameworks are constructed top-down or bottom-up, and can be manually designed or formed algorithmically. If built by a team of experts, conscious and representational bias can impact the structure of the ontology. If built by machine, bias in the underlying data can bleed into the ontology.
An example can be found in geographical ontologies. Anthropocentric biases lead designers to overemphasize human-centric locations versus natural ones. The Place branch of the DBpedia ontology (as of 2015) contained “dozens or even hundreds of classes for various sub-classes of restaurants, bars, and music venues, but only a handful of classes for natural features such as rivers” [Janowicz].
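One rough way to surface this kind of imbalance is to count the classes that sit under each branch of an ontology. The sketch below uses a tiny, made-up class hierarchy. The class names are illustrative only and are not DBpedia's actual classes, and the dictionary layout is an assumption chosen for brevity.

```python
# Hypothetical ontology fragment: each class maps to its direct subclasses.
ontology = {
    "Place": ["Venue", "NaturalFeature"],
    "Venue": ["Restaurant", "Bar", "MusicVenue"],
    "Restaurant": ["SushiRestaurant", "PizzaRestaurant", "Steakhouse"],
    "Bar": ["WineBar", "SportsBar"],
    "MusicVenue": ["JazzClub"],
    "NaturalFeature": ["River", "Mountain"],
}


def count_descendants(cls):
    # Count every class below `cls`, at any depth.
    children = ontology.get(cls, [])
    return len(children) + sum(count_descendants(child) for child in children)


for branch in ontology["Place"]:
    print(branch, count_descendants(branch))
```

A large gap between sibling branches does not prove bias on its own, but it flags where the ontology describes one part of the world in much finer detail than another.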
Knowledge Graph Embeddings. Embeddings are lower-dimensional representations that enable more efficient processing of knowledge graph data, which is normally in a high-dimensional and hard-to-wrangle form. It has recently been shown that social biases in knowledge graphs can be passed on to their embeddings [Fisher].
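As a rough illustration of what bias in an embedding space can look like, the sketch below projects profession vectors onto a "gender direction" (the difference between a male and a female vector). This is a simple projection check, not the measurement method used in [Fisher], and the three-dimensional vectors are hand-picked toy values rather than learned embeddings.

```python
import math

# Toy embeddings, hand-picked for illustration only (not learned from data).
embeddings = {
    "male": [1.0, 0.0, 0.0],
    "female": [-1.0, 0.0, 0.0],
    "surgeon": [0.6, 0.8, 0.0],
    "nurse": [-0.6, 0.8, 0.0],
    "teacher": [0.0, 1.0, 0.0],
}


def unit(vector):
    # Scale a vector to length 1 so projections are comparable.
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def gender_projection(entity):
    # Cosine between the entity vector and the male-minus-female direction.
    # Positive values lean "male", negative values lean "female".
    direction = unit(
        [m - f for m, f in zip(embeddings["male"], embeddings["female"])]
    )
    vector = unit(embeddings[entity])
    return sum(d * v for d, v in zip(direction, vector))


scores = {name: gender_projection(name) for name in ("surgeon", "nurse", "teacher")}
for name, score in scores.items():
    print(f"{name}: {score:+.2f}")
```

In this toy space, "surgeon" leans toward the male side, "nurse" toward the female side, and "teacher" sits in the middle. With real embeddings, the same kind of check would be run over vectors learned from the graph.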
Inferential. Inference occurs when a query, machine learning algorithm, or fact-learning algorithm draws conclusions from a knowledge graph or its embeddings. An oft-mentioned example is that of an inferential algorithm learning that only men can be the US President, because historically that has been the only case.
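The US President example can be sketched with a deliberately naive, frequency-based predictor. The records, identifiers, and function name below are hypothetical. The sketch only shows how an algorithm that learns purely from historical counts assigns zero probability to anything it has never observed.

```python
from collections import Counter

# Hypothetical (person, gender, office) records standing in for graph facts.
records = [
    ("person_1", "male", "US President"),
    ("person_2", "male", "US President"),
    ("person_3", "male", "US President"),
    ("person_4", "female", "US Senator"),
    ("person_5", "male", "US Senator"),
]


def gender_probabilities(office):
    # Estimate P(gender | office) from raw historical counts, no smoothing.
    counts = Counter(gender for _, gender, o in records if o == office)
    total = sum(counts.values())
    if total == 0:
        return {}  # No evidence for this office at all.
    return {gender: counts[gender] / total for gender in ("male", "female")}


print(gender_probabilities("US President"))
print(gender_probabilities("US Senator"))
```

A fact-checking step built on these estimates would treat "a woman is US President" as impossible, which is the historical pattern mistaken for a rule.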
In the next part of this series, we’ll look at concrete examples of data and ontology bias in more detail, and examine known methods to detect and measure such bias.
J. Fisher, Measuring Social Bias in Knowledge Graph Embeddings, Dec 2019.
K. Janowicz et al., Debiasing Knowledge Graphs: Why Female Presidents are not like Female Popes, Oct 2018.
N. Mehrabi et al., A Survey of Bias and Fairness in Machine Learning, Sept 2019.
*Drawing from the definition in the K. Janowicz reference.