Journalists are often required to find persons of interest in large sets of documents, which can be like searching for a needle in a haystack. With the help of AI, we try to link relevant names in these documents to specialised databases in order to find out who deserves our attention.

A JournalismAI Project

GitHub Repository


A copy of the code used for this project can be found in the JAI Bad Will Hunting GitHub repository.

Table of Contents


🔍 Our motivation for the Fellowship

✂️ Focusing our approach

🛠️ Constructing an entity linking pipeline

🗂️ The datasets

🧑‍🏫 Training the model

✅ Evaluating the model

💡 Challenges and lessons learnt

↪️ Exploring alternative approaches

What next?

🔍 Our motivation for the Fellowship

For 2022’s JournalismAI Fellowship, Daily Maverick, Follow the Money and The Guardian teamed up with a shared aim to uncover “bad actors” hidden away within extensive digital corpora, of which whistleblower leaks have become the most emblematic example.

In the age of the internet, the landscape of journalism and the notion of journalism’s role as society’s watchdog have not remained untouched by technological advancements. The ever-growing digitisation of information, while providing valuable access to previously unobtainable sources, has left journalists inundated with more data than they could ever process manually. In already resource-strapped newsrooms, the time and tools needed to sift through large datasets for potential stories are costly.

With all three publishers working in investigative journalism involving large data leaks, our vision at the start of the Fellowship was to build an AI pipeline to automatically surface and organise “people of interest” in large volumes of data. This would make uncovering “bad actors” quicker and easier and lead to exclusive stories.


✂️ Focusing our approach

Data journalism invariably requires several steps including handling, organising, sifting through, and visualising large quantities of data. This entails obtaining the data and using available tools to search for entities (e.g. people or organisations) in order to find relevant connections between them. One of the biggest challenges newsrooms face in this process, and one we certainly share across our three organisations, is matching the thousands of entities extracted from large datasets using Natural Language Processing (NLP) to real-world people, places and things. Doing so typically demands a centralised point of reference, often in the form of specialised databases providing real-world context about entities across categories. These databases are formally known as Knowledge Bases (KBs). Wikidata is perhaps the most widely used KB, due to the richness of its content as one of Wikipedia’s sister projects, but many others exist.

While human agents are naturally good at matching entities in a text to a point of reference in a KB by inferring from multiple textual hints, an AI algorithm needs to be shown a large set of examples before it can start mimicking this ability. On the other hand, even an imperfect model can automate and therefore expedite groundwork that would otherwise take a journalist several days, potentially bringing it down to just a few minutes.

Yet unambiguously mapping the entities extracted from text documents (mentions) to the correct entities in a KB is a highly complex task, still under active academic research. To illustrate this point, consider the task of linking multiple individuals with political or business interests who share the same surname to the right person within a KB. These individuals might share family ties, live in the same region, or have any number of other common traits which would be obvious to a human, but intractable to an algorithm.

The issue is compounded by highly common given names or surnames, as there will likely be a very large number of candidates within KBs sharing that same name. A good example is “Adam Smith”, which is associated with no fewer than 20 person entities on Wikipedia, chief among them the famed philosopher and economist, as well as three additional geographical and organisation entities. We refer to different entities with the same name in the KB as sharing an “alias”.

Solving this task is therefore not as simple as writing an algorithm to match text mentions to KB entities with an identical alias. An appropriate solution should be able to infer the correct KB entity from the context surrounding the mention. This is the problem our team decided to focus on for the duration of the Fellowship.

🛠️ Constructing an entity linking pipeline

The aim we set ourselves with this project was thus to contribute to tackling the ongoing difficulty in assigning real-world meaning to textual mentions, often called “Entity Linking” or “Disambiguation”. A widely studied AI-based answer to this problem is to train a specific kind of model, aptly called an Entity Linker (EL). Figure 1 shows a flow diagram of the main pieces involved in the EL approach. This type of model learns how to map the text mentions (inside a Document in Figure 1) to unique identifiers within KBs, thereby grounding the mentions in the “real world” by generalising from thousands of annotated examples. With this approach, we envisioned our proof-of-concept pipeline could be adaptable to particular datasets or leaks, by providing specific examples to learn from on a case-by-case basis.


Figure 1: End-to-end Entity Linking pipeline

Our Entity Linking pipeline starts with an element capable of extracting all the mentions in any given text. This element is called a Named Entity Recognition model (NER in Figure 1). The NER element automatically scours the text, locating and classifying entities into predefined categories such as person, organisation, location, etc. (Figure 2). Due to the time constraints of the Fellowship we decided to focus the development of our prototype exclusively on person mentions.


Figure 2: An illustration of how a Named Entity Recognition model identifies entities in a text.
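To make this step concrete, here is a minimal sketch of how a pre-trained spaCy pipeline can surface person mentions; the model name and the example sentence are illustrative placeholders rather than our exact setup.

```python
# A minimal sketch of the NER step with a pre-trained spaCy pipeline.
# The model and the sentence are illustrative, not taken from our dataset.
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline with an NER component
doc = nlp("Adam Smith met the finance minister in Edinburgh last week.")

# Keep only person mentions, as our prototype does
for ent in doc.ents:
    if ent.label_ == "PERSON":
        print(ent.text, ent.start_char, ent.end_char, ent.label_)
```

Each extracted span carries its character offsets, which the later linking step can use to attach a KB identifier to the mention.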

Once we locate relevant mentions of person names in a text, the EL module (Candidate Generation, Scoring, Threshold in Figure 1) acts as an intermediary between each mention and a KB match. When linking occurs, the mention is enriched with an identity supplemented by “real-world” information and is therefore disambiguated. When we extract the alias “Adam Smith” from a given paragraph, we now know whether that mention refers to a philosopher, a politician or a fictional character without having to read the document and manually register the information. If there happens to be a different Adam Smith somewhere in the text, we will be able to disambiguate the two automatically.

Besides the KB, our EL module requires a function to generate plausible candidates. In order to optimise the performance of the pipeline, the candidate generation needed to be able to filter through hundreds of thousands of KB entities to surface only a handful of viable candidates for each mention. The selected candidates are then presented to the EL element, a specialised machine learning model capable of picking the correct KB entity from the candidate list given the local semantic context of the mention.
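As a schematic illustration of that flow (and not our actual implementation), the sketch below strings together candidate generation, scoring and a confidence threshold around a toy KB; the scores and threshold are made-up stand-ins for the trained model described above.

```python
# A self-contained toy version of the per-mention flow in Figure 1:
# candidate generation, scoring, and a confidence threshold.
def generate_candidates(mention, kb):
    # Toy filter: keep KB entries that share the mention's surname.
    surname = mention.split()[-1].lower()
    return [(kb_id, name) for kb_id, name in kb.items() if surname in name.lower()]

def link_mention(mention, kb, threshold=0.5):
    candidates = generate_candidates(mention, kb)
    if not candidates:
        return None                      # nothing plausible in the KB: leave unlinked
    # The real pipeline scores candidates with a trained model that reads the
    # mention's context; exact name equality stands in for that score here.
    scored = [(1.0 if name.lower() == mention.lower() else 0.3, kb_id)
              for kb_id, name in candidates]
    best_score, best_id = max(scored)
    return best_id if best_score >= threshold else None

toy_kb = {"LS-1": "Adam Smith", "OS-7": "Adam D. Smith", "OS-9": "John Smith"}
print(link_mention("Adam Smith", toy_kb))   # -> LS-1
```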

🗂️ The datasets

Having designed a proof-of-concept EL pipeline, we needed to consider which datasets would form the basis of our project and could be used to train a prototype on. Partly due to time constraints, we decided to use a set of articles from the Guardian newsroom as a stand-in for a large pool of digital documents. Additional advantages of the dataset were that it was readily accessible via the Guardian’s content API, required minimal pre-processing, and contained a wide range of well-known entities to develop and validate our approach.
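As an illustration of the data-gathering step, the snippet below shows how article text can be pulled from the Guardian’s public content API with the requests library; the API key, search term and page size are placeholders, and the parameter names follow the public Open Platform documentation.

```python
# A minimal sketch of fetching article bodies from the Guardian content API.
# YOUR_API_KEY and the query are placeholders.
import requests

resp = requests.get(
    "https://content.guardianapis.com/search",
    params={
        "api-key": "YOUR_API_KEY",
        "q": "sanctions",            # illustrative search term
        "show-fields": "bodyText",   # include the article body in the response
        "page-size": 10,
    },
)
resp.raise_for_status()
for result in resp.json()["response"]["results"]:
    body = result.get("fields", {}).get("bodyText", "")
    print(result["webTitle"], len(body))
```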

At the same time we wanted the project to be innovative by selecting KBs which, to the best of our knowledge, had not been used for EL before. In fact, most EL models currently available have been trained on datasets constructed from massive online KBs such as DBpedia or Wikipedia/Wikidata, but our intention was to work with less common resources linked to subjects such as politics, finance, crime and corruption. We also reasoned these specialised KBs might contain domain-specific entities not found in the more generic KBs. We thus decided to combine two open access KBs of relevance in investigative work: LittleSis and Open Sanctions. LittleSis focuses on social networks of politicians, business leaders, lobbyists, financiers, and affiliated institutions, while Open Sanctions is a directory of people and organisations associated with criminal exposure and activity. Both contain information on powerful or politically exposed people and organisations, as well as little-known associates.
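To give a flavour of how such entries can end up in a machine-readable reference, here is a hedged sketch of loading merged records into a spaCy KnowledgeBase; the records, identifiers and probabilities are invented for illustration, and the exact KB class varies between spaCy versions (KnowledgeBase vs InMemoryLookupKB from spaCy 3.5 onwards).

```python
# A hedged sketch of assembling a spaCy KnowledgeBase from merged records.
# The records and IDs below are made up; vector length matches en_core_web_md.
import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.load("en_core_web_md")  # medium model, so descriptions get 300-d vectors
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=300)

records = [  # illustrative, merged and deduplicated entries
    {"id": "LS-1", "name": "Adam Smith", "description": "US politician from Washington state"},
    {"id": "OS-7", "name": "Adam Smith", "description": "Director of a sanctioned holding company"},
]
for rec in records:
    # Encode the KB description so the linker can compare it with mention context
    kb.add_entity(entity=rec["id"], freq=1, entity_vector=nlp(rec["description"]).vector)

# Entities sharing a name are registered under a single alias
kb.add_alias(alias="Adam Smith", entities=["LS-1", "OS-7"], probabilities=[0.5, 0.5])
kb.to_disk("./kb")
```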


With all the required elements in place, we proceeded with training our model.

🧑‍🏫 Training the model

Machine learning models learn to make predictions by generalising from thousands of examples. To train the EL model we first needed to generate a training dataset by hand. EL training datasets consist of a large number of texts containing mentions (identified through the NER module) and the link between each mention and the correct identifier in the KB.

In this project we used two tools created by Explosion, a software company specialising in NLP. spaCy is one of the leading open-source libraries for advanced NLP using deep neural networks. Prodigy is an annotation tool that provides an easy-to-use web interface for quick and efficient labelling of training data.

We used these tools to put together a semi-automatic training-data assembly step, with our team members acting as human-in-the-loop agents for case-by-case curation and quality assessment, a process known as annotation.


Figure 3: An example of how Prodigy is used to annotate training data.
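Once exported from the annotation tool, each accepted annotation can be represented in spaCy’s entity-linking training format, roughly as in the sketch below; the sentence, span offsets and the “LS-1” identifier are purely illustrative.

```python
# A sketch of a single annotated training example in spaCy's entity-linking
# format: the character span of the mention plus the gold KB identifier.
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
text = "Adam Smith criticised the defence budget on Tuesday."
doc = nlp.make_doc(text)
annotations = {
    "entities": [(0, 10, "PERSON")],      # the NER span for the mention
    "links": {(0, 10): {"LS-1": 1.0}},    # the gold KB entry for that span
}
example = Example.from_dict(doc, annotations)
```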

As with all NLP projects involving hand annotation, we started by writing a clear and concise set of guidelines. By following the rules from the annotation guide, annotators minimise undesired ambiguity in the training dataset. We also created a decision tree chart to convey the annotation instructions simply and visually, so multiple annotators could understand the task in the same way (Figure 4).


Figure 4: Annotation decision tree

The guide also included the edge cases we encountered during our group annotation sessions. These discussions shaped the guide nicely and added a clear structure to our future work.

Altogether we counted on a team of nine annotators working for a total of about 40 hours, at a rate of 100 annotations per hour, to produce more than 4,000 examples linking mentions in text to entries in the KB. We used 70% of these as a training dataset for our model and reserved the remaining 30% to evaluate performance independently.

After several months of work, with all pieces of the pipeline in place and the training data we had generated, we were able to successfully train an EL model prototype using spaCy’s dedicated package for the task.
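For reference, the sketch below shows how a KB like ours could be wired into spaCy’s entity_linker component; the paths, vector length and model name are placeholders, and this only illustrates how the pieces connect rather than our exact training setup.

```python
# A hedged sketch of registering a KB with spaCy's entity_linker component.
import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.load("en_core_web_md")

def load_kb(vocab):
    kb = KnowledgeBase(vocab=vocab, entity_vector_length=300)
    kb.from_disk("./kb")   # a KB assembled as in the earlier sketch
    return kb

entity_linker = nlp.add_pipe("entity_linker", last=True)
entity_linker.set_kb(load_kb)
# Training then proceeds on the annotated examples, e.g. via spaCy's
# config-driven `spacy train` workflow referencing this pipeline and KB.
```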

✅ Evaluating the model

To evaluate our prototype model we reserved around 1,000 of our annotations as a separate test dataset, preventing information leakage between the training and test data. Based on a preliminary analysis, our bespoke rules-based candidate generation algorithm was able to rank the correct candidate within the top 10 positions with a high degree of success. Our early testing also suggested that the subsequent machine learning model’s accuracy was around 70% in selecting the right candidate out of these options. Nonetheless, upon closer inspection of the wrong predictions, it was clear that the model still struggled to select the appropriate candidate in too many cases, especially for single-name mentions. Performance therefore needs further improvement before we can deploy the EL pipeline in our newsrooms.
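As a rough illustration of how such an accuracy figure can be computed, the helper below runs a trained pipeline over held-out texts and compares each predicted KB identifier with the gold annotation; the shape of `test_examples` is an assumption made for this sketch.

```python
# A sketch of measuring linking accuracy on held-out annotations.
def linking_accuracy(nlp, test_examples):
    """test_examples: assumed list of (text, (start_char, end_char), gold_kb_id) triples."""
    correct = total = 0
    for text, span, gold_id in test_examples:
        doc = nlp(text)
        # Map each predicted entity span to the KB id the linker assigned it
        predicted = {(ent.start_char, ent.end_char): ent.kb_id_ for ent in doc.ents}
        total += 1
        if predicted.get(span) == gold_id:
            correct += 1
    return correct / total if total else 0.0
```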

By carefully inspecting the model’s predictions, however, we were able to extract many valuable learnings which will help us improve on future iterations, some of which we share below.

💡 Challenges and lessons learnt

Creating datasets for an EL task proved extremely challenging. When we decided to merge two databases we did not anticipate the amount of effort needed to deal with challenges such as duplicate entities, clashing/outdated information and inconsistent formatting. Furthermore, some of the descriptions for lesser-known people were too short or vague to be of use for the model.

The choice of KB added further complexities in the form of unexpected issues around surfacing the right candidates for annotation. To limit the amount of time required per annotation, we decided to display only the 10 best matches between KB aliases and the mention, down from a pool that could run to several thousand for particularly common names. The annotators could then select the correct case from within the list of 10 candidates, which considerably sped up the process. On the other hand, this constraint also required us to spend some time devising an algorithm to filter and refine the list of candidates surfaced for each mention in the text. In the end, we used a combined metric of fuzzy string matching and shared vocabulary between the mentions and the KB descriptions.
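The sketch below captures the spirit of that combined metric in simplified form, using difflib from the Python standard library as a stand-in fuzzy matcher; the weights and the shape of the KB entries are illustrative rather than our tuned values.

```python
# A simplified candidate-filtering score: fuzzy name similarity plus
# vocabulary shared between the mention's context and the KB description.
from difflib import SequenceMatcher

def candidate_score(mention, context, alias, description):
    # Fuzzy similarity between the mention and the KB alias (0 to 1)
    name_similarity = SequenceMatcher(None, mention.lower(), alias.lower()).ratio()
    # Fraction of the KB description's vocabulary also present in the context
    context_tokens = set(context.lower().split())
    description_tokens = set(description.lower().split())
    shared = len(context_tokens & description_tokens) / max(len(description_tokens), 1)
    # Illustrative weighting between the two signals
    return 0.7 * name_similarity + 0.3 * shared

def top_candidates(mention, context, kb_entries, k=10):
    # kb_entries is assumed to be a list of dicts with "name" and "description" keys
    return sorted(
        kb_entries,
        key=lambda e: candidate_score(mention, context, e["name"], e["description"]),
        reverse=True,
    )[:k]
```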

All these issues would have been less prevalent in more widely used KBs, and they really illustrate that KB quality is a major determinant of success in Entity Linking. The limitations of each particular KB should therefore be well understood and factored in as early in the development process as possible.

Another issue we faced related to the sampling of mentions from our Guardian articles. Despite our best efforts, it became clear that our datasets did not contain enough examples of mentions with multiple candidates sharing the same alias (e.g. “Adam Smith”), yet these were crucial for the model to learn as intended. There were also not enough examples of mentions consisting of a single name, which are particularly hard to disambiguate; our prototype particularly struggled with these examples in the test dataset.

Moreover, although the team was able to annotate a fairly large number of examples, not all annotations could be used, mainly because of missing KB matches or insufficient context.

Finally, the paragraphs we sampled need more variability, as more of them pertained to politics than to topics such as business, finance and crime.

Due to these factors, more training data is needed to show the second iteration of the model enough examples to learn how to solve particularly hard links. With the current sampling approach we would need one annotator to work non-stop for nearly 6 weeks to collect 10,000 useful examples. Fortunately, we can apply the lessons we’ve learnt during the Fellowship to improve this process and considerably speed up the data collection process.

Despite the challenges we faced, some of which are not yet fully resolved, the learnings from this project have been invaluable. Having taken part in this journey, we now have a much better understanding of the advantages and limitations of entity linking, some of which are transferable across the suite of tools and techniques our teams typically rely on.

Another gain from this work is that it opened up new avenues, introduced in the following section, which we have started exploring and will continue to work on beyond the Fellowship.

↪️ Exploring alternative approaches

A separate workflow, arising from the many discussions generated by this project, is motivated by finding connections within texts by looking at an entity’s relationships with preceding and subsequent entities. This concept would bypass the need to look for context in external KBs, relying instead on extracting context from the documents themselves. For instance, in a paragraph in which a famous politician is named, which entities identify them as a “person of interest”? Could mentions like political parties or political advisers provide enough context?

While some of these examples are obvious to a human agent, there are many people of interest that are not well known, but for whom we believe it is possible to find enough context and relationships within a piece of text to identify them as “key players”. The question then becomes: “Can we train a model exclusively using context in text to identify these relationships?”

Yet another approach would be to use the syntactic relationships in a piece of text to find the relationships between entities. The spaCy library, for instance, offers a powerful dependency parser which identifies syntactic relationships between all words in a sentence (Figure 5). Can we use these relationships to link different entities?


Figure 5: spaCy’s dependency parser
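For instance, the following sketch prints each token’s dependency relation and its syntactic head, the kind of structure we could walk to relate two entities appearing in the same sentence; the model name and example sentence are placeholders.

```python
# A small sketch of inspecting spaCy's dependency parse, as in Figure 5.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The minister's adviser, Adam Smith, met lobbyists from the party.")

# Each token exposes its syntactic relation (dep_) and the token it attaches to (head)
for token in doc:
    print(f"{token.text:<10} {token.dep_:<10} head={token.head.text}")
```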

What next?

Once we have a robust approach which performs well on the most challenging tasks, we can extend it to include more entity types. The end goal of this pipeline would be to use disambiguated entities to generate graphs of the relationships between persons, organisations and other entities we find in large document collections of less structured text, like government and business documents provided through FOIA requests or leaks.

For now, the foundational and exploratory work we've done together over the last few months has highlighted the high bar that these tools must clear to advance investigative journalism. It's a challenging task, but one we will continue to develop fruitfully through our work and upcoming JournalismAI initiatives.


This project is part of the 2022 JournalismAI Fellowship Programme. The Fellowship brought together 46 journalists and technologists from across the world to collaboratively explore innovative solutions to improve journalism via the use of AI technologies. You can explore all the Fellowship projects at this link.

The project was developed as a collaboration between The Guardian, Follow the Money and Daily Maverick. The fellows who contributed to the project are: Luis Flores, Data Scientist; Michel Schammel, Senior Data Scientist; Dimitri Tokmetzis, Senior Investigative Journalist & Data Team Lead; Heleen Emanuel, Data Journalist & Creative Developer; Alet Law, Audience Development Manager; and Tinashe Munyuki, Retention Manager.

JournalismAI is a project of Polis – the journalism think-tank at the London School of Economics and Political Science – and it’s sponsored by the Google News Initiative. If you want to know more about the Fellowship and the other JournalismAI activities, sign up for the newsletter or get in touch with the team via [email protected]