Methods and Tools
Open Source Tools and Methods for Open Source Investigations
Access to Yemen in order to investigate and report on human rights violations is very restricted and dangerous for independent journalists, international news agencies, UN investigation bodies, and international human rights organisations. This is the principal reason the Yemeni Archive and other documentation groups depend on verified user-generated content to assist in criminal case-building as well as human rights research.
The Yemeni Archive strives for transparency in its tools, findings, and methodologies, and works to ensure that verified content is publicly available and accessible to journalists, human rights defenders, and lawyers for reporting, advocacy, and accountability purposes.
To achieve transparency, software developed by the Yemeni Archive is released in free and open-source formats. This is done to ensure trust is built and maintained with our partners and collaborators, as well as allowing software to be reused and customised by other groups outside of the Yemeni Archive. Technical integration with existing open-source investigative tools ensures that work is not duplicated. The Yemeni Archive works alongside technologists to develop our open-source tools. Our methodology is developed in collaboration with other archival groups, as well as lawyers and journalists.
The Yemeni Archive’s Data and Operational Model is based on the Electronic Discovery Reference Model developed by Duke University School of Law. Below is a detailed description of every step of this model.
Our collection process involved establishing a standardised metadata schema alongside a database of credible sources for digital content. Sources can be direct submissions from individuals and organisations, publicly available social media accounts and channels, as well as other publicly available information.
1) Establish database of credible sources for content
Before any collection, archival, or verification of digital materials was possible, we had to establish a database of credible sources for visual content. We have identified over 1,167 credible sources, including individual journalists and field reporters, larger media houses (e.g. local and international news agencies), human rights organisations, and local field clinics and hospitals. Many of these sources began publishing or providing visual content in 2015 and also publish work in other credible media outlets.
Visual content is primarily accessed through social media channels and websites (Twitter, Facebook, YouTube, Telegram), submitted files (videos, photos, PDFs), and external and collaborators’ data sets. Changes in these data sets are tracked, meaning that all versions are saved.
Credibility is determined by analysing whether the source is familiar to us, or our existing network, as well as checking that the source’s content and reportage has been reliable in the past. This might include evaluating how long the source has been reporting and how active they are.
To identify where the source is based, social media channels might be evaluated to determine if videos uploaded are consistently from a specific location, or whether locations differ significantly. Channels and accounts might be analysed to determine whether they use a logo and whether this logo is consistently used across videos. Channels and accounts might be additionally analysed for original content to determine whether the uploader aggregates videos from other news organisations or accounts, or whether the source appears to be the primary uploader.
The following map shows the distribution of our sources:
2) Establish database of credible sources for verification
Yemeni Archive established a database of credible sources for verification. These sources provide additional information used to verify content on social media platforms or received from sources directly. Content verifiers include citizen journalists, human rights defenders and humanitarian workers based in Yemen and abroad. To preserve data integrity, sources used for content acquisition do not comprise part of the database for verification.
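The separation between acquisition sources and verification sources described above can be expressed as a simple set check. The following is a minimal sketch, assuming each source is identified by a unique string; the function name and the sample identifiers are hypothetical and do not reflect the Yemeni Archive's actual database.

```python
def overlapping_sources(acquisition_ids, verification_ids):
    """Return identifiers that appear in both databases (ideally none)."""
    return set(acquisition_ids) & set(verification_ids)


# Hypothetical identifiers, for illustration only.
acquisition = {"channel-001", "channel-002", "submitter-017"}
verification = {"verifier-101", "verifier-102"}

overlap = overlapping_sources(acquisition, verification)
if overlap:
    # A source playing both roles would undermine independent verification.
    raise ValueError(f"Sources used for both roles: {sorted(overlap)}")
```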
3) Establish standardised metadata scheme
Before we can preserve or verify any content, we must define a system through which content can be managed and organised. This is done through metadata. Establishing a data ontology or metadata scheme is necessary to assist us in organising and managing content, as well as helping users identify and understand what happened, and when and where.
Whilst recognising the need for a data ontology, or standardised metadata scheme, we also recognise that the implementation of any metadata scheme is a highly political choice. Given that there are no universally accepted, legally admissible metadata standards, efforts were made to develop a framework in consultation with a variety of international investigative bodies. These include consultations with members of the Office of the United Nations High Commissioner for Human Rights, and with other archival institutes, and human rights and research organisations.
Adding metadata happens after content is preserved, but it is crucial to define a metadata scheme before collecting and processing content.
Yemeni Archive’s collection and secure preservation workflow ensures that original content is not lost due to its removal from corporate platforms. This is achieved by collecting and securely storing digital content on external backend servers before it is taken offline and prior to basic verification procedures. Content is then backed up securely on servers throughout the world. We use Sugarcube for this process, a free and open-source software developed for human rights investigations using online user-generated content.
Sugarcube is a tool designed to support journalists, non-profits, academic researchers, human rights organisations and others with investigations using online, publicly available sources (e.g. tweets, videos, public databases, websites, online databases).
In this preservation pipeline we detect the spoken language, and standardise the data format (whilst preserving the old format). We screenshot and download the web page hosting the content. Files that are in our database are both hashed with SHA-256 and time-stamped with Enigio Time, a third-party collaborator. We hash and timestamp in order to ensure and prove data integrity, which means that data has not been changed or manipulated since it was archived.
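To illustrate how a hash supports an integrity claim, the sketch below computes a SHA-256 digest with Python's standard hashlib module and later checks a file against it. It is a minimal example under stated assumptions: the function names and record layout are hypothetical, and the locally generated UTC timestamp is only a stand-in, since it does not provide the independent proof offered by a third-party time-stamping service.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def integrity_record(path):
    """Pair the file's hash with a local UTC timestamp (illustrative only)."""
    return {
        "file": Path(path).name,
        "sha256": sha256_of_file(path),
        "hashed_at": datetime.now(timezone.utc).isoformat(),
    }


def is_unchanged(path, record):
    """True if the file still hashes to the value stored in the record."""
    return sha256_of_file(path) == record["sha256"]
```

Any change to the file, even a single byte, produces a different digest, so a mismatch between the stored and recomputed hash signals that the content is no longer identical to what was archived.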
Once content has been safely preserved, metadata is extracted from visual content. It is parsed and aggregated automatically using our predefined and standardised metadata schema. Location and source details might be included in the parsed metadata, which can be useful to geolocate where content originates.
Metadata is added both automatically and manually, depending on how it was collected, e.g. open source or closed source. A detailed description and full list of metadata field types are provided on our website. Metadata we collect includes a description of the visual object as given (e.g. YouTube title); the source of the visual content; the original link where the footage was first published; identifiable landmarks; weather (which may be useful for geolocation or time identification); specific languages or regional dialects spoken; identifiable clothes or uniforms; weapons or munitions; device used to record the footage; and media content type.
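The metadata fields listed above could be modelled as a single record type. The sketch below is an illustration only: the class name, field names, and types are assumptions made for this example and are not the Yemeni Archive's published schema, which is documented on its website.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class MediaRecord:
    # Field names are hypothetical; they mirror the list in the text above.
    description: str          # as given by the uploader, e.g. a YouTube title
    source: str               # who published or submitted the content
    original_url: str         # where the footage was first published
    media_type: str           # e.g. "video" or "photo"
    landmarks: List[str] = field(default_factory=list)
    weather: Optional[str] = None
    languages: List[str] = field(default_factory=list)
    clothing_or_uniforms: List[str] = field(default_factory=list)
    weapons_or_munitions: List[str] = field(default_factory=list)
    recording_device: Optional[str] = None

    def missing_fields(self):
        """Names of fields that are still empty, e.g. awaiting manual annotation."""
        return [name for name, value in asdict(self).items() if not value]


record = MediaRecord(
    description="Title as given by the uploader",
    source="example-channel",
    original_url="https://example.org/video",
    media_type="video",
)
print(record.missing_fields())
```

Separating required fields (filled automatically at collection time) from optional ones (often added manually) makes it easy to list what still needs annotation.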
The processing pipeline also splits video files into keyframes and uses the machine learning software VFRAME.
VFRAME is a collection of open-source computer vision software tools designed for human rights investigations relying on large datasets of visual media. It utilises object detection algorithms that can automatically flag video content depicting predefined objects, such as cluster munitions.
Our data pipeline prepares visual content for initial verification. All possible additional tags and chain of custody information are recorded. This is done to assist users in identifying and understanding what happened in a specific incident, and when and where.
Verification consists of three steps: 1) Verify the source of the video uploader or publisher; 2) Verify the location where the video was filmed; 3) Verify the dates and times on which the video was filmed and uploaded.
Verify the source of the video publisher
Firstly we establish whether the source of the video is on our list of credible sources. If not, we determine the new source’s credibility by going through the above procedure.
In some cases, near-duplicate content may be published. For example, if a 10-minute video includes all of a second, 30-second video, both videos would be preserved as long as they can be verified. Similarly, videos from news organisations or media houses featuring parts of other videos are also preserved, as long as verification is possible. We also preserve duplications if they are from different sources and the original uploader is unidentifiable.
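The containment case described above (a short video appearing in full inside a longer one) can be sketched as a search for one sequence inside another. This is a simplified illustration, assuming each video has already been reduced to a list of per-frame fingerprints; the function name and the fingerprint values are hypothetical, and exact matching is used here where a real pipeline would likely need tolerant, perceptual comparison.

```python
def contains_clip(long_fingerprints, short_fingerprints):
    """Return the index where the short sequence starts inside the long one, else -1."""
    long_fp = list(long_fingerprints)
    short_fp = list(short_fingerprints)
    n, m = len(long_fp), len(short_fp)
    if m == 0 or m > n:
        return -1
    for start in range(n - m + 1):
        if long_fp[start:start + m] == short_fp:
            return start
    return -1


# Hypothetical fingerprints, one per sampled frame.
long_video = ["a1", "b2", "c3", "d4", "e5", "f6"]
short_video = ["c3", "d4", "e5"]

start = contains_clip(long_video, short_video)
if start >= 0:
    # Both items are kept; the relationship is only recorded.
    print(f"Short clip found at frame offset {start}")
```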
The video-upload source may differ from the camera operator. In most of the video footage which we verify, only the video uploader and not the camera operator can be identified. In advanced verification of priority cases, the analysis phase includes identifying the camera operator.
Verify the location where the video was filmed
Each video goes through basic geolocation to verify that it has been captured in Yemen. A more detailed geolocation process is implemented for priority content in order to pinpoint its origin to a more precise location. This is done by comparing visual references (e.g. buildings, mountain ranges, trees, minarets) with satellite imagery from Google Earth and Maxar, as well as geolocated photographs from Google Maps. Satellite imagery is also used to assess damage and destruction whilst investigating attacks targeting civilians and civilian infrastructure.
In addition to this, the Yemeni Archive compares the Arabic spoken in videos against known regional accents and dialects within Yemen to further verify the location of videos. When possible, we contact sources directly, and consult our existing network of journalists operating inside and outside Yemen to confirm the locations of specific incidents.
Verify the dates and times on which the video was filmed and uploaded
We use time and date metadata embedded in videos we directly receive in order to corroborate the date and time of a specific incident. Date and time are extracted using ExifTool (https://exiftool.org/).
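As an illustration of this step, the sketch below calls the exiftool command-line program with its -json option and parses the CreateDate and ModifyDate tags from the result. It assumes exiftool is installed and on the PATH; the function names are hypothetical, and which tags are present depends on the file format and recording device. Embedded dates can be missing, zeroed, or set incorrectly on the device, so they corroborate rather than prove a capture time.

```python
import json
import subprocess
from datetime import datetime


def run_exiftool(path):
    """Run the exiftool CLI on one file and return its JSON output as text."""
    result = subprocess.run(
        ["exiftool", "-json", "-CreateDate", "-ModifyDate", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_dates(exiftool_json):
    """Parse 'YYYY:MM:DD HH:MM:SS' values from exiftool's JSON output."""
    entry = json.loads(exiftool_json)[0]  # one object per input file
    dates = {}
    for tag in ("CreateDate", "ModifyDate"):
        raw = entry.get(tag)
        if not raw:
            continue
        try:
            # Ignore any sub-second or timezone suffix after the first 19 characters.
            dates[tag] = datetime.strptime(raw[:19], "%Y:%m:%d %H:%M:%S")
        except ValueError:
            pass  # e.g. zeroed dates such as "0000:00:00 00:00:00"
    return dates


# Usage, assuming a local file named video.mp4 exists:
# print(parse_dates(run_exiftool("video.mp4")))
```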
We verify the capture date of videos by cross-referencing their publishing date on social media platforms (e.g. YouTube, Twitter, Facebook, and Telegram) with dates from reports concerning the same incident. Visual content collected directly from sources is also cross-referenced with reports concerning the incident featured in the video.
Those reports include:
- Reports from international and local media outlets;
- Human rights reports published by international and local organisations, including Human Rights Watch, Amnesty International, Yemen Data Project, and Physicians for Human Rights;
- Incident reports shared by the Yemeni Archive’s network of citizen reporters on Twitter, Facebook, and Telegram.
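The cross-referencing described above can be sketched as a date comparison between a video's upload date and the incident dates given in reports. This is a minimal example under stated assumptions: the function name, the report structure, and the two-day tolerance window are all hypothetical choices for illustration, not the Yemeni Archive's actual criteria.

```python
from datetime import date, timedelta


def corroborating_reports(upload_date, reports, max_days_after=2):
    """Return reports whose incident date is on, or shortly before, the upload date."""
    window = timedelta(days=max_days_after)
    return [
        report for report in reports
        if timedelta(0) <= upload_date - report["incident_date"] <= window
    ]


# Hypothetical reports, for illustration only.
reports = [
    {"title": "Media report A", "incident_date": date(2018, 4, 1)},
    {"title": "NGO report B", "incident_date": date(2018, 3, 1)},
    {"title": "Network report C", "incident_date": date(2018, 4, 3)},
]

matches = corroborating_reports(date(2018, 4, 2), reports)
print([report["title"] for report in matches])
```

A video uploaded before the reported incident date cannot show that incident, which is why reports dated after the upload are excluded.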
Additional techniques such as reverse image search and chronolocation can also be used to confirm the capture time and date of the visual content.
Investigation and further analysis
In some cases, we conduct in-depth open-source investigations. Time and capacity limitations mean not all incidents can be analysed in-depth. However, by developing a replicable workflow, we hope that others can assist in these efforts and investigate other incidents using similar methods. A detailed overview of our in-depth incident analysis is provided on the investigations page of our website.
For some incidents, our team of researchers collect witness statements or partner with organisations that do. In the past, this has included working with Justice for Life, and other organisations whose role is collecting accounts of survivors, the injured, family members, or eyewitnesses (e.g. medical staff, managers of hospitals).
Once content has been processed, verified, and analysed, it is reviewed for accuracy. In the event of a discrepancy, content is fed back into the digital evidence workflow for further verification. If content is deemed accurate it moves to the publishing stage of the digital evidence workflow.
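The review step above, together with the three verification checks, can be summarised as a small routing rule. The sketch below is illustrative only; the class, field, and stage names are hypothetical labels for the stages described in the text.

```python
from dataclasses import dataclass


@dataclass
class VerificationStatus:
    source_verified: bool = False
    location_verified: bool = False
    date_verified: bool = False

    def is_complete(self):
        """True only when all three verification steps have passed."""
        return self.source_verified and self.location_verified and self.date_verified


def next_stage(status, review_found_discrepancy):
    """Route content back to verification or forward to publishing."""
    if not status.is_complete() or review_found_discrepancy:
        return "verification"
    return "publishing"


status = VerificationStatus(source_verified=True, location_verified=True, date_verified=True)
print(next_stage(status, review_found_discrepancy=False))
```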
For more information on the tools or methods we are using, please reach out to info [at] yemeniarchive [dot] org.