The Yemeni Archive identifies, collects, and preserves digital content in a variety of formats from a variety of sources and platforms. Content is stored in a centralised database and then accessed by our open-source tools to search, investigate, verify, categorise, and expand it. Rather than applying out-of-the-box technology solutions to the content challenges faced by the Yemeni Archive team, we take a user-centered approach to developing or contributing to custom, modular open-source solutions (e.g. Sugarcube), allowing our teams of investigators to work collaboratively on our data pipeline.
Sources and Collection
Our database collects data from a list of sources of several different types. We acquire data and posts from these sources daily. Source types include social media channels (Twitter, Facebook, YouTube), submitted files (videos, PDFs), and external and collaborators' data sets. Changes in these sources are tracked, meaning that all versions are saved.
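As an illustration only, change tracking of this kind could be sketched as follows. The names `content_fingerprint` and `record_version`, and the shape of the history entries, are hypothetical and not taken from our actual code; the sketch assumes a post is a plain JSON-serialisable dictionary.

```python
import hashlib
import json
from datetime import datetime, timezone


def content_fingerprint(post: dict) -> str:
    """Return a stable fingerprint of a post (key order does not matter)."""
    canonical = json.dumps(post, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def record_version(history: list, post: dict) -> bool:
    """Append a new version only when the content changed.

    Returns True if a version was saved, False if the post is unchanged.
    Earlier versions are never overwritten or removed.
    """
    fingerprint = content_fingerprint(post)
    if history and history[-1]["fingerprint"] == fingerprint:
        return False
    history.append({
        "fingerprint": fingerprint,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "content": dict(post),  # shallow copy, so later edits to `post` do not alter it
    })
    return True
```

Calling `record_version` on each daily fetch would leave one entry per distinct version of a post, in the order they were seen.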
Processing / Data Pipeline
Each unit of data in our database goes through our data pipeline. In this pipeline we detect the language, standardise the data format (while keeping the original format as well), and perform other transformations. We also screenshot and download the web page we received the information from.
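A rough sketch of the language-detection and standardisation steps might look like this. It is not our production pipeline: `detect_language` is a deliberately crude stand-in (it only checks for Arabic script) for a real language detector, and the field names `text`, `description`, `url`, `source_url`, and `original` are assumptions made for the example.

```python
def detect_language(text: str) -> str:
    """Crude stand-in for a language detector.

    Returns 'ar' if most letters fall in the Arabic Unicode block,
    'und' (undetermined) if there are no letters, and 'other' otherwise.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return "und"
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06ff")
    return "ar" if arabic / len(letters) > 0.5 else "other"


def standardise(raw: dict) -> dict:
    """Map a source-specific record onto common fields, keeping the original."""
    text = raw.get("text") or raw.get("description") or ""
    return {
        "text": text,
        "language": detect_language(text),
        "source_url": raw.get("url"),
        "original": raw,  # the old format is preserved alongside the new one
    }
```

Keeping the untouched record under `original` means a later transformation can always be re-run from the source data.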
Files in our database are hashed with sha256. These hashes are timestamped with Enigio Time, a third-party collaborator.
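For reference, computing a sha256 hash of a file can be done with Python's standard `hashlib` module. The helper name `sha256_of_file` and the chunk size are choices made for this sketch; the timestamping step with Enigio Time is not shown, since it depends on that service.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex sha256 digest of a file, read in chunks.

    Reading in chunks keeps memory use flat even for large video files.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with the same digest can be treated as identical, so a stored digest lets anyone check later that an archived file has not been altered.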
We use a mix of tools to verify our units. Because data is never lost, we can use a variety of tools to verify fields or edit new fields in the data, and then merge them back into our centralised database.
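One way to picture the merge step, as a sketch under assumptions rather than a description of our tooling: each tool returns the fields it verified or edited, and those are merged into a copy of the unit while any replaced values are kept. The function `merge_fields` and the `history` field are hypothetical names.

```python
def merge_fields(unit: dict, updates: dict, tool: str) -> dict:
    """Return a new unit with `updates` applied.

    Values that get replaced are recorded under 'history' together with
    the name of the tool that changed them, so nothing is lost.
    The input `unit` is not modified.
    """
    merged = dict(unit)
    history = list(unit.get("history", []))
    for field, value in updates.items():
        if field in unit and unit[field] != value:
            history.append({"field": field, "previous": unit[field], "tool": tool})
        merged[field] = value
    merged["history"] = history
    return merged
```

For example, `merge_fields({"id": "u1", "location": "unknown"}, {"location": "Sana'a"}, tool="geolocation")` returns a unit with the new location and a history entry holding the previous value.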
debian / nodejs / sugarcube / python / mongodb / nginx / react
This website is served as flat HTML files. The database interface is built with React and calls an API.
Contact Us to talk about how we could help you with your project.