The weaponization of web archives: Data craft and COVID-19 publics

An unprecedented volume of harmful health misinformation linked to the coronavirus pandemic has led to the appearance of misinformation tactics that leverage web archives in order to evade content moderation on social media platforms. Here we present newly identified manipulation techniques designed to maximize the value, longevity, and spread of harmful and non-factual content across social media using provenance information from web archives and social media analytics. After identifying conspiracy content that has been archived by human actors with the Wayback Machine, we report on user patterns of “screensampling,” where images of archived misinformation are spread via social platforms. We argue that archived web resources from the Internet Archive’s Wayback Machine and subsequent screenshots contribute to the COVID-19 “misinfodemic” in platforms. Understanding these manipulation tactics that use sources from web archives reveals something vexing about information practices during pandemics: the desire to access reliable information even after it has been moderated and fact-checked, for some individuals, will give health misinformation and conspiracy theories more traction because it has been labeled as specious content by platforms.

Using provenance information such as original context, technical specificities, and unique characteristics of online resources from web crawls, and social analytics data from the CrowdTangle API, we find that web archives like the Internet Archive's Wayback Machine are being weaponized to propagate and preserve health misinformation circulating on platforms like Facebook and Twitter.
Here we present two interconnected studies of data craft that leverage archived web resources to intentionally evade automated content moderation efforts and further propagate health misinformation on platforms attempting to combat it.
This research shows how archived URLs of health misinformation sources found in the Internet Archive's Wayback Machine, and screensampling practices of archived content, appear to be difficult for automated content moderation systems to identify and ban, and as a result circulate longer and spread further on platforms.
In order to understand these information practices that shape emerging COVID-19 publics convened by private platforms in the aftermath of the pandemic, misinformation researchers and platform operators need to carefully consider how the circulation of archival content from the web now appears in platforms and its status in the wake of the coronavirus.

Implications
For many years, computer security researchers and internet researchers have documented the various ways that web archives, such as the Internet Archive's Wayback Machine, have been hacked, hijacked, and misused (Caplan-Bricker, 2018; Littman, 2017). High-profile cases typically point to zombie content no longer published on the 'live' web, provenance laundering with customized hyperlinks or link shortening, backdating resources, or blocking bots from crawling web pages for search indexing (Madrigal, 2018; Nelson, 2018; Walden, 2012). The coronavirus infodemic (Zarocostas, 2020) has resulted in a slew of data craft techniques propagating health misinformation, which now include the use of web archives like the Wayback Machine. By data craft we mean "practices that create, rely on, or even play with the proliferation of data on social media by engaging with new computational and algorithmic mechanisms of organization and classification" (Acker, 2018). Here we also discuss screensampling, a data craft technique that extends the propagation of archived misinformation when social media users post screenshots of archived URLs, thus removing the ability to click or track these static images of archived online sources. Such data craft often allows misinformation and disinformation campaigns to go undetected, and proves particularly adept at avoiding the automated content moderation algorithms increasingly used to combat fake news and inauthentic behavior. One well-documented data craft technique is mimicking legitimacy by publishing fake content that appears to be credible information (Acker & Donovan, 2019). In their study investigating the misuses of web archives on social media, Zannettou et al. (2018) found that news articles and social media posts were the most common web resources to be saved in Archive.is and the Wayback Machine. They found that these kinds of URLs circulated among Reddit forums when the original web content was assumed to be controversial or ephemeral. The data craft reported here is designed to leverage the legitimacy of the Wayback Machine's archival infrastructure in order to deploy health misinformation into platforms by circumventing moderation efforts.
Platforms, in their algorithmic sorting and moderation, bring together new online publics, what Gillespie calls "calculated publics" (2014). Finn has shown in her work on information orders before and after disasters that private platforms like Facebook convene groups of people in novel ways (2018, p. 140). Finn argues that after disasters, like earthquakes and pandemics, platforms become public information infrastructures that shape and are shaped by new information practices. Coupled with automated recommendation algorithms and closely knit, targeted audiences, the subversive propagation of weaponized health misinformation now shapes the calculated publics of the pandemic.
How platform algorithms convene these COVID-19 publics is generally unknown; their mechanisms are "black boxed," providing outsiders with low visibility into their construction, development, and evaluation. Here we show how web archives are being used to mimic legitimacy and spread misinformation to COVID-19 publics through platforms like Facebook, which may in turn provide more insight into apprehending the power of algorithms to label and classify misinformation (Burrell, 2016).
Many content manipulators leverage the "context collapse" afforded in Facebook's newsfeed and Twitter's timeline to spread misinformation with free and fast online publishing tools. Because the newsfeed and timeline streams "flatten" all content into one feed or social awareness stream (Kivran-Swaine & Naaman, 2011), it can be hard to distinguish between vetted news articles, targeted advertisements, and other online content. Further, Facebook's mobile app modifies web articles into its instant article format, which the company describes as a "buttery smooth" native feature (Facebook, 2020), providing a legitimizing data craft to content that would otherwise be perceived as sketchy and unreliable if viewed outside of the platform, at the content's original URL. Before we explain how archived content can be weaponized in platforms, it is necessary to understand how the web is archived.
Web archives, such as the Internet Archive's Wayback Machine, come from the resource-intensive and purposeful digital preservation of digital material (Brügger, 2018). Web archiving methods and techniques are typically divided into two approaches: micro and macro efforts. Macro web archives are usually managed by large information institutions relying on web crawling, which involves generating "seed lists" and automating routine, repeated crawls to build robust, comprehensive snapshots of the quickly changing web. Web crawling techniques are the most time-consuming and resource-intensive because they aim to capture whole web pages and online resources by systematically "crawling" each and every embedded hyperlink in a website to capture each part of complex and layered web resources (Milligan, 2016). Crawls can have varying levels of automation, and seed lists are added frequently by web archivists to expand crawlers' reach. If web crawling is a macro technique that can be automated at scale, micro techniques are more targeted and less routine, such as API extraction, or focused on capturing dynamic features of the user interface with screencasts or screenshots. Micro web archiving projects are usually managed by individuals and small groups of researchers who want to capture particular slices of the web to illustrate an event, social movement, or emerging behavior. Web crawls are not limited to websites indexed by search engines; they also include individual web pages that users save through features like "Save Page Now" or, previously, the Alexa Toolbar (Rogers, 2017). Once a URL has been added to a seed list, it can also appear in many different collections and be captured by different automated crawlers. Our research has found that both macro tools and micro tools are overlapping to shape COVID-19 publics and spread misinformation across social platforms, which are increasingly used as public information infrastructures.
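The seed-list crawling described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Internet Archive's crawler: the toy "web" is a hypothetical in-memory dictionary, and `crawl`, `toy_fetch_links`, and `max_pages` are names introduced here for illustration only.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the hyperlinks embedded in that page.
TOY_WEB = {
    "https://example.org/": ["https://example.org/a", "https://example.org/b"],
    "https://example.org/a": ["https://example.org/b", "https://example.org/c"],
    "https://example.org/b": [],
    "https://example.org/c": ["https://example.org/"],
}


def toy_fetch_links(url):
    """Stand-in for fetching a page and extracting its embedded hyperlinks."""
    return TOY_WEB.get(url, [])


def crawl(seed_list, fetch_links, max_pages=100):
    """Breadth-first crawl starting from a seed list; returns URLs in capture order."""
    queue = deque(dict.fromkeys(seed_list))  # de-duplicate seeds, keep their order
    seen = set(queue)
    captured = []
    while queue and len(captured) < max_pages:
        url = queue.popleft()
        captured.append(url)  # a real crawler would store a snapshot here
        for link in fetch_links(url):
            if link not in seen:  # follow each embedded hyperlink only once
                seen.add(link)
                queue.append(link)
    return captured


order = crawl(["https://example.org/"], toy_fetch_links)
print(order)
```

Adding a URL to `seed_list` is the step that either an archivist or an individual user can take; everything after that is automated, which is why a single human-initiated save can feed later routine crawls.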
In their study of the Internet Archive's preservation of the North Korean Web, Ben-David and Amram found that knowledge generated from Wayback Machine web crawls comes from human and nonhuman actors, and "includ[ed] proactive human contributions, routine operated web crawls, as well as curated and appraised web crawls of collections," arguing that these archived snapshots are like other algorithmic black boxes (Ben-David & Amram, 2018, p. 195). Despite these routine web-wide crawls, individual human actors are strategically adding to the collections, with a variety of different intentions and memory practices. Many have argued that more studies of archivists' appraisal decisions and "web archival labour" should be conducted to understand the ways human and nonhuman actors impact the collections of resources that result in a history of the web (Ogden et al., 2017). Our investigation found that both humans and bots were archiving online misinformation, but that more individuals used Save Page Now to archive a resource after Facebook moderated and flagged the live URL as health misinformation.
In this study we sought to discover how online health misinformation is being archived and then weaponized using the data craft tactics of mimicking legitimacy with reliable URLs and a practice we call screensampling. In answering these research questions, we showed how data craft weaponizes web archives and impacts how platforms convene COVID-19 publics, contributing to the coronavirus misinfodemic. Here we argue that there is an opportunity for misinformation researchers to examine this relationship, between passive and active archival agents and their intentions to archive misinformation, as well as the status of weaponized web archives that evade content moderation and removal on platforms because of their trusted URLs.
These screenshots were taken of a conspiracy article, "CORONAVIRUS HOAX: Fake Virus Pandemic Fabricated to Cover-Up Global Outbreak of 5G Syndrome," which had been archived by the Internet Archive's Wayback Machine web crawlers on March 9, 2020 (The Millennium Report, 2020). The original article, which appeared on The Millennium Report website on March 2, 2020, was first crawled and preserved by the Wayback Machine on March 2, 2020 (hereafter we use "the original URL" and "the archived URL" to refer to these two sources). By examining the screenshots, we were able to locate the original URL as well as the archived URL hosted by the Internet Archive's Wayback Machine at web.archive.org. Using provenance information from the Internet Archive's multiple web crawlers, we found that individual human actors had archived and crawled the web page, which then seeded bots for automated routine crawls of the website.
By using these archived snapshots as a kind of proxy rather than the original URL, web.archive.org links can easily bypass existing content moderation systems used by platforms. As previously described by Donovan, the "hidden virality" of the article was not in its original URL form, but instead in the archived version stored in the Wayback Machine (Donovan, 2020). The Wayback Machine web archive allows for a public, relatively anonymous (with no profile or login necessary) means of spreading disinformation from the web and then hosting it, even when the original URL has been taken down or unpublished on the live web. This tactic of storing misinformation and highly ephemeral content enables manipulators to use the web archive as a distribution mechanism, allowing it to evade content moderation and live longer on platforms.
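To make the proxy effect concrete, the following minimal sketch shows why a check keyed to the link's own host would miss an archived copy, and how the original URL can be recovered from a Wayback Machine playback link of the form `https://web.archive.org/web/<timestamp>/<original URL>`. Several things here are assumptions for illustration: platforms' actual moderation systems are not public, so the host-based blocklist is a toy model; `misinfo.example`, the capture time, and all function names are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Playback links embed a capture timestamp (up to 14 digits), an optional
# two-letter modifier such as "id_", and then the original URL.
WAYBACK_RE = re.compile(
    r"^https?://web\.archive\.org/web/(\d{4,14})(?:[a-z]{2}_)?/(.+)$"
)


def unwrap_wayback(url):
    """Return (timestamp, original_url) for a Wayback playback link, else None."""
    match = WAYBACK_RE.match(url)
    if match is None:
        return None
    timestamp, original = match.groups()
    if not re.match(r"^https?://", original):
        original = "http://" + original  # some playback links omit the scheme
    return timestamp, original


def host_of(url):
    """Lower-cased host name of a URL ('' if it has none)."""
    return (urlsplit(url).hostname or "").lower()


def naive_is_blocked(url, blocked_hosts):
    """Toy moderation check that looks only at the link's own host."""
    return host_of(url) in blocked_hosts


def archive_aware_is_blocked(url, blocked_hosts):
    """Toy check that first unwraps an archived link to its original URL."""
    unwrapped = unwrap_wayback(url)
    target = unwrapped[1] if unwrapped else url
    return host_of(target) in blocked_hosts


BLOCKED = {"misinfo.example"}  # hypothetical flagged domain
original = "https://misinfo.example/hoax-article"
# Illustrative capture time; only the date format is meaningful here.
archived = "https://web.archive.org/web/20200309000000/" + original

print(naive_is_blocked(original, BLOCKED))          # flagged
print(naive_is_blocked(archived, BLOCKED))          # slips through
print(archive_aware_is_blocked(archived, BLOCKED))  # flagged after unwrapping
```

Unwrapping only helps with clickable links; it does nothing for screenshots, where the URL exists only as pixels.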
The hidden virality of Wayback archived URLs can be further compounded by the practice of screensampling, where digitally extracting an archived snapshot creates a new digital asset that can easily increase the spread of dubious content (as seen in Figure 1). Posting images of text allows human readers to view the content while bypassing content moderation mechanisms because image formats with text are not easily machine-readable. Screensampling excerpted sources from web archives allows users like @narvonocutz to recontextualize and propagate content from a trusted source (the Wayback Machine), while constructing a new post or thread of recontextualized content made of images that both evade moderation and obfuscate the archived URL by disabling the hyperlink and shortening the original URL. Abstracted from their original source, these screenshots sever attribution to the original URL, spreading the content in an untraceable manner and creating "memetic abstraction" (Chaiet, 2019). The archived URL of the Millennium Report's "CORONAVIRUS HOAX" article, as of this writing, has been captured by web crawls; the crawls through March 2020 follow a pacing that matches external fact checking and moderation of the original URL on Facebook. We have verified N=17 different kinds of web crawls that indicate a broad ecology of web archive agents, both non-human and human actors. These web crawls seed a number of specific collections at the Internet Archive, including collections of outlink URLs posted to Twitter, collections of Fake News, Archive-It partner collections that subscribe to Internet Archive's web [...] News II" web archivists direct the seeding of web crawls, but collections like "Live Web" proxy crawls are mostly fed by people using Save Page Now. Both broad and content-specific seeds play important roles in appraising what will (and won't) be accessible in the future web (Summers & Punzalan, 2017), as well as when individual human actors choose to use Save Page Now.
(Kertscher, 2020). Reporting had found that there was no credible evidence confirming the claims in the "CORONA HOAX" article. Shortly thereafter, Facebook began issuing warnings to users intending to share the original URL (Figure 3). On the same day, Wayback Machine web crawlers "LiveWeb" and "WebWideCrawl" began to archive the original URL for snapshots. Both collections are fed mostly by the Save Page Now feature, which only saves a single page (Internet Archive, 2018).
While the original URL had previously been crawled by automated collections crawlers from March 3 to March 8, it was only after the article had been fact-checked and flagged by Facebook that individual human agents began to proactively archive it with Save Page Now and Save Page Now proxies, as compared to the previously automated Wayback Machine web crawlers feeding collections like Twitter outlinks or Archive-It partners.
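The before-and-after comparison described here amounts to a simple tally of captures by period and by kind of archival agent. The sketch below is illustrative only: the capture records and the event date are made up and are not the study's data, and treating "LiveWeb" and "WebWideCrawl" as human-initiated simply follows the description above that both are fed mostly by Save Page Now. The name `tally_by_period` is hypothetical.

```python
from collections import Counter
from datetime import date

# Collections described above as fed mostly by the Save Page Now feature.
SAVE_PAGE_NOW_FED = {"LiveWeb", "WebWideCrawl"}


def tally_by_period(captures, event_date, human_fed=SAVE_PAGE_NOW_FED):
    """Count captures as (period, kind) pairs relative to a moderation event."""
    counts = Counter()
    for capture_date, collection in captures:
        period = "before" if capture_date < event_date else "on_or_after"
        kind = "human_initiated" if collection in human_fed else "automated"
        counts[(period, kind)] += 1
    return counts


# Entirely made-up records for illustration; NOT the study's data.
toy_captures = [
    (date(2020, 3, 3), "Twitter outlinks"),
    (date(2020, 3, 5), "Archive-It partner"),
    (date(2020, 3, 10), "LiveWeb"),
    (date(2020, 3, 10), "WebWideCrawl"),
    (date(2020, 3, 11), "LiveWeb"),
]
toy_event = date(2020, 3, 10)  # placeholder, not the actual flagging date

result = tally_by_period(toy_captures, toy_event)
for key, count in sorted(result.items()):
    print(key, count)
```

With real provenance records, a shift of counts from the automated bucket before the event to the human-initiated bucket afterward would correspond to the pattern reported in this section.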
The spread of the original URL and the archived URL on Facebook can also be compared using social analytics data from CrowdTangle, which is owned by Facebook. CrowdTangle provides public analytics for how far the links spread on Facebook in the beginning of March, and the number of total interactions with each post, indicating the popularity and reach of each URL. Table 1 shows that the archived URL circulated on Facebook outperformed the original URL in reach, engagement, views, and shares (Fraser, 2020). Although the original URL has now been fact-checked, flagged, and moderated by Facebook, users are still able to post the health misinformation today. However, the archived URL of the same misinformation has not yet been flagged or identified by the platform as violating platform policies. Weaponizing web archives and screensampling to evade misinformation moderation efforts are not only a data craft problem for platforms trying to detoxify their networks of disinformation and harmful misinformation, but also a new challenge for misinformation researchers tracking tactics and developing new methods for studying online behavior. As data craft, screensampling becomes another hurdle for users, researchers, and platforms trying to determine the original source of the content, and social analytics tools like CrowdTangle will never be able to quantify the number of users who screenshot an article and then post portions of it with their own framing commentary. As researchers confront the pandemic, there have been many calls for closer attention to information practices, digital archiving efforts, data management, and the importance of preserving this moment (Xie et al., 2020).
Here we challenge our research communities to consider the information practices of emerging COVID-19 publics and the circulation of misinformation stored in web archives because it reveals both a mistrust and awareness of platforms' current automated moderation and fact-checking efforts. As scholars of misinformation continue to examine the information practices found in private platforms that convene COVID-19 publics, we need to expand our scope to consider the circulation of dubious information found in web archives and examine their status as they become weaponized in platforms with data craft.

Methods
Following on Ben-David and Amram's innovative method of collecting provenance information from IAWM web crawls (2018), we used forensic analysis to learn when human agents directed web crawlers to archive health misinformation related to the 5G coronavirus conspiracy (as compared to automated seed lists and bots that archive the web). Web crawl provenance information provided readily available data and descriptive metadata for us to analyze. Once we identified the original URL from screenshots posted on Twitter, we scraped the provenance information from web crawls of the Millennium Report's archived URL (beginning March 3, 2020 and ending May 17, 2020). Then we compared archived snapshots to the original URL using CrowdTangle, Facebook's social analytics dashboard, to measure the engagement with the original URL and the web archived URL and to compare their spread and reception. By using CrowdTangle analytics to parse the engagement data and confirm greater spread of the archived URL than the original URL, we were able to identify the hidden virality of a web archive URL that evades platforms' swift, automated moderation of harmful misinformation because it is hosted by a trusted web archive domain. In observing screensampling methods of archived URLs circulated amongst COVID-19 publics on platforms, we see that the same web archives may be appropriated for different uses to increase doubt and spread dangerous and unreliable health misinformation.
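As a rough illustration of how capture records for a URL and date range can be gathered, the sketch below builds a query for the Wayback Machine's public CDX server and parses its JSON output, which arrives as a header row followed by data rows. This is an assumption-laden sketch rather than the scraping procedure used in this study: the target URL `misinfo.example/hoax-article` and the sample response are hypothetical, no network request is made, and, as far as we are aware, CDX records do not name the collection that fed each capture, so collection-level provenance may need to be gathered by other means.

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, start, end):
    """Build a CDX query URL for captures of `url` between two YYYYMMDD dates."""
    params = {"url": url, "from": start, "to": end, "output": "json"}
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_cdx_json(text):
    """Turn CDX JSON output (header row + data rows) into a list of dicts."""
    rows = json.loads(text)
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


# Date range taken from the collection window stated above; the URL is hypothetical.
query = build_cdx_query("misinfo.example/hoax-article", "20200303", "20200517")
print(query)

# Hand-written sample in the CDX JSON layout; NOT real capture data.
sample_response = json.dumps([
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["example,misinfo)/hoax-article", "20200303120000",
     "https://misinfo.example/hoax-article", "text/html", "200", "AAAA", "1234"],
])
captures = parse_cdx_json(sample_response)
print(captures[0]["timestamp"], captures[0]["original"])
```

The query string can be passed to any HTTP client; parsing is kept separate so that it can be checked without network access.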
Screensampling creates a memetic abstraction from the original source by converting a web resource into an image, resulting in a transmedia transformation of the existing content. In addition to converting text into a rasterized image, the screenshot may simultaneously encapsulate more layers of contextual information for researchers to examine, such as the device's mobile network, timestamps, domain names, and other revealing diegetic user interface elements (Chaiet, 2019). While such identifying features are akin to traditional metadata found in digital formats, they are captured in the amber of an image instead of the object's metadata that could otherwise be extracted programmatically. Screensamples (like Figure 1) are screenshots of text, so subsequent contextual "metadata" are not machine-readable, yet these image-based messages are human-readable and can subvert text-based content moderation systems.
Like other researchers who have studied web archivists and their crawling decisions (Maemura et al., 2018; Ogden et al., 2017), we find that individual human contributions played a role in the spread of this misinformation on platforms like Facebook and Twitter, as well as in its appearance in a number