>> Started to step in and -- Okay, recording has started. Sorry about that. I should've pressed that record button. So, these groups, or these organizations, were all NDIIPP partners. NDIIPP is the National Digital Information Infrastructure and Preservation Program. That's a big acronym, and it's the precursor to the National Digital Stewardship Alliance, or NDSA, which is a consortium of about 165 or so partnering organizations: universities, professional associations, government agencies, nonprofits, etcetera, who are all committed to the long-term preservation of digital information. So all of these groups were already doing web archiving of the dot gov domain at various levels and scopes. As you can see here, the Library of Congress was harvesting election material and other materials in the dot gov domain. The Government Printing Office was harvesting agency websites and ephemeral documents. The National Archives was actually doing web harvesting, but they were focusing on congressional committees and Congress, doing a web harvest every two years at the end of each congressional session, which of course lasts two years. The Internet Archive was doing global, national domain and curated crawls of all sorts. And executive agencies were also using Archive-It to preserve their own domains. On the academic side, the University of North Texas and Stanford and others were doing topical and targeted [inaudible] collecting. So as you can see, there was already a lot of activity going on in this space by lots of institutions. But by coming together, we felt we could build a community of experts and interest around the effort of crawling the dot gov and dot mil domains, build on all of our collections and leverage economies of scale. So we did have some goals that were set up at the very beginning, and those goals continue to drive the End of Term project. We've done an End of Term in 2008, 2012 and 2016. And we seek to work collaboratively to preserve public US government websites, to document the federal agencies' presence on the web during the end of presidential terms when the dot gov web space is particularly volatile, to enhance the existing research collections of all of the partner institutions, to raise awareness about the need for preservation, and to engage with researchers, subject experts and the public to move the project forward. Excuse me, I have a bit of a cold today. I hope I don't sound too husky. So in the many-hands-make-light-work model, all of the participating organizations have taken on various pieces of the work at hand. The main actual crawling organizations are the Internet Archive, the Library of Congress and the University of North Texas, as in 2008 and 2012, as well as George Washington University this time around, which is harvesting social media for us. But there's also preservation, data transfers, metadata creation and indexing, and a web portal, which the California Digital Library is hosting and which includes the indexing of all the metadata. There's nominations, URL or seed nominations we call them, which GPO is doing in terms of outreach to their community, the FDLP community, the Federal Depository Library Program community. And all of us are really doing seed contributions, outreach, project management, education, coordination and the whole lot. It's really only eight to ten of us that are working on this project, and so it takes a lot of our time to do this. 
At first we were almost primarily focused on collecting and preserving the dot gov content. But over time, over the last iterations of the crawl, we've also worked to expand access, first through the web portal that CDL is hosting, but then also by expanding the ways that researchers can use the archive. And Jefferson will talk about that in a few minutes. So, there's the technical work of harvesting and metadata and indexing. I saw in the chat that someone really loves metadata, so there's definitely space for you here. But there's also much outreach and coordination of volunteer nominators and nomination events and journalists and public inquiries. The number of volunteers has grown especially this year with the perceived threat of the loss of government information and data, especially climate data, and extreme public interest in preserving and assuring access to government information. In some ways, it's never been a better time to be a government information librarian. This slide is loading. So we have a ton of funding -- actually no. There's no outside funding or grants for this project, at least that I know of. It's been paid for by the organizations: the institutional resources, the web archiving programs and the people participating in this project. It's pretty much a sweat equity kind of project. And it's done well in that manner. Not to say that a big grant wouldn't be welcomed, of course, but I think it's highly commendable that the people and the organizations working on this project are doing so because they understand the importance of digital preservation and public access to government information in all of its guises. I'm not seeing the arrow up here. Hold on one second. Yeah, here we go. So, defining exactly what is the government web presence has been one of the biggest challenges for the project. The government domain ebbs and flows, contracts and expands, mostly expands, and goes beyond the dot gov and dot mil domains. In 2008, we started with the data set from NARA's 2004 crawl, and we created and collected other seed URLs, for example from a Stanford web-based project, and have grown that list over the three crawl cycles. At first, even the US government didn't have a comprehensive list of top-level domains and subdomains. This time around, it's been a bit easier to get bulk lists from the government itself, as the government transparency efforts of the Obama Administration have spurred the federal government to make that information publicly available. So we've gotten lists from data.gov and USA.gov and 18F, which is the internet services group from the General Services Administration, or GSA. And this year we've also received for the first time a bulk seed list from Google, perhaps the largest web crawler out there. So we start with these bulk lists of domains. But then we also rely on volunteer subject experts, and this year especially, other grassroots efforts that can help us identify specific seeds that are important to capture but that may not be in the top-level-domain bulk lists. It's important to get nominations on top of having bulk lists to assure the most thorough crawl possible. And starting with some very positive press in the New York Times in late November 2016, our efforts at community engagement have really taken off. A special [inaudible] to Professor Debbie Rabina at Pratt Institute's library school. 
She was a very active early supporter of the End of Term project and had a library class in 2012 and again in 2016 which helped identify social media and other seeds. And she's also put together events of subject experts to do seed nominations for us. Other events and efforts include the Data Refuge project started by the University of Pennsylvania Libraries and the University of Toronto's guerrilla archiving efforts. Other libraries and organizations across the country have hosted hack-a-thons and nominate-a-thons, or whatever we're calling them these days, and are both collecting their own data (for example, Climate Mirror, which I didn't put on the slide, but you can look them up at climatemirror.org) as well as feeding seeds and data into the End of Term crawl or End of Term archive. These efforts are not wastefully duplicative, but instead have helped us all learn and make for a more thorough web crawl. We've gotten a lot of press, as I said, over the last several months. You may have seen the Wired articles and the New York Review of Books and several others. As for the police presence outside, it's not me, I swear. So, as I said, we've gotten some really good press in the last few months, which has sustained the interest in the project and in preserving born-digital gov info and data. This has been both a blessing and a bit of a curse, and I say that only slightly facetiously. It has definitely made outreach easier, and it has made arguing for participation in the End of Term crawl easier for those of us who are participating with our own administrations. And we're thankful for all the public interest and the public events and the positive energy. But the amount of journalistic interest has meant spending a lot more time, for all of us volunteers, explaining the project and the ins and outs of web archives to reporters. So our [inaudible] means of organizing volunteers has been through the University of North Texas' nomination tool. And some of you may have seen this if you've been tracking the End of Term over the last six or eight months. In 2008, we had 457 seeds nominated from 26 nominators. In 2012, the number of nominators slightly grew and the number of seeds grew a lot, to 1476. And then in 2016, the year with all the public interest and the public angst, we counted many more nominations and nominators. So we've had over 15,000 seeds nominated from over 400 nominators just through the University of North Texas' nomination form. And we've also had over a hundred thousand seeds nominated from events and tools hosted and created by Data Rescue, as well as the Environmental Data and Governance Initiative, or EDGI. Both of these groups ran lots of public events. So this has been truly phenomenal and has helped spur efforts by a group of libraries coalescing around the website called Libraries.Network to collaboratively work on the preservation of digital government information going forward, and not just at the end of each presidential term. This is Jefferson's favorite slide, and I'm sorry that I stole it from you. The dot gov sites proliferate of course just like invasive species. And yes, there once was an invasivespecies.gov website, which went dark, but the Internet Archive has it in the Wayback Machine. So government websites are ephemeral. And in many cases, it can be difficult to gather them as they expand and change over time, and as social media and other kinds of web apps and tools take hold. 
There's dot gov and dot mil of course, as I said. But there's also a lot of government information on the dot com domain. For example, all of the Armed Forces have dot com websites. And the dot org domain, which includes many commissions and the Federal Reserve at federalreserve.org, as well as YouTube, Twitter, Facebook and other social media. So we've had to rely heavily on nominations from subject experts in these areas. Sometimes the bulk seeds that we get can get to these sites, with crawlers following links as they've been [inaudible]. But it's always great to have the community help identify seeds. There are of course issues with web crawling as a viable option for collection and preservation. Nobody is saying that web crawling or web harvesting is a perfect tool, but it is what we have. So, some sites of course are not publicly accessible or are unlisted. Some are only found through the dogged efforts of subject experts who know where to look and know how to identify them. This is especially true for things like data sets underlying mapping and other web apps. There's some content we're unable to archive, like sites behind passwords and databases that need to be queried by a human, though the Internet Archive has some cool workarounds to this problem. The proliferation or invasiveness of government websites continues almost unabated, meaning that overall, the dot gov space continues to expand much like a supernova, though I hope that metaphor doesn't prove predictive. There was an effort by the Obama Administration's Office of Digital Strategy back in 2011 to tighten up the dot gov domain and curtail what was seen as "duplicate and unnecessary websites that waste money." That website too was lost to the sands of internet time, which according to the Internet Archive is about 100 days. It might be a little bit more or a little bit less; I'm not sure if Jefferson can corroborate that number. So all that said, we're trying to make sure we're as thorough as possible, but of course, this is a tricky and imperfect business. The End of Term archive, meaning the 2008 and the 2012 archives, is currently available through the California Digital Library's hosted portal at eotarchive.cdlib.org. And the 2016 crawl data will soon be available there. We stopped crawling at the end of March, and so we're processing and creating metadata, and all of that will soon be available. There will be at least three caches of the content, at the Internet Archive, the Library of Congress and UNT. And we're looking into other ways to make this content accessible, of course. And with growing scholarly interest in corpus and network analysis, bulk data and distant reading, these archives are ripe for reuse and bulk access. We're already getting questions from researchers at Stanford wanting bulk data, so we're hoping to connect them with the Internet Archive. There are also some affiliated efforts that parallel the End of Term crawl, so it's not just that we're crawling once every four years. There are other targeted efforts. For example, the Internet Archive has an election archive and a White House and social media archive. The Library of Congress has a campaign archive, and there are several others, mostly having to do with elections and social media. President Trump's Twitter feed is perhaps the single most archived Twitter feed of all time. Part of it is editorialization, but few can resist rubbernecking [inaudible]. And with that, I think I'll hand it over to Jefferson. 
If, Jonathan, you can switch the moderator role to let Jefferson control those slides, I'll give it away to Jefferson. >> Sure, it'll be just a quick second. >> Thanks, James. And my thanks as well to the SJSU crew and Jonathan especially for helping get us set up and coordinating. >> Alright, Mr. Bailey, you should be able to do it now. If you want to move your cursor around, it should be working. >> Let's see if I can get to the next slide. I like the Blackboard Collaborate [inaudible] dialogue about bandwidth speed. It's very -- it's just a technical choice. Cool, so thanks everyone. Thanks again for having us and letting us talk about the End of Term. So this has been a pretty awesome year for government web and data archiving, as James sort of gave the background on. I'll talk a little bit about some of the research use that we've done. So this is the third year; as James mentioned, 2008, 2012 and then now this year. So the prior years gave us a little opportunity to do some comparative analysis, and we're just starting to get going on that for the current collection. But continuing the slides from before, James mentioned a number of affiliated efforts around this sort of outreach, getting students in classrooms involved. I think somebody put it in the chat, but there's a big effort that UNT is leading this year to basically crowdsource some of the metadata for the PDFs that we're collecting as part of the crawling, which are going to number easily in the millions and probably even more than that. So, you know, when you get a PDF from a website, it doesn't necessarily come with a whole lot of metadata. There might be some embedded in the PDF, but there's certainly no other information that's easily captured except for the URL it was on. So, identifying government documents that have been collected via web crawling is a really interesting challenge that I think lends itself well to crowdsourcing, as that's worked in many other descriptive application contexts. So, the other affiliated efforts: as mentioned, the crowdsourcing covers not just the collecting, with people nominating websites, but also helping describe the content and analyze it and figure out what we have. We were also lucky enough to work directly with the White House this year, and they reached out to us because the Obama Administration was very cognizant of the openness and public domain nature of all the content that they had created, and they wanted to see it in as many places as possible and being used by as many people as possible. So they gave it to us at the Internet Archive and it's just going into the End of Term collection, so it's available to anyone for download. It's totally public and downloadable. They gave us [inaudible] of Twitter and Vine and Tumblr and I think one or two other social media channels, as well as giving us basically a list of the whole WhiteHouse.gov architecture so that we could get a very good snapshot of it. So, that was cool, and I think that sort of built off of, you know, not just open access initiatives, but also the recognition that gov data and gov web data changes a lot, just because administrations change and administrators and agencies change and politicians change and things like that. So that was really cool that that happened this year, and it is part of the collection as well. So how big are some of the End of Term collections? You'll see in 2008, we had a not very large, what we call, seed list. 
A seed is basically just a website URL where a crawler starts to crawl. So it sort of tells the crawlers where to start, and then they go out from there. So we collected around 102 million URLs, or 160 million across the Library of Congress, the Internet Archive and UNT. Sometimes we focus on different parts of the dot gov and dot mil -- dot mil is military -- and so some people will focus on specific agencies or specific parts of the federal government web to collect them more intensively and more deeply than they might if they tried to get everything. It was about 18 terabytes in total, with some, you know, duplication. In 2012, the seed list was almost twice as big and we got about the same number of URLs. That was mostly because it was being deduplicated, which means content that did not change we did not archive a second time; we basically just make a note that it hasn't changed and there you go. It was a little bit bigger, but generally not all that much bigger. So there were fewer URLs but a similar data size. The reason is that, you know, the web has become much more of a multimedia platform, not just text on a webpage; there's audio and video and animations, and those tend to be quite large files. And that sort of makes the size similar even with many fewer actual individual webpages. So that's a [inaudible] review of what the first two crawls look like. And this was mostly the work of Mark Phillips, who is, I think, the director of digital libraries at the University of North Texas. I might have gotten his title wrong; maybe he's just with the university librarian. But he does amazing work at UNT on digital library stuff. And so he had an IMLS grant to do an analysis of these two crawls to try to get some comparison across the three institutions that are doing a lot of the actual, you know, going out and grabbing and archiving -- there are others contributing in other ways too -- but this was the interesting comparative point, because we're mostly all using the same tools as far as crawling and indexing and Wayback and stuff like that. So what Mark did was basically, you know, take big index files of these first two End of Term crawls and compare them to see what individual institutions were getting differently from each other and what the shape of the overall collections looks like. So you'll see on the right some pie charts around what each institution was collecting that was unique just to their archiving efforts and what was shared. And so you see, even though in many cases we were using some of the same seed lists and focusing on the same websites, the uniqueness of our individual archived URLs was quite high. So that's very cool. And then on the left, a super interesting statistic, which is comparing the PDFs that were discovered in 2012 with the ones that were discovered in 2008, and how many were gone and how many were still online. And, you know, it's a pretty shocking percentage of PDFs. You can see the little bits: three million missing and 775,000 found. So that's a pretty shocking metric of change across only four years, especially for PDFs, which tend to not necessarily be as endangered as, like, a webpage because they're less likely to change. So, that's a super fascinating number. He did some other kinds of bar charts to look at the sort of crawling schedule. 
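To make that index-comparison idea concrete, here is a minimal sketch of the kind of set arithmetic involved, assuming plain-text URL lists exported from each crawl's index. The file names and the one-URL-per-line format are illustrative assumptions, not the project's actual pipeline.

```python
# Minimal sketch: compare URL lists exported from two crawl indexes to find
# what is shared, what is unique to each crawl, and which PDFs disappeared.
# File names and the one-URL-per-line format are assumptions for illustration.

def load_urls(path):
    """Read a plain-text export of archived URLs, one per line."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

eot_2008 = load_urls("eot2008_urls.txt")   # hypothetical export
eot_2012 = load_urls("eot2012_urls.txt")   # hypothetical export

shared = eot_2008 & eot_2012
unique_2008 = eot_2008 - eot_2012
unique_2012 = eot_2012 - eot_2008

# Rough proxy for the PDF comparison: which PDF URLs seen in 2008
# were no longer captured in 2012.
pdfs_2008 = {u for u in eot_2008 if u.lower().endswith(".pdf")}
pdfs_missing_2012 = pdfs_2008 - eot_2012

print(f"shared URLs:                {len(shared):,}")
print(f"unique to 2008:             {len(unique_2008):,}")
print(f"unique to 2012:             {len(unique_2012):,}")
print(f"2008 PDFs not seen in 2012: {len(pdfs_missing_2012):,}")
```

The same set operations work for comparing what each partner institution captured uniquely versus what was shared, which is essentially what the pie charts on the slide summarize.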
And the End of Term essentially aims to archive for about three months before the inauguration and then three months after, so that we get a good snapshot on either side of a presidential change, which is when the government web changes the most. And at the agency level, obviously WhiteHouse.gov generally changes entirely if the president changes, among other aspects. So, this is sort of looking at a very intensive crawling [inaudible]. The dates are a little blurry, but the huge green spike is basically the inauguration, mid-January. So you can see we ramp up the crawling around then, as far as the URLs that we're trying to capture right before they might change. We'll make the slides available so you can get a lot of these. A TLD is a top-level domain. That basically means, like, WhiteHouse.gov: whitehouse is the domain name, and it's called top-level because all the pages are below that domain name. You can think of it as just sort of a website homepage URL. And so we look at TLDs a lot to see what's the extent of the government web, at least as far as [inaudible] domain names. It won't necessarily account for the size or the number of total pages, but it gives, you know, a good sense of how namespaces change, and that can be indicative of, you know, content change as well. And so the number of unique TLDs is actually pretty similar, and the number of common TLDs remains pretty static as well. If we look at the unique domain names, that can include subdomains and basically anything to the left of dot gov. You'll see quite a growth in unique domain names over the four years. So that's quite an interesting stat, in that people started making more elaborate and descriptive domain-level names, which just means that, you know, the complexity of their websites was changing and [inaudible] was increasing without them necessarily having to register full domains. Subdomains, as I mentioned, show similar statistics -- these are mostly showing the expansiveness of the gov web over time just through the registration of domain names. So it gets interesting when we start to look at the ones that actually changed the most. So com, gov, mil, net, edu, int, some of these others -- you know, dot ly of course is going to be URL shorteners; bit.ly and other dot ly domains were early web domains just for URL shorteners. So you see, hey, wow, there are a lot more URL shorteners. In 2012 they had become quite common between the two crawls, and that's recognized here. Dot me was probably not used all that much early on, in 2008, [inaudible] government web. So, let's see. Anyway, that's TLD change, [inaudible] we'll keep going. There's dot ly and dot me; dot me might not have even existed in 2008, I'm not really sure. But you can sort of start to do this comparative analysis that allows you to see how the web has changed, sort of in a holistic view. Some other interesting ones are when you get to sort of more specific sites. You can see this is, I think, the number of URLs that were archived. And so when you look at ones like house.gov and SIMA [phonetic] dot gov, those are kind of the obvious ones, but some of these sites got much bigger between 2008 and 2012, especially if they are publicly focused on delivering, you know, content and news and serving their constituents. So those were some other interesting ones. What are some that disappeared between our first two End of Term archiving efforts? Geodata.gov, and you can look some of these up. 
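To illustrate the sort of domain-level counting being described, here is a minimal sketch that reduces archived URLs to their dot gov domains, counts unique domains and host names per crawl, and tallies suffixes so shifts like the rise of URL shorteners show up. It reuses the same hypothetical URL-list files as the previous sketch, and real crawl indexes would need more careful host parsing.

```python
# Minimal sketch of domain-level analysis: pull the host out of each archived
# URL (assumed to include a scheme), reduce it to its registered dot gov
# domain, and count unique domains, host names, and suffixes per crawl.
from urllib.parse import urlparse
from collections import Counter

def hosts(path):
    """Yield lowercase host names from a one-URL-per-line file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            host = urlparse(line.strip()).netloc.lower()
            if host:
                yield host

def registered_gov(host):
    """Reduce e.g. 'www.data.epa.gov' to 'epa.gov'; None if not dot gov."""
    parts = host.split(".")
    if len(parts) >= 2 and parts[-1] == "gov":
        return ".".join(parts[-2:])
    return None

for year, path in [("2008", "eot2008_urls.txt"), ("2012", "eot2012_urls.txt")]:
    all_hosts = set(hosts(path))
    gov_domains = {d for d in (registered_gov(h) for h in all_hosts) if d}
    print(f"{year}: {len(gov_domains):,} unique dot gov domains, "
          f"{len(all_hosts):,} unique host names (incl. subdomains)")

# Counting by suffix shows shifts like the growth of URL shorteners (.ly, .me).
suffix_counts = Counter(h.rsplit(".", 1)[-1] for h in hosts("eot2012_urls.txt"))
print(suffix_counts.most_common(10))
```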
We'll share the slides, and you can look these up in the beta Wayback Machine or whatnot, since they'll be in 2008 but not in 2012. And you can see the URL counts sort of indicate the relative size. So, you know, geodata.gov was actually quite a big website to entirely disappear over the course of four years. Some of these are more specialized, or, you know, may be more specific government initiatives that only had a limited timespan to begin with. But in other cases, the content may have been subsumed into larger websites. DARPA obviously is the Defense Advanced Research Projects Agency; it probably got folded into a dot mil domain or became a subdomain of something else. So, you get this sort of disappearance and movement of content around the government [inaudible]. Show you the arrow. Not sure how to advance the slide if I can't see the arrow. There it is. Okay, good stuff. Okay, so that was mostly Mark Phillips doing his own research on 2008 and 2012. And I thought I would talk about some efforts to give data mining and computer scientists and other researchers big chunks of dot gov web archive data for them to work with. So one effort was working with political scientists at the University of Washington, social scientists at Rutgers, and information scientists at a research institute in Germany called L3S. And these people were interested in, like, all of the dot gov web archives, pretty much as much as they could get, because they are running huge high-performance computing networks where they're looking at millions and billions of things at one time and very big processing jobs. So this was a sort of research and development effort into how we can support data mining projects that are interested in doing analysis similar to Mark's, but on the content itself and not just the index. So this was the sort of dot gov project, and we took a hundred terabytes of dot gov, which included all the End of Term as well as some additional stuff from the Internet Archive's [inaudible] archives, and basically put it in a third-party platform called Altiscale and said, hey, come and use the data, all of you people who are interested in data mining stuff. And pretty much nobody showed up. So there were one or two projects that came out of it, but for the level of effort involved in moving the data from the Internet Archive to Altiscale, getting people user accounts, and, you know, helping manage their processing efforts, it turned out to have many hurdles around user management and people just not necessarily being able to deal with this big tranche of data. So there were challenges, because a lot of them weren't necessarily familiar with web archives or how crawling worked or, you know, the fact that there wasn't really any metadata except for what is in the website itself. So, it exposed a lot of interesting challenges that I think the web archiving community is still grappling with. But it is a big slice of the government data that is available for people to do computational research on. Another effort we tried: if a hundred terabytes is too much data for you, how can we make data sets out of this collection that are smaller and easier for people to work with and might have a more understandable shape and content? And so we've been trying to make web archive data sets, which are, like, not the whole web page. It might just be all the page titles or maybe all the links on a page. 
Or other sorts of metadata or data elements within a webpage that can be extracted and put in a data set and then given to [inaudible]. So, you know, you can have metadata like the links or the page title. You can have all the people names or place names in a whole website extracted, with a timestamp and all these other characteristics. I won't go into too many details, but it's another attempt to get researchers the data that's in this awesome collection, but in a way that might be easier for them to use. And doing this work with a number of these institutions making data sets has actually led to some pretty fantastic community events focused on helping researchers use web archives. So there have been a number of hack-a-thons at the Library of Congress. We had one here at the Internet Archive. There's the Archives Unleashed project, which basically brings researchers together for two or three days, gives them government data, and they play around with it and get a little expertise and handholding. They've done similar things in Europe and in Canada as well. And there's an online workshop. It's oriented towards the semi-technically proficient as far as the ability to run scripts and stuff, but if you are interested or have some fluency with the command line terminal or things like that, then you might want to check out that research workshop. But basically dot gov has been a good entry point for researchers that are interested in using web archives but don't really know where to start as far as the content, because we have so much web archive material across all our web archiving institutions. Dot gov sort of makes sense for people, and also of course it's public domain, so it's very easy to just have people use it without getting into too many challenges. If folks don't know about the new beta Wayback Machine, it is basically already out in production, now at web-beta.archive.org, and we have built a whole portal. It's still in testing, but it will be out soon. It's the new Wayback Machine, which has keyword search and some metadata search, and we're doing that on web archives. So if you're still just interested in the traditional model of looking at an old website, we should have some search capabilities in the Wayback Machine for that, just for the dot gov stuff. And we're also doing similar kinds of stuff with the statistics. You can see the screenshot on the left, which is just sort of the raw information, and on the right, and this is part of the new beta Wayback Machine, you can look at the summary stats for a whole host or domain. So here justice.gov is the DOJ, and you can, you know, search on specific types of content, how many PDFs they have, how many image files, and get, you know, a little charting. You can tell it what year to look at and stuff like that. So there are other ways to [inaudible] replay the webpage, and you can also interact with some of the high-level stats. So it's just yet another effort to try to build interesting new access points and ways to use the [inaudible]. And then we've also extracted every PowerPoint and PDF and made full-text search on it. Some of these aren't quite out yet, but they'll be in a blog post in the next month or two. But you should basically be able to put in keywords and say something like MIME type PDF and the year: get me every PDF that has access [phonetic] in the name from a specific year, and that should go across the whole collection. 
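As a concrete illustration of those derived data sets, here is a minimal sketch that pulls one element, the page title, out of an archived WARC file using the open-source warcio library (pip install warcio). The input file name is hypothetical, and a real extraction job would of course run over many WARC files at much larger scale.

```python
# Minimal sketch: derive a small data set (capture time, URL, page title)
# from one WARC file using warcio. The file name is a placeholder.
import csv
import re
from warcio.archiveiterator import ArchiveIterator

title_re = re.compile(rb"<title[^>]*>(.*?)</title>", re.I | re.S)

with open("EOT-2016-sample.warc.gz", "rb") as stream, \
     open("page_titles.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["capture_time", "url", "title"])
    for record in ArchiveIterator(stream):
        # Only look at HTML responses; skip requests, metadata, DNS, etc.
        if record.rec_type != "response":
            continue
        headers = record.http_headers
        ctype = headers.get_header("Content-Type") if headers else None
        if not ctype or "html" not in ctype.lower():
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        ts = record.rec_headers.get_header("WARC-Date")
        body = record.content_stream().read()
        m = title_re.search(body)
        title = m.group(1).decode("utf-8", "replace").strip() if m else ""
        writer.writerow([ts, url, title])
```

The same pattern extends to extracting links, named entities, or other per-page elements into researcher-friendly tables instead of handing over raw WARCs.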
So that kind of MIME-type query would be an interesting way to look at specific types of media, not just the webpages themselves. Using that, we sometimes do special little collections. So we extracted every PowerPoint from dot mil, the whole military/government web, all public, all on the public web. And you can click and, you know, get them on the regular [inaudible]. So we just took every single one we've ever seen and put them into a special collection, turned the PowerPoints into PDFs and then OCR'd them so that you can search the text of the PDF, and of course you get the awesome crazy graphics of military PowerPoint presentations, which are insane. Okay, so that was a little context on research support and the comparative analysis of the prior [inaudible]. And as James mentioned, we just kind of wrapped up the post-inauguration bookend part of the crawl for this year, basically the end of March, middle of April, so maybe a month ago. And we are in the process of indexing and processing and doing some of the backend work that's necessary before it goes up online. This year was exciting because not only were there more public events, as James mentioned, and a crazy amount more nominations, but we also had a lot more partners directly participating in the project. So not just the core members, but, you know, George Washington University collects a lot of social media, so they were very focused on that. The Government Publishing Office actually has a very large web archiving program, so they were very involved this year in helping do QA and find websites and stuff like that, and others across the federal government, and Stanford and James too, so a lot more partners. So that was exciting. You know, here's kind of what the overall timeline looks like when it happens. As I mentioned, we basically start crawling [inaudible] and do it three months before the election and three months after. So, you know, we mostly finished everything in March, and we are probably going to continue a little bit, at least here at IA, because there's been so much interest and we've gotten so many nominations of websites, and they keep coming in, and so we didn't want to just end the whole project. So we are continuing to take seed nominations into government web data archiving projects. It's not End of Term, because we want End of Term to have a start and an end so that, you know, it's comparable to prior years and is a sort of organic whole. But that doesn't keep us from continuing to archive the government web. Our strategies -- this is an obvious one. We wanted to have more people doing more crawling and more affiliated communities involved. We had a listserv that had a lot more members than in prior years, and we worked directly with the White House. So we mentioned our work directly with the government: we mentioned the White House, but we also worked with data.gov. We worked directly with the GSA, which as James mentioned is the internet services provider for the federal government. So they gave us a seed list and helped us make connections and hooked us up with people who could, you know, point us to APIs that might have more seeds and registries and other things that we could scrape to make the crawl better. And we updated the portal and saw some of the other access that I already mentioned. Our opportunities here were not just to have more crawling; we also used a number of new crawling technologies. 
But through some of the coordinated Data Refuge and EDGI events, we also helped build web [inaudible] capacity in other organizations. So at some of those events, people are actually crawling, or they work for institutions that are also crawling. And we've gotten a lot more researcher engagement. I'm running a little short on time, so I'm probably going to try to plow through the rest of these in the next couple of minutes so we have time for Q and A. And I know there have been some good questions in the chat box, which I haven't quite been able to follow, but somebody is collecting them and we're [inaudible]. One of our challenges: well, the web just continues to grow; the volume, the proliferation, the amount of media and YouTube and video is just crazy. James mentioned that this is sort of a passion project of the institutions that are involved. It's none of our day jobs to do this, and there's no dedicated staffing. So the growth in size, as well as the growth in attention to the issue this year, has been a challenge. We don't do much cataloguing or QA. We mentioned efforts to try to help on that with PDFs and stuff, so it would be great if we could get more crowdsourcing, because as I'll show in a minute, it's hundreds of millions of individual files. So that will be important. We had more partners, which meant more project management, which of course is always a challenge for sweat equity projects. We had a lot more seed lists, which is a good thing actually, and then there are some of the other limitations I already mentioned. What did we start with? There is actually a registry of every registered social media account in the federal government. It's about 9,000, and I put the percentages there. We made a real effort this year to find every single website that has ever existed for dot gov and put it in the crawl, because if it doesn't exist anymore, it doesn't really affect the crawl anyway. So I think our final seed list was about 190,000. And if you remember the slides from the early years, they were like 3,000 and 5,000. So we would just get everything as far as putting it in the seed list and letting it go. Plus crowdsourced nominations, as we saw before, and then some other donated lists as well. What does it look like, since we just sort of ended? Well, here at the Internet Archive, we have 240 terabytes, so that's pretty huge. It's about a quarter of a petabyte, so that's a lot. We have basically two different sections of it. One is all the web stuff, and then we did a lot of FTP, and that is mostly because in the 90s, FTP servers would host data sets. So it's not really web pages; it's more just a storage kind of mechanism. And you can archive it. It doesn't really replay, because there's not really anything to look at except for the files. So you just end up [inaudible] looking at the file. But we really tried to preserve those because they're endangered, since FTP is not used all that much anymore, but a lot of the government FTP sites are still up. For responsive web URLs, we got over 300 million, so that's a pretty good number, and about 12,000 FTP files. And we're working on some of the host stats. The Library of Congress got about 35 terabytes and UNT got about 18. So we'll be merging them all together and figuring out what was duplicated, and we'll [inaudible] duplication, and then we'll get stats for the entire crawling effort. Our entire End of Term collection -- you can search for it on archive.org and you'll find it -- all the files are downloadable. 
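Since FTP harvesting may be less familiar than web crawling, here is a minimal sketch of what mirroring one directory of a public FTP server can look like with Python's standard ftplib. The host name and path are placeholders, not actual End of Term targets, and a real harvest would walk the whole directory tree, record checksums, and package the results for preservation.

```python
# Minimal sketch: copy the files in one directory of a public FTP server to
# local storage. Host and path are placeholders for illustration only.
import os
from ftplib import FTP

HOST = "ftp.example.gov"      # placeholder, not a real End of Term target
REMOTE_DIR = "/pub/datasets"  # placeholder directory
LOCAL_DIR = "ftp_mirror"

os.makedirs(LOCAL_DIR, exist_ok=True)
with FTP(HOST) as ftp:
    ftp.login()                      # anonymous login for public servers
    ftp.cwd(REMOTE_DIR)
    for name in ftp.nlst():          # list entries in the directory
        local_path = os.path.join(LOCAL_DIR, name)
        try:
            with open(local_path, "wb") as f:
                ftp.retrbinary(f"RETR {name}", f.write)
            print("saved", name)
        except Exception as exc:     # subdirectories and unreadable entries
            os.remove(local_path)
            print("skipped", name, exc)
```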
The collection is, you know, in web archive format, so the files are not necessarily user friendly. We put them in a pre- and post-inauguration collection so that people understood how the two snapshots happened, and we'll be blogging more about it. Lots of conversations with researchers. We have APIs in development, which is just a programmatic way that you can interact with the files and the data sets that I've mentioned. And we are still continuing to archive, and we're continuing to work with EDGI and Data Rescue and others; they've been collecting, and we're going to take copies of their data too, so that should be pretty exciting. I mentioned that we're going to continue to crawl even though EOT is over, so there's a new UNT nomination tool for government web and data archives, and there's a URL, so we'll share it. Anything you put in there comes to us and we crawl it. So, if people are interested in continuing to nominate, I certainly encourage them. There's what the collection looks like on archive.org, and you can see the two pre and post collections. Yeah, alright, last slide. So we have a little time left. You know, what are we going to do in the future? We're going to continue to crawl and form partnerships and work with community efforts, as well as with other libraries and archives, to both capture but also make accessible. We're going to redo the EOT portal that's hosted by the California Digital Library, as well as figure out some of the other access methods that I gave a high-level overview of. Everything's public. It's dot gov, so it's public domain. So, if you want it, take it. If you want some of it, take that part; if you want all of it, take it. It's a lot. And then continue with the community building. So Data Refuge and EDGI have just done amazing work in hosting what must be over a hundred events at this point and building tools for people to have their own data-a-thons or hack-a-thons or nomination [inaudible]. So really, the community building this year, which is something that those of us doing the web archiving don't have a lot of time or bandwidth for, has been fantastic. And there's the website, and I think this is the last slide. So I think we're done as far as us presenting. >> Well, thank you both for that fantastic presentation. I know we have only a couple minutes left. We have some questions here. The first one I saw that didn't quite get answered was one of my questions. It was a follow-up to Grace's question that Mr. Jacobs was answering, and basically what I was asking was: now that you are tracking FTP, do you think you'll be able to make the comparison between the data from previous administrations and the 2016 collection of data, if that makes sense? >> Yeah, it does make sense. I thought I added something in the chat about that. I don't know that we have anything to compare 2016 against, because we didn't harvest FTP sites for 2012, and so we really can't make that analysis of where the growth is or what has disappeared, although Jefferson mentioned the pre- and post-crawl, or the pre-election and post-election crawl, and so it could be that there are some comparisons there. I know there's been some news recently about the administration scrubbing the word, or the concept of, climate change from the EPA website and from the Department of the Interior website, and so, you know, some of that is happening and we may be able to analyze it later. >> Great. 
T asked the question, why did you need to transfer it? And if I recall, this was during the part of the presentation about the couple of public access sites that you had created. Is that correct, Kate? >> For researchers, now there are platforms -- that is what she says in the chat as a qualifier. >> Yeah. >> Yeah, so that one was a third-party platform. Right, so it all lived here, but we don't give researchers access to our internal computing systems for them to do analysis. So it basically got taken from the IA and put into a sort of more public repository for people to then data mine on, if that makes sense. >> Great, and Grace asks, do the websites tend to have good metadata describing them? >> Actually, pretty good. Yeah, so most web administrators or people that are building a site will add things like meta tags or information in the HTML that actually describes it pretty well. It's not metadata quite like we think of it in library and archives land, but it is highly descriptive of the content of the page and of the resource itself. So there are ways to extract it or repurpose it for what we would think of as bibliographic metadata. >> But Jefferson did mention that names and other types of information can be extracted also, and that makes for more robust metadata as well. >> Great, a couple more questions -- it looks like they're coming through. Amy asked, what would you say are the key skills one should have to work on the EOT? >> That's a good question. >> Yeah, you know -- >> Go ahead, Jefferson. >> I was just going to say traditional, like, archival, curatorial skills. You know, we sort of know where the crawlers start and we feed them a list to do their thing, but there's also managing what they do. And this doesn't take high-level technical skills, but looking at the reports that we get of what's happening, identifying, you know, what are high-quality sites that are going to have good content as opposed to things that are probably mostly junk. And that can be, like, crawler traps or things where, you know, the web is crazy and so it breaks a lot, and it might be capturing URLs that basically don't have any text on them. And there are ways to do that analysis. So, it's kind of like weeding or appraisal in archives. I think those skills adapt super well to web archiving, because the scale is really large, and so you have to be very cognizant and targeted in analyzing content. >> It also helps to have subject expertise, I think, because in the dot gov domain, if you understand the provenance and the organizational structure of the government, of executive agencies versus commissions versus Congress, then you can understand what you're collecting better and make better post-crawl decisions and better QA decisions and those kinds of things. I've only had to do regular expressions, you know, a handful of times, though I'm sure that Jefferson, you know, speaks regular expressions fluently. So -- >> No. >> Can you say that in a regular expression? >> [Inaudible] engineer. >> Also, a dogged interest in collecting information is a good skill to have, or a good -- I don't know if that's OCD or if that's being super-focused, but it's something of interest, and, you know, I spend part of pretty much every day that I'm working, you know, reading the news and finding the report that the news article cryptically said, "a report just out by this agency said today," but doesn't give the title or any other information. 
So I spend a lot of time digging those things out and either archiving them as fugitive documents, as we say in the docs field, or downloading them and saving them and archiving them into a digital repository, the Stanford Digital Repository. I think of the [inaudible] files. >> I'd also give a final shout-out to my beloved and much-needed-to-promote skill of project management. So this was a -- >> Yes. >> You know, we had a lot of partners who were all very strapped for time. You know, web archiving is at a pretty big scale, so there are lots of [inaudible], not just [inaudible] questions but just, like, resource questions. So I think any institution that was involved needed a good project manager. Angeli [Assumed Spelling] also had to, you know, work together on the overall project management. So that's a skill that cannot be overlooked. And it's a collaborative project where there are different -- you know, we have the government, we had academics, we had us, which is just a nonprofit, so it's different types of institutions, people at different levels of experience and seniority. So all that coordination takes, you know, it takes skill. >> Great -- I know we have one last question, if you don't mind, and that was from Mary, who asked: are there any legal issues with crawling and collecting the data, and are there any legal issues with individuals and groups using the data? >> No, pretty much anything produced by the federal government is considered in the public domain. So if it's in the public domain, there are no challenges around collecting it or making it fully [inaudible]. It's just totally public. But you know, James is a copyright expert, so whatever he thinks is fine. >> I'm the anti-copyright expert. >> So you'll be perfect for [inaudible]. >> And I would just reiterate that, you know, it's public domain. The federal government's information is in the public domain. There haven't been any issues that we've seen, at least that I've seen. There are pieces of government information [inaudible] that actually are copyrighted. For example, technical reports that are written by an outside person, not a person paid by the government, so not a government employee, sometimes have copyright on them. But we've collected, for example, the NASA technical report server, which has been harvested, and I'm sure that there are copyrighted reports within that database, but we've never had anybody come and ask us, you know, hey, take that down, or hey, that's my technical report, you can't harvest it. And so I think we haven't run afoul of that. >> Well, thank you both for a fantastic presentation and excellent answers to our questions. Just before we leave, I'm going to open the floor to Dana and Mary if they have anything they wanted to say before they go. But I just wanted to thank everyone for attending. I wanted to thank Mr. Jacobs and Mr. Bailey for their time. They're both very busy and we appreciate having them. And I also just wanted to thank our partners over at SAA for helping us get the word out about this event so we could have as many people here as possible. >> Thanks, Jonathan and Mary, and thanks to Jefferson and James. This has been a great presentation. We've been live tweeting like crazy because there's so much good information to share out there. I appreciate you both being here and answering our questions, and I appreciate everybody who attended. We had a great group here. 
I'm going to put in the link to where the recording will be via Collaborate, and then after we get the captions and everything converted over, we'll actually make it available as our recording on YouTube as well. And if you'll -- Kate's put in our website information and also some other ways that you can track us. So if you watch those, then you can see how to get access to the recordings. >> Thanks everyone. We really appreciate your inviting us and letting us talk your heads off for an hour. >> Yes, and I second that. Thank you so much for having us. It's fun to talk about this project and great to see the interest, so thank you. >> We'll see you online. >> Have a great night. >> Bye. Thank you. >> Thanks all.