WEBVTT

1
00:00:04.191 --> 00:00:13.050
Taylor Faires: Hi, all. It's wonderful to see you all, and as you heard, we're from Authors Alliance,

2
00:00:13.051 --> 00:00:37.430
Unknown Speaker: and we're gonna talk about legal issues in text and data mining. We have a few goals, I guess. One of them is to sort of chart a pathway through a pretty complicated or sometimes confusing legal landscape, to give a sense of what you can do and what you can't do. And in particular, we're gonna talk a lot about fair use,

3
00:00:37.431 --> 00:00:49.741
which you probably have some familiarity with; the DMCA, which you might have a little less familiarity with, and some exemptions to that, and then also contracts.

4
00:00:49.941 --> 00:01:15.951
Taylor Faires: We'll talk about that, which everybody, I know, has had to deal with: some sort of contractual issue or license issue that's often a really big challenge. So Authors Alliance, for those of you unfamiliar, was formed about 10 years ago. Actually, we have our 10 year anniversary in May. We're gonna have a big party out in San Francisco. The Internet Archive is hosting it for us. So

5
00:01:15.951 --> 00:01:31.970
if you can figure out a way to make a trip to the Bay Area, it'll be a fun time. So, 10 year anniversary: we were formed in 2014, and now we have about 2,600 members, and

6
00:01:31.990 --> 00:01:58.840
Taylor Faires: they kind of span the spectrum, from multiple Nobel laureates to people doing fan fiction and journalists and kinda everything in between. And what really unites them, I think, is this desire to see the public interest benefited through their work: wanting to advance knowledge, wanting to advance research. And that's kind of what we focus on, mostly on legal issues.

9
00:02:08.941 --> 00:02:35.570
Taylor Faires: Yeah, alright. So this is what we're covering. We've been working on text data mining issues for a while now, several years, and actually, it's kind of part of where Authors Alliance came from. So we originated out of, for those of you who remember, the Google Books lawsuit. Google started scanning books way back in 2003, 2004. They almost immediately had a class action lawsuit filed against them.

10
00:02:35.731 --> 00:02:45.270
Taylor Faires: And that was brought by the Authors Guild, different from us, and in the midst of that litigation there was a big

11
00:02:45.271 --> 00:03:07.640
fight in the courts about whether what Google was doing is fair use, and one of the big arguments in support of fair use is that you could do some pretty cool, transformative things with a massive corpus of books. And so in the middle of that lawsuit, there's a whole group of text data mining researchers, digital humanities scholars mostly, who got together and filed a brief

12
00:03:07.641 --> 00:03:13.601
and laid out for the court why these kinds of uses were so important and so persuasive.

13
00:03:13.761 --> 00:03:35.671
Taylor Faires: And that was actually both in the Google Books lawsuit and a parallel lawsuit filed against HathiTrust. And so Authors Alliance formed right after that. And a lot of those digital humanities and text data mining scholars were some of our founding members, who really wanted to support this idea of applying fair use to cultural materials.

14
00:03:35.771 --> 00:04:02.950
Taylor Faires: So we've been doing it for a while. Over the last year and a half, we've had a grant from the Mellon Foundation to focus a little bit more on supporting text data mining researchers and helping them navigate through the law. And that has been in a few different forms. One is workshops like these. The other is just consulting with individual researchers, having conversations with them.

15
00:04:03.151 --> 00:04:19.301
Taylor Faires: And then the third is a report that we are in the midst of finalizing and publishing that works through where the law is working for text data mining research and where it's not working, with the idea that this is a good opportunity to kind of

16
00:04:19.721 --> 00:04:34.030
summarize some of the big issues that need some more sustained attention. And we'll go through those issues as we talk. Some of them I think you're probably already familiar with. Some of you may be like banging your heads against the wall right now on some of these

17
00:04:34.371 --> 00:04:35.991
Taylor Faires: So

18
00:04:36.251 --> 00:04:37.731
Taylor Faires: that's where we're going.

19
00:04:38.101 --> 00:04:43.281
Taylor Faires: Did I change?

20
00:04:43.361 --> 00:05:07.291
Taylor Faires: So really, I'm gonna give a little bit of an overview of just text data mining. I am not a text data mining researcher, but I talk to enough people and I'm familiar enough with the research that I can give you some sense of the cool things that they do. And sometimes it just seems like magic to me, where they take this giant corpus. And there's

21
00:05:07.291 --> 00:05:17.890
a ton of effort that goes into cleaning up the data and then leading into some of the analysis, but I just wanted to give you a taste, for those of you who may be less familiar,

22
00:05:18.071 --> 00:05:29.911
Taylor Faires: of the range of applications, because I think that's a pretty interesting piece of this. And I'm not sure what it looks like here at the University of Chicago, but I found that at a lot of places

23
00:05:30.161 --> 00:05:58.941
Taylor Faires: you end up having sort of clusters of folks who are interested in this in certain subfields or disciplines. And one of the things I've appreciated going around to different universities and talking about this is that it really spans almost every discipline. Everybody from people in the health sciences, to the digital humanities, to folks in the social sciences. And so these are things that I think at a university level there should be a lot of attention paid to.

24
00:05:59.181 --> 00:06:16.121
Taylor Faires: So then we'll talk through copyright and licensing. That's like the core stuff that you need to know if you want to get engaged in this work, or if you're supporting people who are engaged in this work, either on the technology side or as a librarian.

25
00:06:16.291 --> 00:06:44.250
Taylor Faires: And then we'll also talk about a particular kind of, not a niche use case, but a use case where there are some special rules that apply when you are using materials that are protected by digital locks like DRM, and how to navigate through that process through the DMCA and then some practical takeaways, of which we have many.

26
00:06:44.491 --> 00:06:50.050
Taylor Faires: So text and data mining, there are lots of definitions about this

27
00:06:50.111 --> 00:07:16.310
Taylor Faires: For our purposes, text and data mining refers to automated analytical techniques aimed at analyzing digital text and data. So this is really kind of looking for patterns, looking for themes and trends across a corpus of material, often a large one. And

28
00:07:16.311 --> 00:07:32.551
Taylor Faires: the reason for this, what we really focus on, is the why: why are people doing this kind of analysis? And this is from Marti Hearst, a researcher at UC Berkeley's iSchool who's written a lot about this and is one of the kind of

29
00:07:32.851 --> 00:07:48.320
Taylor Faires: original, I guess, text data mining people, who's been doing this for 20-plus years and really focuses in on the idea that we're doing this because we wanna be able to use computational techniques to make discoveries that no individual person could make

30
00:07:48.321 --> 00:08:10.861
Taylor Faires: on their own. And this was actually a point in our brief to the court years and years ago in Google Books and HathiTrust. We included lots of examples of, you know, here are themes that, if you sat down as a human and decided you were gonna somehow read these 1,000 books and just jot down your notes, you would never be able to come up with.

31
00:08:10.911 --> 00:08:39.920
Taylor Faires: So that's kind of the gist of it. You know, we see these kinds of techniques applied across a range of fields. These are some in the sciences: some interesting work trying to track metabolic pathways. There's some really interesting work looking at medical imaging and trying to identify patterns across medical imaging and patient health records. There's a lot of work being done there

32
00:08:39.921 --> 00:08:58.841
Taylor Faires: to try to mine patient health records, and some pretty cool stories about people doing that and identifying diagnoses that were missed by doctors, because they don't have this bigger picture of what a thousand other people who had those same symptoms maybe look like.

33
00:08:59.041 --> 00:09:02.380
so those are some really interesting ones.

34
00:09:02.631 --> 00:09:25.441
Taylor Faires: There are some other interesting ones in the social sciences. This is a super interesting study to me, done by some folks at Stanford who were analyzing transcripts, basically, of body cam footage from the Oakland Police Department and trying to understand if there were any differences in the ways police officers spoke with individuals that they interacted with

35
00:09:25.441 --> 00:09:33.581
based on race. And they did discover some pretty interesting things there. So that was

36
00:09:33.771 --> 00:09:42.711
Taylor Faires: one study in the social sciences. There are, you know, scores of these things. I'm just sort of throwing these up. Here's a smorgasbord of kind of interesting little data points.

37
00:09:42.961 --> 00:10:13.790
Taylor Faires: This is another from a group at the University of Richmond, called the Distant Viewing Lab, which works on large projects to visualize large visual corpora. So in the top left corner there, they studied television shows, and one of the things here is they looked at the representation of female main characters in Bewitched and I Dream of Jeannie,

38
00:10:13.921 --> 00:10:32.811
Taylor Faires: and one of the things that they were trying to investigate was how gender is being performed and represented through measurable formal elements, like presence on screen and the characters' faces on screen. And so this is just a visualization of one of their analyses,

39
00:10:33.181 --> 00:10:45.691
Taylor Faires: and it kinda highlights and you can just see it pretty clearly, visually, the pretty dramatic difference between the 2 shows in terms of representation on screen.

40
00:10:46.911 --> 00:11:11.921
Taylor Faires: And then text, so there's a lot of work on TDM research applied to literature. And we're gonna talk about this a bit throughout. So this is a particularly interesting example. It highlights the power of text data mining across a pretty large corpus of literary works. I think in this case it was like 104,000

41
00:11:11.921 --> 00:11:21.111
Taylor Faires: books with comparative analysis across 2 sources, the Chicago Corpus, which y'all may be familiar with and HathiTrust.

42
00:11:21.111 --> 00:11:39.041
Taylor Faires: And it really demonstrates some of the hidden meaning, the patterns, and the things that we wouldn't be able to identify just as individuals looking at these works. And in this case it kind of shows a counterintuitive result: despite assumptions about social trends

43
00:11:39.101 --> 00:11:52.830
Taylor Faires: and a more inclusive society sort of developing between the 1800s and mid twentieth century, we actually see depictions of women in fiction falling over this time. So it raised a lot of questions about why is that happening.

44
00:11:52.831 --> 00:12:03.410
Taylor Faires: One little thing I'll add to this: a lot of this research also starts to get at interesting questions, I think, for libraries about

45
00:12:03.411 --> 00:12:23.171
the development of collections over time. Because what I end up reading frequently about these sorts of analyses is going back and saying, Yeah, but where did we get this collection of books? And what does this sample - is this really representative? And why did libraries collect these books? And so I think there's a lot of kind of bibliographic

46
00:12:24.341 --> 00:12:29.751
Taylor Faires: introspection that needs to happen there, too. So I think it's fascinating for libraries.

47
00:12:29.771 --> 00:12:35.701
Taylor Faires: Some of this has kind of broken out into the more popular world. Anyone ever heard of Prosecraft?

48
00:12:35.791 --> 00:13:03.420
Prosecraft doesn't exist anymore because they sort of got forced out of existence. It was a kind of interesting site. It was basically like text data mining for people who are authors and wanted to kind of benchmark their work against others. And so you could use this and kind of get some statistics and interesting stuff about how your work compares to great works of literature.

49
00:13:03.421 --> 00:13:14.391
Taylor Faires: And you could do that on all sorts of analytical metrics, like, do I use, you know, more adverbs than Charles Dickens, or stuff like that.

50
00:13:14.391 --> 00:13:28.731
So this got kinda pushed out of existence. There was a sort of Twitter mob that developed, and part of it was a very, very strong reaction to the idea that individual authors' works were being used

51
00:13:28.731 --> 00:13:51.151
Taylor Faires: as data to create this type of tool. I think it also got kind of caught up in the fact that the creator of this tool described it as powered by AI, and people are kind of terrified of that, even though it's sort of AI, I guess. AI-ish. So it's made its way out there into the broader world.

52
00:13:51.421 --> 00:14:14.331
Taylor Faires: So we wanted to talk a little bit about AI, too, and that gives a good opportunity to push into that direction a little bit. And you know, a lot of the same types of legal issues that show up with text data mining also show up in artificial intelligence applications. And one of the big kind of preliminary reasons why is

53
00:14:14.331 --> 00:14:25.611
Taylor Faires: to train large language models, and I won't go into super detail on this, but to train large language models, you need large amounts of language. You need a large corpus of text.

54
00:14:25.611 --> 00:14:42.291
Taylor Faires: And so that has proven to be quite controversial because of the sources from which these LLMs have been trained. So this is a paper written by some folks at EleutherAI. Has anyone heard of Eleuther before?

55
00:14:42.311 --> 00:14:54.801
Taylor Faires: They're a nonprofit. They've done a lot of work on artificial intelligence training. And this was a paper about a kind of training data set that they created called the Pile.

56
00:14:54.931 --> 00:15:16.981
And the Pile actually incorporates a number of other interesting and, from a subject matter basis, diverse datasets that are useful for pretty robust LLM training. And if you look in the bottom right hand corner of that paper, it lists some of the different sources that they've used.

57
00:15:17.641 --> 00:15:19.111
Some of them

58
00:15:19.121 --> 00:15:42.460
Taylor Faires: have no legal problems at all. Like, they pulled from the US Patent and Trademark Office database; there are no copyright issues with that. And, like many others, they pulled from many kind of open source, openly licensed, copyright-free places, and that's been a kind of theme with these: trying to get large corpora of material that are publicly available and sort of legally clean.

59
00:15:42.511 --> 00:15:53.840
And you want a pretty good diversity. That's why, and this is one of the kind of funny things in here, they're trying to train on emails. Apparently one of the best sources of emails as a large data set

60
00:15:53.841 --> 00:16:13.670
Taylor Faires: is the publicly available copies of the Enron case emails, and so they relied on those. But where this gets more controversial is, they use this data set called Books3, which is made up of 183,000 or so books,

61
00:16:13.671 --> 00:16:19.400
Taylor Faires: which mostly are coming from places like

62
00:16:19.761 --> 00:16:29.281
Taylor Faires: LibGen and Z-Library and other pirate websites. And so that's caused a lot of controversy.

63
00:16:29.381 --> 00:16:41.220
Taylor Faires: So that's sort of text data mining and a little bit of AI. Now we get into the main and really important question that I'm handing it off to Rachel to answer totally for you.

64
00:16:41.431 --> 00:16:52.980
Taylor Faires: Is it legal? Hi, everyone! So in the next large handful of slides, I'm gonna lay some groundwork for how copyright law and licenses regulate text and data mining.

65
00:16:53.211 --> 00:17:12.521
Taylor Faires: We're gonna try to focus the bulk of this presentation on practical information and key takeaways, not getting you too mired in the sort of copyright details. But it's good to have a general sense of the copyright law operating in the background, since it regulates much of what researchers are and aren't able to do in terms of text and data mining,

66
00:17:12.681 --> 00:17:26.711
Taylor Faires: and it creates significant obstacles, which exist across areas of study. But there are also copyright principles that provide a legal basis for researchers' ability to conduct text and data mining. License agreements, which I'll touch on at the end,

67
00:17:26.741 --> 00:17:36.051
Taylor Faires: can also restrict text and data mining research, which is a really tough problem. And we'll spend some time talking through a couple of examples at the very end of this section.

68
00:17:37.341 --> 00:17:43.420
Taylor Faires: So at the most basic level copyright protects any and all original creative expression.

69
00:17:43.491 --> 00:17:52.830
Taylor Faires: This covers obvious things like novels, songs, and paintings, but also less obvious modes of creative expression like computer code and personal emails.

70
00:17:53.491 --> 00:18:09.430
Taylor Faires: And it attaches immediately and automatically. What this means is that if you sketch a quick doodle or write some notes during this presentation, you will have created a copyrighted work, and you hold a copyright in that work automatically, without any extra action on your part.

71
00:18:10.171 --> 00:18:25.510
Taylor Faires: The very first copyright law dates back to the eighteenth century, but modern copyright law has its roots in the US Constitution's intellectual property clause, as the words on the screen that we've highlighted, "to promote the progress of science and the useful arts," suggest.

72
00:18:25.881 --> 00:18:36.230
Taylor Faires: The goal of copyright is to incentivize new creation by allowing creators to do what they wish with their works and rights for the duration of copyright. We sometimes call it a temporary monopoly.

73
00:18:36.981 --> 00:18:50.220
Taylor Faires: And I'll note, too, that the intellectual property clause mentions "limited times." Copyright doesn't last forever. The current term is the life of the author plus an additional 70 years, but it's lengthened a lot over time. It used to be much shorter.

74
00:18:50.641 --> 00:18:58.581
Taylor Faires: Once a copyright expires, a work enters the public domain, and becomes free for all of us to use without the ordinary copyright restrictions.

75
00:18:59.181 --> 00:19:10.430
Taylor Faires: For this reason, a lot of text and data mining research on literary works has focused on literary works that are in the public domain. So those published in the 1920s, and earlier.

76
00:19:12.241 --> 00:19:26.220
Taylor Faires: And when we talk about a copyright, what we really mean is a bundle of different rights. When you own a copyright in a work, you're the only one who can reproduce it, who can create a derivative work like a movie adaptation or a foreign language translation,

77
00:19:26.291 --> 00:19:41.021
Taylor Faires: distribute copies of your work, or perform or display it publicly. And if others want to exercise any of those rights, like perform your play or publish your book, they need to get permission from you, the copyright holder, and that is typically handled in a license agreement,

78
00:19:42.061 --> 00:19:52.880
Taylor Faires: and copyright holders can sue to enforce their rights when others don't get permission, unless an exception or limitation to copyright protection like fair use applies, which I'll get to in a minute.

79
00:19:54.541 --> 00:20:06.910
Taylor Faires: So because copyright is intended to incentivize new creation and protects original expression, it doesn't protect ideas, facts, or concepts. This is because facts aren't invented

80
00:20:06.941 --> 00:20:19.180
Taylor Faires: or created, but discovered. And as a matter of public policy, keeping others from using an idea or a concept isn't good for the progress of knowledge, and so works against the very purpose of copyright law.

81
00:20:19.651 --> 00:20:35.900
Taylor Faires: Facts and ideas are the building blocks upon which creations and knowledge are built, so it doesn't make sense to keep others from using these types of information. And it follows that some types of copyrighted works can contain within them unprotected information, like facts and ideas.

82
00:20:36.171 --> 00:20:46.000
Taylor Faires: So, for example, a biography of a famous figure might include a lot of creative expression, but it will also include uncopyrightable facts, like when that person was born.

83
00:20:46.481 --> 00:20:55.111
Taylor Faires: This can free up a lot of materials for onward creation. A copyright holder can't sue someone for using a fact from their work in a second work.

84
00:20:55.531 --> 00:21:17.530
Taylor Faires: And a related concept known as the merger doctrine provides even more protection. If there are just a few ways of saying something, like "the University of Chicago was founded in 1890," the law will not provide protection for that sentence. There's simply not enough new creative material there. And you don't have to strain language to avoid using someone else's phrasing when it comes to something so simple.

85
00:21:18.891 --> 00:21:31.031
Taylor Faires: And importantly for our purposes today, copyright law can, and in many cases does, limit text and data mining research. First, because only the copyright holder can exercise copyright's exclusive rights

86
00:21:31.111 --> 00:21:33.670
Taylor Faires: and because others need permission to do so.

87
00:21:33.771 --> 00:21:39.320
Taylor Faires: When text and data mining research involves an exclusive right, like making a copy of a work,

88
00:21:39.421 --> 00:21:50.781
Taylor Faires: that researcher might be running afoul of copyright law. And second, there's a 20-year-old amendment to copyright law that Dave mentioned a bit earlier, the DMCA or Digital Millennium Copyright Act.

89
00:21:50.881 --> 00:22:00.661
Taylor Faires: It's a violation of that part of copyright law to circumvent technical protection measures on copyrighted works which can stand in the way of text and data mining research.

90
00:22:01.011 --> 00:22:14.581
Taylor Faires: We're gonna talk a lot more about that in the next section. But despite these limitations, there are still a lot of different types of works available for text and data mining research. So those include older works, the public domain works I mentioned earlier

91
00:22:14.871 --> 00:22:42.590
Taylor Faires: as well as works published or created by the Federal Government, which are also automatically a part of the public domain. So if you recall the earlier example Dave mentioned from the Distant Viewing Lab at the University of Richmond, the photographs that those researchers studied were taken by the Federal Government, or rather photographers hired by the Federal Government, so they were automatically a part of the public domain, and the copyright issues weren't really present there.

92
00:22:43.381 --> 00:22:56.111
Taylor Faires: And of course, researchers can conduct text and data mining on collections licensed for text and data mining, because the rights holders have affirmatively given permission for the researchers to do so via that license.

93
00:22:56.511 --> 00:23:14.331
Taylor Faires: Finally, and importantly for our purposes today, as of 2021, academic researchers can conduct text and data mining research on copyrighted works outside of licensed collections under a brand new regulation from the Copyright Office, which, again, we'll speak about in the next section.

94
00:23:15.471 --> 00:23:23.790
Taylor Faires: So fair use, I mentioned earlier, is a super important concept in copyright law, one of the things we talk about all the time

95
00:23:24.151 --> 00:23:39.891
Taylor Faires: which allows people to use others' copyrighted works without permission in certain circumstances and for certain purposes. It's an exception to the normal rule that only the copyright holder can exercise any of the bundle of rights that copyright confers.

96
00:23:40.541 --> 00:23:45.520
Taylor Faires: We put an excerpt from the statute that enshrines fair use up on the slide.

97
00:23:46.071 --> 00:24:02.851
Taylor Faires: Text and data mining is widely considered to be a fair use. And the Fair Use Doctrine is a large part of what permits text and data mining research. Before I get into specifics, I want to quickly draw your attention to some of the enumerated purposes that fair use envisions and specifically favors

98
00:24:02.981 --> 00:24:15.901
Taylor Faires: Teaching, research, and scholarship are explicitly listed in the statute. This is hugely important for establishing text and data mining research as a fair use, because it fits so comfortably within research and scholarship.

99
00:24:17.821 --> 00:24:22.890
Taylor Faires: Fair use is a four-factor inquiry that asks a series of questions about a first work,

100
00:24:23.011 --> 00:24:30.881
Taylor Faires: a second work that is using that first work, and the particular characteristics of that use in order to determine whether the use is fair.

101
00:24:31.121 --> 00:24:40.281
Taylor Faires: It considers the purpose and character of the use, so whether it's something new and different from the original, whether it transforms the original into something new

102
00:24:40.291 --> 00:24:53.641
Taylor Faires: or merely uses it for the same purpose for which it was created. Second, the nature of the first work: the closer it is to the core of copyright, so the more creative it is, the less this factor will favor fair use.

103
00:24:54.111 --> 00:25:07.650
Taylor Faires: Third, the amount and substantiality of the portion of the first work that was used. This asks how much of the first work was used: a short excerpt, or the full work? Something more peripheral, or is it really the heart of that work?

104
00:25:07.981 --> 00:25:23.670
Taylor Faires: Courts will also often ask whether the portion of the work that was used is reasonable in light of the purpose of the use. This is an important principle within the context of text and data mining research, because for some types of research, the project only works if you can use the full copyrighted work.

105
00:25:24.641 --> 00:25:30.351
Taylor Faires: Lastly, does the use hurt the market for the original work?

106
00:25:30.781 --> 00:25:41.830
Taylor Faires: This factor asks whether the second work is a competing substitute for the original work. While all of these factors are important, the first and fourth factors are generally held up as the most important ones.

107
00:25:42.421 --> 00:25:47.360
Taylor Faires: So to create a kind of shorthand test you might think about when considering fair use:

108
00:25:47.591 --> 00:25:57.670
Taylor Faires: When a second work transforms a first work into something new and different, and in so doing doesn't harm the market for that first work, a use is generally likely to be fair

109
00:25:58.081 --> 00:26:07.331
Taylor Faires: In the context of text and data mining, text and data mining research uses copyrighted works for a very different purpose from the purpose for which they were created.

110
00:26:07.791 --> 00:26:25.700
Taylor Faires: It transforms the original works by extracting information from the expressive text in order to find patterns in that data. And while text and data mining does often use the entirety of copyrighted works, this is in many, if not most cases, reasonable in light of the purpose of the research

111
00:26:26.461 --> 00:26:32.850
Taylor Faires: And text and data mining research is highly unlikely to compete in the market with the original works themselves.

112
00:26:33.081 --> 00:26:41.100
Taylor Faires: Patterns and information about a copyrighted work can't really serve as a substitute for the work itself.

113
00:26:42.651 --> 00:26:53.151
Taylor Faires: There are 2 fair use cases over the past decade that went a long way towards laying the groundwork for text and data mining as a fair use. Dave mentioned these at the outset, so I'm gonna try not to spend too much time on them.

114
00:26:53.211 --> 00:27:06.290
Taylor Faires: So they arose out of the same basic facts. Google had partnered with university libraries to digitize books in their collections and create a massive multi-million volume corpus and full-text searchable database for the Google Books project.

115
00:27:07.351 --> 00:27:17.390
Taylor Faires: HathiTrust did more or less the same thing, but it also allowed full text copies for blind and print disabled users, and provided more sophisticated systems for computational analysis.

116
00:27:17.641 --> 00:27:33.931
Taylor Faires: Then the Authors Guild (again, not to be confused with Authors Alliance, as we are, in fact, often at odds with them in policy debates) sued Google and HathiTrust for copyright infringement, and the organizations defended their projects on the grounds that they were fair use.

117
00:27:35.171 --> 00:27:47.301
Taylor Faires: Dave mentioned, too, that this is sort of part of the founding origin story of Authors Alliance. Both cases had amicus briefs submitted by a group of digital humanities scholars arguing that the services were fair use

118
00:27:47.521 --> 00:28:05.580
Taylor Faires: and making really compelling arguments, in the sort of four-factors world, that Google's new and different purpose was a large part of what made the use fair, and also explaining a lot about the value of digital humanities research more generally. And this was, you know, back in 2015.

119
00:28:06.201 --> 00:28:11.981
Taylor Faires: This brief, too, I think, was really important for making the argument that text and data mining research was a fair use.

120
00:28:13.721 --> 00:28:21.500
Taylor Faires: So regarding the purpose of the use, the Second Circuit said that Google and HathiTrust made highly transformative uses of those original works.

121
00:28:21.991 --> 00:28:28.730
Taylor Faires: Creating a full-text, searchable database was undoubtedly a totally different purpose from the books themselves.

122
00:28:29.041 --> 00:28:43.790
Taylor Faires: On the nature of the works, the Second Circuit said that some works were creative, some less so, so on balance it wasn't particularly important in the analysis. And on the question of the amount and substantiality of the portions of the work that was used,

123
00:28:44.201 --> 00:28:53.541
Taylor Faires: the Second Circuit said that while the organizations digitized the works in full, this was appropriate given the purpose of the digitization project.

124
00:28:53.721 --> 00:29:04.610
Taylor Faires: This was a really important holding because new, innovative uses of copyrighted works, like text and data mining and, more recently, generative AI, often do involve using entire works.

125
00:29:05.111 --> 00:29:15.810
Taylor Faires: And on the question of the effect of the use on the market for the original works, the Second Circuit said that the databases did not provide competing substitutes for the original works that they contained.

126
00:29:16.211 --> 00:29:26.470
Taylor Faires: Google paid close attention to security and made sure that the works in its collection couldn't be used for consumptive purposes. So Internet users couldn't read the full book on the Google Books Project.

127
00:29:27.281 --> 00:29:39.411
Taylor Faires: This is a limitation that was imported into our new text and data mining exemption that Dave will talk about in the next section, and it's a really important one for ensuring that projects like these stay within the bounds of what the law allows.

128
00:29:41.601 --> 00:29:57.521
Taylor Faires: Another relevant case for our purposes is iParadigms, A.V. ex rel. Vanderhye versus iParadigms, a case about the legality of turnitin.com, a service that checks student papers for possible plagiarism by comparing them to a database of other student papers.

129
00:29:57.921 --> 00:30:09.260
Taylor Faires: Turnitin argued that this was a fair use, as it used the papers for a very different purpose than the purpose for which they were created, preventing plagiarism as opposed to education and scholarship.

130
00:30:10.451 --> 00:30:17.561
Taylor Faires: Like Google Books, Turnitin had to use the entirety of the papers for its database in order for the tool to be effective.

131
00:30:17.851 --> 00:30:24.161
Taylor Faires: So this case further entrenched the notion that using the full work as a fair use can be reasonable in some cases.

132
00:30:25.691 --> 00:30:36.321
Taylor Faires: Similar cases have arisen around search engines and the use of thumbnails of images. Thumbnails in image search results are used for really different purposes than the images themselves,

133
00:30:36.371 --> 00:30:50.970
Taylor Faires: and it's necessary for a search engine to display the entire image in order for that thumbnail to be useful to an image searcher. Several cases in the Ninth Circuit have held that Google Images and search engines like it are fair use.

134
00:30:52.961 --> 00:31:01.481
Taylor Faires: So while the topic of today's conversation is text and data mining, this kind of computational research is really closely tied into artificial intelligence.

135
00:31:01.711 --> 00:31:14.480
Taylor Faires: So, as such, we thought it was worth sharing a bit about the parallel conversations around copyright and generative AI. So as I've said, Google Books laid the groundwork for computational research as a fair use,

136
00:31:15.111 --> 00:31:21.461
Taylor Faires: like text and data mining research, and this decision provides strong legal support for generative AI research.

137
00:31:22.340 --> 00:31:37.061
Taylor Faires: And copyright issues and machine learning models have been percolating for a long time. The Copyright Office actually held an event to discuss it way back in 2020. But things really kicked off in late 2022, when OpenAI introduced ChatGPT.

138
00:31:37.351 --> 00:31:49.831
Taylor Faires: So shortly thereafter, the first class action, the first of many, was filed against OpenAI, GitHub, and Microsoft for allegedly violating open source licenses in their use of software to train AI models.

139
00:31:50.351 --> 00:31:55.831
Taylor Faires: Then, about a year ago, the Copyright Office waded into the waters of copyright and generative AI,

140
00:31:56.011 --> 00:32:10.251
Taylor Faires: issuing an opinion letter stating that AI-generated works were not protected by copyright, and creators couldn't register copyrights in works that they had created with AI programs like Midjourney or ChatGPT.

141
00:32:11.011 --> 00:32:35.380
Taylor Faires: Next, the Copyright Office launched an AI initiative, which is still ongoing, responding to the increased public attention on the issue and the lawsuits that had begun to crop up. That initiative began with a series of listening sessions where stakeholders from various industries weighed in on copyright and AI-generated works in the realms of text-based works, images, music, and audiovisual works.

142
00:32:35.761 --> 00:32:59.010
Taylor Faires: It also published what's known as a notice of inquiry on copyright and generative AI, posing around 20 questions that various stakeholders and members of the public could respond to. The Copyright Office received over 10,000 comments in response, which shows the degree of attention this is getting from the legal community. As a side note, they are all public and online, if anyone wants to really fall down the rabbit hole.

143
00:32:59.561 --> 00:33:06.510
Taylor Faires: As of today, we have more than a dozen copyright lawsuits against companies that develop and deploy generative AI models.

144
00:33:07.771 --> 00:33:13.461
Taylor Faires: All this leaves us with 3 big questions about the intersection between generative AI and copyright.

145
00:33:13.821 --> 00:33:30.800
Taylor Faires: First, are generative AI outputs protected by copyright under today's copyright laws? The answer is no. In order for a work to be copyrightable, it must have been created by a human. This is known as copyright's human authorship requirement, and it's a principle that dates back to the late 1800s.

146
00:33:31.411 --> 00:33:39.240
Taylor Faires: Second, is it permissible to use copyrighted works as training data under the doctrine of fair use? We think so. This is

147
00:33:39.371 --> 00:33:53.091
Taylor Faires: probably the most controversial of the copyright questions around generative AI and the basis of many of the ongoing lawsuits. Generative AI models use copyrighted works for a very different purpose from that for which they were created

148
00:33:53.331 --> 00:33:59.050
Taylor Faires: and neither the training use nor the underlying model competes with the copyrighted works themselves.

149
00:33:59.341 --> 00:34:19.540
Taylor Faires: But the controversy lies in the fact that generative AI outputs might theoretically compete with the works contained in the training data. We've argued, and I think others have argued, that this point conflates questions about copyright and inputs with questions about copyright and outputs, and it's something that we will certainly see play out in court.

150
00:34:20.101 --> 00:34:38.901
Taylor Faires: The last question is, can AI outputs themselves infringe the copyrights in the works they're trained on? We say yes. Outputs can infringe existing copyrights, and we believe this is something that can be handled with existing copyright doctrine and ordinary existing legal tests within the copyright doctrine.

151
00:34:40.791 --> 00:34:53.211
Taylor Faires: Okay, so I'm going to spend just a minute talking about the copyright implications of publishing text and data mining research as this is a question we've gotten a lot during previous workshops. As with conducting the research,

152
00:34:53.311 --> 00:34:57.020
publishing text and data mining research is generally a fair use.

153
00:34:57.331 --> 00:35:04.510
Taylor Faires: Recall that fair use favors research and scholarship, and publishing this kind of research implicates both of these purposes.

154
00:35:05.141 --> 00:35:08.420
Taylor Faires: But it's also important to consider what outputs look like.

155
00:35:08.511 --> 00:35:17.620
Taylor Faires: If a text and data mining publication involves reproducing lengthy excerpts from the book or works that research is being conducted on,

156
00:35:18.101 --> 00:35:22.140
Taylor Faires: it's a good idea to think about how much this is serving a scholarly purpose.

157
00:35:22.191 --> 00:35:28.861
Taylor Faires: So short snippets being reproduced, like what Google did in the Google Books case, are generally acceptable

158
00:35:30.141 --> 00:35:44.231
Taylor Faires: When published text and data mining research doesn't comply with copyright laws, there can be really serious consequences. Lawsuits are possible, but not super likely, mostly for optics reasons, and because scholarly research has such a strong fair use case.

159
00:35:44.461 --> 00:35:54.161
Taylor Faires: But there can be professional and ethical consequences to violating copyright law or licenses. So this is an example we've spoken about a couple of times, and I think it's a pretty compelling one.

160
00:35:54.201 --> 00:36:08.170
Taylor Faires: In 2021 there was a group of Indian and Canadian researchers who ended up retracting a paper they'd published, which sought to systematically track vaccine hesitancy and logistical challenges associated with COVID vaccines in the US.

161
00:36:08.251 --> 00:36:16.030
Taylor Faires: The authors had conducted data mining on a database of news articles on Factiva, and after publication it came to light that

162
00:36:16.331 --> 00:36:42.191
Taylor Faires: the data mining didn't comply with the University of Ottawa's license with Factiva, and Dow Jones, which owns the database, asked the authors to retract the article. So this shows how it's prudent to not assume that a license, either with an institution or with you as a user of a service, permits data mining, but to try to verify this information before moving forward. I'll acknowledge that this is obviously not always a clear-cut task.

163
00:36:42.301 --> 00:36:47.310
Taylor Faires: And I'm gonna say a little bit more about license agreements before wrapping up this section.

164
00:36:48.421 --> 00:36:54.081
Taylor Faires: So like copyright law, license agreements can limit a researcher's ability to conduct text and data mining.

165
00:36:54.181 --> 00:37:02.210
Taylor Faires: Sometimes license agreements between ebook vendors, streaming services, etc., and the public or an academic institution

166
00:37:02.301 --> 00:37:16.880
Taylor Faires: will explicitly forbid text and data mining, but more often licenses will forbid doing something that's necessary for this kind of research, like breaking digital locks. We see this in a vast majority of consumer facing ebook licenses.

167
00:37:17.551 --> 00:37:33.491
Taylor Faires: There are also a lot of limitations to using licensed databases for text and data mining. We've heard from researchers that having to use these databases can limit their area of study and even lead them to abandon research questions that aren't suited to whatever the relevant database they have access to is.

168
00:37:35.581 --> 00:37:46.661
Taylor Faires: So, for example, here's a quick excerpt from Amazon's Kindle store terms of use, one of the many license agreements users agree to when we download content from the Amazon Kindle store.

169
00:37:47.561 --> 00:37:53.461
Taylor Faires: The highlighted text forbids a user from circumventing any technical protection measures, like DRM.

170
00:37:53.771 --> 00:38:09.730
Taylor Faires: Even if circumventing these technical protection measures is necessary to make a fair use of a copyrighted work like conducting text and data mining, the terms of the license forbid it. This means that text and data mining research on Amazon ebooks can create liability under a contract

171
00:38:09.771 --> 00:38:25.800
Taylor Faires: even if it doesn't create liability under copyright laws. This is what's known as contractual override, and it's a huge problem for a variety of research communities. The conflict arises when copyright laws, with their various intricacies and exceptions, permit some sort of activity,

172
00:38:26.121 --> 00:38:40.750
Taylor Faires: but a digital object like an ebook or software which is not owned but licensed is licensed under an agreement that forbids that activity. The problem here is that contracts can and do limit fair use in a lot of different ways.

173
00:38:41.011 --> 00:38:57.211
Taylor Faires: And unfortunately, it's not a problem that we, as copyright lawyers, or even the Copyright Office, can solve. In this country we have freedom of contract. Companies are generally free to draft license agreements any way they wish, even when they severely limit our rights under copyright.

174
00:38:59.251 --> 00:39:16.861
Taylor Faires: Lastly, I wanted to share with you all some sample language as a sort of positive example. So while licenses can inhibit text and data mining research, they're really flexible, and licenses can also be used to affirmatively authorize text and data mining research.

175
00:39:17.221 --> 00:39:24.520
Taylor Faires: This language is an excerpt from a sample license agreement that affirmatively permits text and data mining research on the licensed materials.

176
00:39:24.571 --> 00:39:46.060
Taylor Faires: It's sample language that was developed by a team at UC Berkeley to help academic librarians negotiate for license terms that permit rather than forbid this kind of research. So we wanted to include this slide to show the flexibility of license agreements. While we've seen them be used to inhibit text and data mining in practice, they could also be used to underscore its legality.

177
00:39:46.631 --> 00:39:49.010
Taylor Faires: And I'm going to turn it back over to Dave.

178
00:39:50.261 --> 00:39:52.530
Taylor Faires: Let me check on time.

179
00:39:52.821 --> 00:40:02.210
Taylor Faires: We're good. Okay, so digital locks. This is sort of a specialty area. Has anyone here ever tried to rip a DVD?

180
00:40:02.601 --> 00:40:03.921
Taylor Faires: You can admit to it, I won't turn you in.

181
00:40:04.221 --> 00:40:27.631
Taylor Faires: It's hard to do, or it used to be hard to do. Now, it's actually pretty easy. You can look online, and there's all sorts of things that you can download to help you do this. But all those things are technically illegal, so in a weird twist that, well, to go back a little bit, Congress

182
00:40:27.841 --> 00:40:41.110
Taylor Faires: in the late 1990s decided that copyright wasn't enough to protect rights holders' interests. And so they said, we're gonna create a new law called the Digital Millennium Copyright Act.

183
00:40:41.111 --> 00:41:08.770
Taylor Faires: And what we're gonna do with this is, we're gonna add an additional layer of liability if you take a digital work that is protected by a digital lock and you circumvent that. And they created 2 sets of liabilities, actually. One was for people who are engaged in that activity, like breaking the digital locks, and the other set of liability they provided for was for people who create tools that would enable that kind of activity.

184
00:41:09.061 --> 00:41:30.960
Taylor Faires: So the reason why I said all of those things, like jailbreaks and other stuff like that to circumvent DRM and other technological protection measures, are illegal is that Congress did provide a pathway through for users who want to break DRM and have a legitimate fair use,

185
00:41:30.961 --> 00:41:53.931
Taylor Faires: but they never bothered to create an exception for people to actually develop the tools to do that. And so those people are in kind of legal limbo land, and no one's ever gone after them for, like, legitimate fair uses, but if you're a security researcher, just, you know, keep that in mind. But for technological protection measures, these are the kinds of things that are protected.

186
00:41:53.931 --> 00:42:21.060
Taylor Faires: And as I said, the DMCA provides an extra layer of liability around this and the law is like pretty straightforward. It says, no person shall circumvent a technical measure. That's like step one. But then they kinda walk it back and they say, "Well, but that restriction doesn't apply if you're a user who's engaged in

187
00:42:21.231 --> 00:42:48.941
Taylor Faires: a use that is non-infringing." So if this thing is substantially inhibiting your ability to engage in a non-infringing use, i.e., you have a fair use, that's the easy way to read it, but it's not automatic. What you have to do is you have to go kind of beg the Copyright Office to grant you an exemption that says that you can do this kind of thing. And so they do this every 3 years, the Copyright Office

188
00:42:49.251 --> 00:42:54.901
Taylor Faires: under the auspices of the Library of Congress, because that's where the Copyright Office is housed, in a kind of weird

189
00:42:55.161 --> 00:42:59.000
Taylor Faires: twist of regulatory structuring.

190
00:42:59.601 --> 00:43:05.871
Taylor Faires: You go to the Copyright Office and you say, please, can I do this thing? Grant me an exemption.

191
00:43:06.031 --> 00:43:23.990
Taylor Faires: So Authors Alliance, along with the Library Copyright Alliance, which is made up of ARL and ALA, and the American Association of University Professors, petitioned the Copyright Office for an exemption to allow this to happen. And that was successful

192
00:43:23.991 --> 00:43:49.431
Taylor Faires: in 2020 that petition went through, and the Copyright Office granted an exemption, and said, yes, we agree that these uses are for the most part fair use and you should be able to do this. So they issued this easy-to-read regulation. I throw it all up here because it's a little lengthy, but it's actually pretty straightforward if you pick it apart.

193
00:43:49.711 --> 00:44:05.011
Taylor Faires: So I wanna walk you through the basics. I think this is helpful to know about because, even if you're not doing this kind of work, it's helpful to know that there actually is a clear path through for lots of these kinds of materials.

194
00:44:05.061 --> 00:44:10.971
Taylor Faires: So the basics are: you have to be a researcher affiliated with a nonprofit institution of higher education.

195
00:44:11.101 --> 00:44:21.900
Taylor Faires: Everybody at the University of Chicago qualifies for that. If you read the reg, a lot of text is devoted to defining what kind of institutions qualify.

196
00:44:22.091 --> 00:44:35.681
Taylor Faires: And I think they were basically trying to weed out you know, the guy in his garage who's running a tech startup and says, "yeah, but I'm a university, too." And so they had to spend some time dealing with that

197
00:44:36.181 --> 00:44:42.801
Taylor Faires: And so the exemption only applies to a narrow subset of materials. This is important:

198
00:44:42.881 --> 00:44:51.501
Taylor Faires: it doesn't apply to, for instance, computer programs. The Copyright Office specifically exempted those out.

199
00:44:51.571 --> 00:45:16.541
Taylor Faires: It doesn't apply to musical works, which is a challenge for some folks, I think, that may wanna expand out into other areas, but it applies to literary works, by and large, and motion pictures in those formats. For some reason the office included digital downloads in there,

200
00:45:16.681 --> 00:45:32.571
Taylor Faires: but the digital download has to be under a perpetual license. And I mean, I've done these license deals, and I don't think I've ever gotten a perpetual license at scale for motion pictures. So

201
00:45:33.221 --> 00:46:00.611
Taylor Faires: but that's what it applies to, and it does come with some strings attached. And one of the biggest strings is that when you create a corpus, you have to implement effective security measures, kinda harkening back to the Google Books case and the HathiTrust case, where the court made a big deal about them protecting the corpus and not letting it leak out onto the Internet. So universities have to do that, too. And there are kinda 2 ways of going about this. One is you can come

202
00:46:00.611 --> 00:46:09.051
Taylor Faires: up with a security standard with rights holders, or you can use your own standard for highly confidential information.

203
00:46:09.481 --> 00:46:25.180
Taylor Faires: That isn't always well defined within institutions, as we found. We found a lot of variation, from people saying, oh, we're gonna treat this like patient health data, which is super extreme, in my opinion, to other folks saying

204
00:46:25.301 --> 00:46:39.790
Taylor Faires: yeah, you know, we're gonna put a password on it, I guess. Well, maybe not that lax. I have actually found nobody doing an irresponsible kind of version of this, but there's a lot of variation there.

205
00:46:40.051 --> 00:46:53.400
Taylor Faires: And then one of the hitches with this is, if you use your own internal standard, there is a right under the statute, or rather under the regulation, for rights holders to come in and say "we wanna learn more about what this security standard is that you're using."

206
00:46:55.961 --> 00:47:23.620
Taylor Faires: So right now, we're petitioning to expand this. We have a few letters of support from TDM researchers in support of that expansion. Thank you, Hoyt, in particular. And what we're asking for is an expansion to allow researchers who are creating corpora under this exemption to also share them with others, because we've encountered this kind of crazy situation where, you know, people are circumventing DRM, and then there's people at another institution that wanna do

207
00:47:23.741 --> 00:47:31.221
Taylor Faires: work on that same corpus, but they have to jump through the same hoops and do that same sort of circumvention themselves. So we're currently petitioning.

208
00:47:31.241 --> 00:47:36.071
Taylor Faires: Sign up for our emails, check our website if you're interested in learning about that.

209
00:47:36.711 --> 00:47:40.880
Taylor Faires: So that's it on digital locks.

210
00:47:41.421 --> 00:48:01.710
Taylor Faires: I'm so nervous about time. We're still good. We have, I think, well, we're good for like 1:30. Okay, good. So now we're gonna do practical takeaways, open questions. Rachel, do you wanna come up and go through some of these? And I might kinda chime in as these things go along. Thank you.

211
00:48:03.391 --> 00:48:08.081
Taylor Faires: Okay, here's some practical takeaways and open questions. I promise this section is very quick.

212
00:48:09.631 --> 00:48:14.991
Taylor Faires: So on copyright and licenses, it's really important to remember that not everything is protected by copyright.

213
00:48:15.301 --> 00:48:26.710
Taylor Faires: Older creative works that have entered the public domain, works authored by the Federal government, and facts and ideas are not protected by copyright and not subject to all of its shackles.

214
00:48:27.911 --> 00:48:55.051
Taylor Faires: All this being said, electronic editions of public domain works, like those available on consumer ebook platforms, could still be protected by digital locks and subject to the DMCA ban on breaking them. So I'm thinking here of, like, Amazon's Kindle store, Alice's Adventures in Wonderland. It will still have DRM on it, and the DMCA liability, or contractual liability, will still attach if one removes that DRM.

215
00:48:55.461 --> 00:49:08.040
Taylor Faires: Next, in a lot of cases, you can get permission from rights holders to make various uses of their works. Getting permission means that a copyright holder is agreeing to allow you to exercise one or more of their exclusive rights

216
00:49:08.511 --> 00:49:21.931
Taylor Faires: and using a licensed database for text and data mining is doing just this. The license agreement between your institution and the service specifically permits text and data mining. Fair use comes into play when you don't have permission,

217
00:49:22.141 --> 00:49:35.350
Taylor Faires: allowing researchers to use copyrighted works without permission when their desired use falls within the bounds of what fair use allows. Text and data mining's status as a fair use makes this concept particularly important for today's conversation.

218
00:49:36.031 --> 00:49:46.521
Taylor Faires: And there's a particularly strong case for text and data mining as a fair use when it comes to academic research, because teaching, research, and scholarship are purposes fair use specifically favors.

219
00:49:47.891 --> 00:49:53.310
Taylor Faires: When it comes to copyright and licenses, it's important to consider what the outputs of your research look like.

220
00:49:54.261 --> 00:50:07.470
Taylor Faires: So to use the Google Books case as an example: even though the inputs of the Google Books project were copyrighted, what it actually produced, snippets and computational analysis, did not reproduce the expressive works.

221
00:50:07.931 --> 00:50:18.151
Taylor Faires: But, on the other hand, if the Google Books project had involved allowing Internet users to view the books in their entirety and read books through Google Books,

222
00:50:18.171 --> 00:50:34.861
Taylor Faires: this would have weighed really heavily against the use being fair, since it would align the purposes of the 2 uses, books that the public could read, and also create a competing substitute for the original works by giving folks a different way to read them, rather than purchasing or checking out a copy.

223
00:50:35.971 --> 00:50:51.151
Taylor Faires: Providing information about a copyrighted work, which in many cases is what text and data mining research does, is a lot safer from a copyright perspective. Recall that copyright protects original creative expression and not information about original creative expression.

224
00:50:52.461 --> 00:50:59.760
Taylor Faires: On the DMCA and digital locks, one important takeaway is that most researchers

225
00:51:00.031 --> 00:51:14.940
Taylor Faires: doing text and data mining work within institutions of higher learning, like all of you, are covered by the exemption. It applies broadly to academic researchers affiliated with universities, though it doesn't protect independent researchers that aren't affiliated with academic institutions.

226
00:51:15.761 --> 00:51:24.960
Taylor Faires: And the exemption only applies to motion pictures in DVD or Blu-ray form, or digital downloads under perpetual access models,

227
00:51:25.091 --> 00:51:28.050
Taylor Faires: and literary works that are distributed electronically.

228
00:51:28.991 --> 00:51:36.491
Taylor Faires: Finally, the exemption's security requirements are potentially really complicated, particularly where it comes to research data sharing.

229
00:51:37.031 --> 00:51:51.640
Taylor Faires: The exemption leaves vague what the exact details of the security measures have to look like, which in part provides institutions with flexibility, and also makes them adaptable to changing technological abilities and circumstances.

230
00:51:51.821 --> 00:51:57.751
Taylor Faires: But, on the other hand, it creates open questions about what exactly institutions need to do to secure their corpora.

231
00:51:58.191 --> 00:52:10.570
Taylor Faires: This is particularly salient when it comes to sharing data with collaborators at other institutions. This is permitted under the exemption, but it doesn't give us much clarity on what the security measures on that sharing have to look like.

232
00:52:12.511 --> 00:52:31.181
Taylor Faires: Some open questions about the exemption include the status of other types of works that aren't included in the classes of works the exemption permits text and data mining research on, but also aren't explicitly disallowed. So music, visual art, video games, and films available via streaming services

233
00:52:31.241 --> 00:52:47.481
Taylor Faires: are some examples of types of works that some researchers do want to conduct text and data mining research on. And the fact that it's a fair use suggests that they should be able to, but DMCA and license issues, of course, come into play, making this an open question.

234
00:52:48.001 --> 00:52:51.771
Taylor Faires: Another question that I think is really important

235
00:52:51.981 --> 00:53:01.290
Taylor Faires: to the context of generative AI is whether fair use permits text and data mining research on a corpus that wasn't legal when it was created, like Sci-Hub.

236
00:53:02.101 --> 00:53:12.210
Taylor Faires: Since what the exemption really permits is breaking digital locks for the purposes of conducting text and data mining research, DRM-free materials might be good candidates for this kind of research,

237
00:53:12.631 --> 00:53:30.220
Taylor Faires: but researchers can understandably be uneasy about doing their research, particularly if it's research that they'd like to publish, on corpora that weren't legally compiled. Finally, how are we to navigate these license issues and their interactions with the text and data mining exemption?

238
00:53:30.651 --> 00:53:40.281
Taylor Faires: As I said earlier, in many cases, and particularly where it comes to ebooks, copyrighted works come with licenses that forbid breaking DRM for any purpose.

239
00:53:40.461 --> 00:53:52.791
Taylor Faires: So how can researchers contend with that requirement while still being empowered to do the research that they want to do, contributing to the scholarly discourse and the progress of knowledge, serving the purposes of copyright?

240
00:53:53.301 --> 00:54:03.921
Taylor Faires: And how can scholarly communications librarians, scholarly communications officials, and librarians support these researchers and contend with license terms that they may not have much control over?

241
00:54:04.711 --> 00:54:08.160
Taylor Faires: Okay, on that cheerful note, we will open it up for questions.

242
00:54:19.371 --> 00:54:40.690
Taylor Faires: So I think we can do questions a couple of ways. There are a handful of these sort of table mics available, and if you just press the main button you can speak into them. But I do have a more portable mic for those that are not near one of those table mics. But yeah, are there questions and it might make sense-

243
00:54:41.591 --> 00:54:48.900
Taylor Faires: Would you like to stand up here? Yeah, that would make the most sense. Thank you so much. Again, yeah, definitely.

244
00:54:52.591 --> 00:55:11.800
Taylor Faires: So one thing, can I ask you a question? So I'm curious about the extent to which you are either using or getting questions about artificial intelligence applications in research. And just kind of what those look like, what those questions look like to you?

245
00:55:18.651 --> 00:55:22.751
Taylor Faires: Hang on. My name's Jessica Harris.

246
00:55:23.511 --> 00:55:32.871
My name is Jessica Harris. I'm the electronic resources management librarian here at the library, and I negotiate a lot of our licenses free content and

247
00:55:32.921 --> 00:56:00.301
Taylor Faires: I don't really have an answer to that question, but I will say that starting towards the end of last year, we started seeing that language against AI pop up in all of our license agreements, or a lot of our license agreements. So it's something that we're working to try to negotiate out. But it's an uphill battle. Yeah, I've heard that before from others. And I think one of the things that's happened with text data mining research is, it used to be this kind of niche thing people weren't paying that much attention to

248
00:56:00.301 --> 00:56:07.630
and the publishers weren't all that concerned about it until they realized they could make some money licensing, and then they would, you know, sell these nice

249
00:56:07.691 --> 00:56:25.360
Taylor Faires: hefty licenses. But all of a sudden it's become almost political, like they are terrified of losing out on dollars from artificial intelligence applications, and text data mining has kind of gotten sucked into that

250
00:56:25.681 --> 00:56:28.701
and I think we've seen that at the

251
00:56:29.151 --> 00:56:38.800
Taylor Faires: like before the Copyright Office, we're asking for this expansion. And all of a sudden there's this surge of interest from

252
00:56:38.911 --> 00:56:39.681
kind of

253
00:56:39.961 --> 00:57:07.021
Taylor Faires: publishing and motion picture lobby groups, indicating, like, they're paying a lot of attention. They didn't even blink last time around with the first exemption for text data mining. And now, all of a sudden, this is like a big deal to them. I will say, the first 2 licenses that we got that had that language, they both have AI products, that they're, yeah

254
00:57:07.061 --> 00:57:17.610
Taylor Faires: vendors who are publishers who are doing an awful lot of open access publishing now, and it's a little perplexing because, you know, those

255
00:57:17.651 --> 00:57:30.421
Taylor Faires: like a large chunk of that subscription content is also OA, and they have permissive licenses. So presumably, you know, that content's out there and available for training, but not in their sandbox or their ecosystem

256
00:57:37.921 --> 00:57:39.370
Taylor Faires: Other responses to

257
00:57:39.471 --> 00:57:41.461
Taylor Faires: this question? Yeah.

258
00:57:46.021 --> 00:57:51.870
Taylor Faires: I can personally say, I have not gotten a question about the legality of using

259
00:57:52.931 --> 00:58:08.690
Taylor Faires: content to train AI. I have gotten not a lot of questions around whether or not AI is copyrightable, but I have answered that question.

260
00:58:08.841 --> 00:58:12.710
Taylor Faires: But it's pretty piecemeal from the questions I've gotten.

262
00:58:18.461 --> 00:58:23.571
Taylor Faires: I wonder if there's

263
00:58:23.931 --> 00:58:28.951
Taylor Faires: what you guys think about how you know works that are

264
00:58:29.001 --> 00:58:32.151
acquired through licenses or works that

265
00:58:32.191 --> 00:58:46.361
Taylor Faires: you know, currently fall under the DMCA exemption, where they would be purchased by a researcher and digitized, you know, in house, and thus are, you know, under those terms, sort of owned by the institution.

266
00:58:46.371 --> 00:58:49.190
Taylor Faires: if those kinds of works are used

267
00:58:49.341 --> 00:58:52.690
Taylor Faires: for training data or whatnot like would that?

268
00:58:52.881 --> 00:59:05.120
Taylor Faires: Does that feel like that'd be a separate category, since this wouldn't necessarily enter into, at least under the current exemption, there seems to be a separate allowance for

269
00:59:05.351 --> 00:59:30.520
Taylor Faires: things that are owned and digitized like within the institution and then kept within the institution. Right? Yeah, I mean. So you've got 2 layers of analysis. Well, really, 3 layers of analysis. Right? So first, you have copyright law, and what does it allow you to do, and what kind of defenses exist, and fair use is a clear one. And I think that clearly applies to those sort of applications

270
00:59:30.521 --> 00:59:39.281
Taylor Faires: regardless really of ownership. It's more about the use. Then you've got the DMCA issue. Are you breaking DRM?

271
00:59:39.601 --> 00:59:58.391
Taylor Faires: And if you are, do you fit within one of those exemptions? That does require the institution to own the copy. What's interesting is, it doesn't say when the institution has to own the copy. And so, you know, you could have a transfer happen kind of at any time.

272
00:59:58.721 --> 01:00:11.591
Taylor Faires: So I don't know that that's critical. You know, does the institution have to own the copy before the circumvention happens? I don't know that it defines that, but that's the second layer, and then the third layer is

273
01:00:11.711 --> 01:00:24.981
Taylor Faires: contracts, and contracts are what just sort of buggers everything up, because they can go in and say: that's great, you've got some fair use rights, and the Copyright Office said you can do this, but we say no, and if you violate our contract

274
01:00:25.061 --> 01:00:28.620
Taylor Faires: you know we're never gonna give you access to anything ever again.

275
01:00:28.641 --> 01:00:30.801
Taylor Faires: I think realistically, that's the biggest

276
01:00:30.881 --> 01:00:36.161
Taylor Faires: threat for researchers and universities is not

277
01:00:36.361 --> 01:00:46.450
Taylor Faires: a big scary lawsuit or other kind of enforcement action. I think the biggest threat is that it gets

278
01:00:47.001 --> 01:00:57.200
Taylor Faires: really hard when you have a really big, important vendor who's gonna say you do it by our terms, or we're not gonna license to you. And

279
01:00:57.221 --> 01:00:59.151
Taylor Faires: I think

280
01:01:00.091 --> 01:01:04.230
Taylor Faires: like when I talk with people at the copyright office, and when I talk with

281
01:01:04.811 --> 01:01:06.121
folks

282
01:01:06.771 --> 01:01:16.870
Taylor Faires: who aren't as familiar with university licensing and how that works, they seem to think that big institutions have a lot more sway than they do. And yeah.

283
01:01:17.121 --> 01:01:29.260
Taylor Faires: Like, I've been in that position, trying to negotiate for a big, kind of similarly sized institution. And you know, we have more sway than like the community college down the street, but

284
01:01:29.301 --> 01:01:51.781
Taylor Faires: still not that much sway. And if, you know, Elsevier or Wiley, or, you know, go down the list, decides that this is an important business interest, they won't hesitate to say, "I'm sorry. I know that your faculty are gonna demand reading access to this. I'm gonna force you to agree to these terms for this new stuff." And that's just what it is. And

285
01:01:52.241 --> 01:01:56.551
Taylor Faires: I mean, it's hard when you have a kind of must have resource.

286
01:01:56.971 --> 01:02:03.221
Taylor Faires: because the only good way to negotiate those is being able to walk away, and it's

287
01:02:03.411 --> 01:02:06.690
Taylor Faires: very difficult to walk away from some of those resources.

288
01:02:08.081 --> 01:02:26.601
Taylor Faires: Sorry, were you talking also about scanning a book? Yeah, I was also interested in that, too, because, you know, if we take the library's resources as they are, print copies, and if we digitize those, that feels like a separate channel. It's a lot more work, obviously. But

289
01:02:26.711 --> 01:02:53.620
Taylor Faires: it seems to open up possibilities. And that does take out both the license issue, because you own the physical book, and the DMCA issue, because you haven't broken any DRM. And, in fact, when we initially got the exemption back in 2021, we talked a lot about OCR and scanning as this kind of alternative that wasn't really adequate for most people. So the cumbersomeness of that is one reason that we got the exemption in the first place.

290
01:02:54.361 --> 01:03:10.371
Taylor Faires: So it's for that reason that, you know, HathiTrust, for those of you who aren't familiar with, like, the HathiTrust Research Center, HTRC, is a really awesome resource, because it's built on all of these scanned books. It doesn't provide as much

291
01:03:10.451 --> 01:03:16.731
Taylor Faires: flexibility as I think a lot of people wish because it's kind of constructed trying to keep

292
01:03:16.811 --> 01:03:45.910
Taylor Faires: rights holders at bay. They also have contracts with Google that they have to contend with which funded a lot of the digitization. But it at least cuts through the DMCA issues and most of the contract issues. It's a little different cause it's not a contract issue with the publishers. In that case, it's the contract with Google and Google has some interest in preventing kind of unlimited mass access because they have a commercial interest in using that data for their own purposes.

293
01:03:57.291 --> 01:04:21.230
Taylor Faires: So this is immediate to sort of data scanning at mass. But I had a question about copyright and critical editions: is the critical edition of a classical work itself copyrightable, or just the commentary? And I mean, with fair use, that will make some of this irrelevant, but I'm just wondering if a critical edition is like

294
01:04:21.571 --> 01:04:29.630
Taylor Faires: per se free and in the clear, or if you have to sort of make sure that you've got a fair use exemption. So for the underlying work

295
01:04:29.771 --> 01:04:31.730
Taylor Faires: like, if you're reproducing

296
01:04:32.611 --> 01:04:38.660
Taylor Faires: a book, the whole volume. Yeah. So Plato is in the public domain. We didn't get into this

297
01:04:38.801 --> 01:04:59.481
Taylor Faires: timeline of the public domain. But when you're that old you don't have to worry too much. But a critical edition, yeah, the text that you're adding to it, the commentary, all of that does get a copyright in it. And so whoever created that would have the rights in it. And, in fact, that's the way a lot of publishers will kind of

298
01:04:59.481 --> 01:05:14.771
Taylor Faires: invent a copyright for works that are in the public domain: they'll kind of sprinkle a little commentary here and there, and not to diminish that, but a lot of people aren't reading it, really, for that. Critical editions are a little different 'cause they're like much meatier. But sometimes you'll see like a

299
01:05:14.791 --> 01:05:26.561
Taylor Faires: kind of super light critical edition that just has a little bit here and there, and that's the way they try to achieve a certain amount of protection over it.

300
01:05:26.721 --> 01:05:34.740
Taylor Faires: Yes. And the translation definitely would be copyrighted. Yeah.

301
01:05:35.001 --> 01:05:36.571
Taylor Faires: just in the text

302
01:05:37.501 --> 01:05:58.481
Taylor Faires: Yeah, so that gets at a really important issue of the distinction between the content that's added and the selection and arrangement. So if you have a really short statement like that, it's probably not independently copyrightable by itself. But when you put a whole bunch of those things together across

303
01:05:58.837 --> 01:06:18.090
Taylor Faires: a whole work, that's where you probably do have some creative selection and arrangement of those elements. And so you would have a copyright in sort of the whole thing. And so I think the upshot for downstream users is, if someone reproduces just that comment, or maybe just that page,

304
01:06:18.091 --> 01:06:34.301
Taylor Faires: probably not a big deal. If they reproduce the whole thing, then they might start to run into copyright issues because it encompasses the whole selection and arrangement of the work, and usually a good way to think about copyright and creativity in

305
01:06:34.471 --> 01:06:51.790
Taylor Faires: in that context is, was there a real choice there about kinda how to put this together? And if you have a lot of options and a lot of choices on how to express yourself, then there's probably at least some thin copyright protection.

306
01:06:56.390 --> 01:07:13.450
Taylor Faires: Actually, just one more point on that: Rachel raised this issue of how you publish your research in this context with text data mining. And that actually comes up a lot with, like, how much is too much to reproduce when you're writing an article, or, you know, increasingly, people are

307
01:07:13.541 --> 01:07:32.671
Taylor Faires: wanting to share portions of their data set, or, you know, do something a little bit more. And it raises questions of, you know, how much can you copy with commentary kind of layered over it and still be okay. That's very context specific. I'd say definitely, if anyone's thinking about

308
01:07:32.671 --> 01:07:45.350
Taylor Faires: kind of reproducing and publishing a whole data set, that's where you wanna check in, you know, with somebody, university counsel or somebody, to get a take on how much is too much, and what kind of

309
01:07:45.521 --> 01:08:11.581
Taylor Faires: checks you want to have in place. So sorry. Go ahead. So my question is on this theme of maybe sharing research, or specifically thinking about collaborations between educational institutions and private companies. There's a lot of research, especially in GenAI, coming out where, you know, Google Brain folks, OpenAI folks are working with academic researchers. My understanding is that the 2021 exemption will sort of

310
01:08:11.581 --> 01:08:31.501
Taylor Faires: cover the academics who want to access and sort of override some of this DRM. Maybe this is a little bit about the scope, but have you thought about some of those collaborations? What are some of the legal parameters? Or at least, you know, maybe, as the academic, approaching this partnership that might then turn into some commercial vehicle. I'm just curious to hear your thoughts.

311
01:08:31.651 --> 01:08:52.290
Taylor Faires: Yeah. Well, so first, the DMCA exemption strictly applies to people at institutions of higher education, and the current exemption doesn't even allow sharing really outside, with a couple of minor exceptions, like if you're collaborating on that particular research project, like, you know, you're writing a paper together

312
01:08:52.291 --> 01:09:07.601
Taylor Faires: or if someone's just coming in to do some verification behind you. So all of that is limited, and there's no real possibility under the exemption of sharing with, you know, say Google or OpenAI, or whoever.

313
01:09:08.311 --> 01:09:09.281
So

314
01:09:09.501 --> 01:09:16.101
Taylor Faires: And even the expansion that we're asking for wouldn't go that far. We're just asking for sharing among

315
01:09:16.151 --> 01:09:28.450
Taylor Faires: institutions of higher education. So you know, if you're here at the University of Chicago, and then there's somebody like up at Northwestern who has a similar set of research questions, you could share the data with them.

316
01:09:28.461 --> 01:09:33.361
Taylor Faires: So that's the DMCA piece. But then there's the bigger question of, like, what if I create a corpus,

317
01:09:33.561 --> 01:09:40.400
Taylor Faires: forget DMCA, but just, it was done under the auspices of fair use, and then I share it with a for-profit company that wants to do

318
01:09:41.161 --> 01:09:44.670
Taylor Faires: something for profit with it. How does that change the analysis?

319
01:09:44.841 --> 01:09:57.090
Taylor Faires: I don't really know. I think these lawsuits are gonna get at that issue. And Rachel, chime in if you have thoughts, but

320
01:09:57.411 --> 01:10:10.561
Taylor Faires: I think that's one of the bigger issues is, how much will the courts weigh commercial interest versus nonprofit interest? I think certainly nonprofit uses are the safest. But that being said, the courts have

321
01:10:10.561 --> 01:10:33.181
Taylor Faires: on numerous occasions said it's fair use to engage in these types of uses for commercial activity as well. Google Books is like the prime example of that. Google didn't just do that out of the kindness of their heart. They got this massive data set that they could use to improve their search engine and probably, in the back of their mind, like, use for AI applications. And

322
01:10:33.181 --> 01:10:44.590
Taylor Faires: I don't know exactly what Google is doing internally, but I would be shocked if they're not using the Google Books corpus for training some of their internal work on large language models.

323
01:10:46.061 --> 01:10:55.831
Taylor Faires: Yeah, commerciality is tricky. And I was just thinking as you're asking your question, like, I think we should have said non-commercial a lot more in this presentation.

324
01:10:55.891 --> 01:11:18.831
Taylor Faires: Under the exemption, like, the exemption is for non-commercial, nonprofit, academic uses, and the Copyright Office and sometimes courts put a lot of emphasis on whether something is educational, nonprofit, or commercial, or a little bit of both. And I think what's gonna happen with this little bit of both is a really interesting question that will play out in these lawsuits

325
01:11:18.851 --> 01:11:22.450
Taylor Faires: because with GenAI, it's like so often there's

326
01:11:22.581 --> 01:11:27.910
Taylor Faires: academic research and then there's a, you know, private sector application.

327
01:11:28.571 --> 01:11:33.690
Taylor Faires: I don't know the answer, but it's gonna be interesting to see courts try.

328
01:11:34.211 --> 01:12:03.320
Taylor Faires: I think a really big deal historically in whether commerciality matters or not is how transformative the court views the use. So if they view it as essentially a non-transformative use, it's just, you know, somebody's out there already doing this and producing works for this purpose, then they have found commerciality to matter a lot more. So, where this gets really dicey, I've encountered this a bunch of times, mostly with people in

329
01:12:03.321 --> 01:12:26.251
Taylor Faires: like business schools or medical schools, where they create like a workshop and they're doing something where they're incorporating images and text from other sources. You know, PowerPoint slides, for example. And it's one thing if you're doing that at the University of Chicago, it's another thing when you go out and you charge, you know, doctors

330
01:12:26.261 --> 01:12:43.880
Taylor Faires: $5,000 as part of a kind of ongoing, continuing education program for a for-profit enterprise. And so that's where I get really nervous, because it's not really a transformative use in those cases. But for transformative uses, the courts have been relatively willing to say, yeah, those can be

331
01:12:43.891 --> 01:12:48.540
Taylor Faires: still okay when they're fair use. And Google Books is a good example.

332
01:12:48.871 --> 01:13:00.091
Taylor Faires: Reverse engineering cases are almost always transformative, but also commercial. The big one was actually in the nineties, a case where Sega, you know, the Sega

333
01:13:00.291 --> 01:13:17.181
Taylor Faires: Genesis game system sued a competitor who had reverse engineered some of their game cartridges so that they could do other cool things with them, and totally commercial, and also totally okay, because it was a totally new application for that use.

334
01:13:23.671 --> 01:13:32.200
Taylor Faires: One more question on the generative AI stuff. What are you seeing in terms of

335
01:13:32.751 --> 01:13:56.630
Taylor Faires: you know, clearly, like the commercial, you know, products, OpenAI, ChatGPT, where there's a fear that works will be, you know, thrown into that data and then somehow resurface? I can see why that's very problematic. Are you seeing any conversation about kind of open source models and/or a distinction around that, where, you know, a researcher could potentially have a local instance of an LLM

336
01:13:56.941 --> 01:14:02.210
Taylor Faires: that they use and thus are using? You know, again, it's a way to kind of wall off

337
01:14:02.391 --> 01:14:18.120
one's research from the public and potential commercial uses. I'm just curious, like, there is a lot of discourse sort of in the air in the AI community, but how much of that do you hear about in terms of these discussions?

338
01:14:19.161 --> 01:14:39.770
Taylor Faires: I've definitely heard of it. I haven't seen it yet, like, I haven't seen the product yet. Right? And so, I mean, for those of you unfamiliar, you kinda have this base underlying model, right? So GPT, GPT-3, GPT-4. And then you have particular systems that are implementing that and giving people an opportunity to interact, like ChatGPT.

339
01:14:39.771 --> 01:14:59.351
Taylor Faires: And so I have heard of some universities saying, could we set up our own sort of instance of you know a ChatGPT internally, that's walled off, and so we know that when people are asking it questions, that those questions, the prompts aren't being fed back into the data pool.

340
01:14:59.441 --> 01:15:12.331
Taylor Faires: That's definitely going on. I know I've had conversations with folks; there's a whole initiative at Columbia working on that, and I think at Berkeley there's a group of people working on that.

341
01:15:12.701 --> 01:15:40.560
Taylor Faires: Yeah. Michigan had a whole big thing set up. Everyone's kind of racing to do that right now. The other thing I've heard is actually an effort to not just set up their own systems, but to develop their own data sets to train their own models, which is a much bigger undertaking. But the concern is, if these models are potentially gonna get really expensive or disappear because of these lawsuits,

342
01:15:40.561 --> 01:15:49.001
Taylor Faires: maybe it's a good idea to create a kind of public interest nonprofit version of this. Perhaps that's focused on

343
01:15:49.001 --> 01:16:11.600
Taylor Faires: a legally cleaner or more defensible data set. So, for instance, instead of going out and using Books3, that is, pulling books from all these pirate websites, could we find a pathway to use books that have been purchased by libraries, for example, and digitized under a clean, fair use rationale?

344
01:16:12.201 --> 01:16:33.481
Taylor Faires: That's a really big undertaking, but I think there's a lot of interest in that, both from a kind of self-preservation standpoint as well as a public interest standpoint, with the idea that it's a bad idea to have all of these models developed and controlled by companies that may well end up in kind of an

345
01:16:33.681 --> 01:16:40.851
Taylor Faires: oligopoly or monopoly sort of position, because they're the only ones who can afford to fend off the lawsuits.

346
01:16:46.281 --> 01:16:51.270
Taylor Faires: I had a quick question, kind of going off of what Hoyt said, but I've read a little bit about

347
01:16:51.551 --> 01:17:05.491
Taylor Faires: people using or wanting to use, like, artificial data sets, like when you don't have enough, you make more. That's sort of derivative, but representative of your much smaller data set. Does that come into play in this conversation at all?

348
01:17:06.561 --> 01:17:23.851
Taylor Faires: Yeah, I've heard some about that using synthetic data, basically and it seems to me you still have an underlying copyright question somewhere in there. It's just that you're one or 2 or 3 orders removed from it.

349
01:17:25.641 --> 01:17:29.060
Taylor Faires: I haven't figured out kind of

350
01:17:30.431 --> 01:17:49.860
Taylor Faires: what that lawsuit would look like, though. But yeah, it's definitely a trend. And one of the concerns, though, is, if you use too much synthetic data, you start to lose some of the value of being able to kind of reflect human language. And

351
01:17:49.971 --> 01:17:58.941
Taylor Faires: you get these weird hallucinations like you already see in some of the other systems. Do you have any thoughts on that? Have you thought much about that synthetic data issue?

352
01:17:59.101 --> 01:18:00.341
Taylor Faires: It's interesting that

353
01:18:04.551 --> 01:18:07.451
Taylor Faires: there is a trend to

354
01:18:08.251 --> 01:18:24.801
Taylor Faires: really push heavily on public domain content and things for which there are, like, open licenses. That has its own challenges. Actually, we're doing a little project right now with some people who are interested in writing about some of the bias

355
01:18:24.831 --> 01:18:50.021
Taylor Faires: that gets introduced when you only use what a researcher at Georgetown calls, what was it, low friction data, which kinda just means stuff that's easy to get your hands on and you're not gonna get sued over. And when you do that you could introduce all sorts of biases. Like if you just did public domain content, which we know is mostly old stuff, you mostly exclude

356
01:18:50.211 --> 01:19:15.290
Taylor Faires: a whole diversity of writers who have produced content over the last 100 years or so. And so it's like, if you want to know how rich white dudes from England would have talked in, you know, the nineteenth century, you can focus on public domain content, but if you want something to be a little more robust and reflective of society today you have to have other content introduced in there

357
01:19:18.721 --> 01:19:23.480
Taylor Faires: We have time. We are pretty close. Yeah. Do you need time for one more question?

360
01:19:30.161 --> 01:19:31.381
Taylor Faires: One more question.

361
01:19:33.391 --> 01:19:50.521
Taylor Faires: 'Cause I'll take it otherwise. I will make a little advertisement. So if you're interested in this stuff, we're doing a lot on it. So we have a whole set of things that we're working on this spring,

362
01:19:50.521 --> 01:20:11.810
Taylor Faires: including more work on this DMCA exemption. We're also working on this AI kind of thing around bias in data sets. And we're gonna have some more things spun up over the next year. So stay in touch. If it's of interest to you. And you know, check out our website, because,

363
01:20:11.981 --> 01:20:15.761
Taylor Faires: it's a really dynamic area. And I think right now, there's a lot of

364
01:20:15.821 --> 01:20:45.571
Taylor Faires: uncertainty and kind of fear around it that's driving a lot of these lawsuits and decisions. And I think for academic uses, what I'm really concerned about is that these lawsuits are gonna make it look too scary for administrators. And so they're gonna kind of shut down some of the cool research that could be done with this, that may be not as legally scary as, you know, what you see out in the world with OpenAI and these other things where

365
01:20:45.571 --> 01:20:57.850
Taylor Faires: there's much bigger targets on their back, and reasons why they are being sued that are unlikely to be persuasive reasons for those same rights holders to sue a university

367
01:21:05.301 --> 01:21:06.631
Taylor Faires: There's great meeting deal.

368
01:21:07.341 --> 01:21:18.981
Taylor Faires: I think we're good, then. I'm gonna give these last 3 minutes back to everybody. I know it's been an hour and a half. Thank you so much for coming.

369
01:21:19.101 --> 01:21:28.691
Taylor Faires: I don't think I'm being hyperbolic when I say this is very, very helpful and a wonderful sort of demystifying of the entire realm around this.

370
01:21:29.571 --> 01:21:32.310
So one more round of applause for

371
01:21:35.231 --> 01:21:38.231
Taylor Faires: thank you so much for coming. Yeah, thank you all.

