[flagged] License University of Michigan's Databases of Academic Speech and Student Papers (catalystresearch.io)
87 points by lopkeny12ko on Feb 15, 2024 | 52 comments


The University of Michigan has issued a statement: https://publicaffairs.vpcomm.umich.edu/key-issues/licensing-...


So the OP is (probably illegally) reselling free stuff that was published on the web 10 years ago. How did this get so much traction so quickly?


1. I am not affiliated with the organization in the submission.

2. It gained traction quickly because monetization of an egregious violation of student privacy is discussion-worthy.


This has to be a joke right?

-----------------------------------------

One-time License Fee

Academic Speech Dataset: $15,595

Academic Papers Dataset: $12,595

Combined Academic Speech & Papers Datasets: $25,000

---------------------------------------------------

Speech Dataset: 85 hours of high-quality audio recordings in mp3 format

Papers Dataset: 829 student papers in txt format


Charge a student $800 in tuition and fees to attend a class. Student writes 3 good papers in the class. License the three papers to 10 AI models for $45 each, for a total of $450. College increases student revenue by >50%. This setup is clearly unfair and feels predatory. Anyone who has been in a graduate lab that was working on a project actually worth money has seen this side of college before.

However, this new marketplace introduces an interesting line of questioning. Can undergraduates/graduate students generate enough novel training data to cover their tuition? Will the 21st-century scholar's most valuable output be high-quality material for LLMs to digest?

I'd love to see a metric for determining how valuable a document is to an LLM. How accurate is the data? How much new knowledge does it add? How relevant is it to users of the LLM? If we can value a document accurately, we can create a marketplace for individuals to license their creative output.


> Charge a student $800 in tutition and fees to attend a class.

Isn't that off by a factor of ~10? $800 x 4 classes/semester x 2 semesters = $6,400/year. I don't know how much U. of Michigan charges, but I expect it's closer to $64K than $6.4K.


I just checked and Michigan's in-state tuition is currently $17,228/semester[0]. Yeah, the cost of tuition is very wrong. Using this new number, which is just a more accurate estimate, the student tuition for the class is about $4,300, and ten licenses for an output of 3 papers would be a ~10% increase in revenue per student.

[0] https://admissions.umich.edu/costs-aid/costs
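For anyone who wants to rerun the numbers in this subthread, here is a minimal sketch of the arithmetic. It assumes the figures quoted above ($800 vs. $17,228/semester, 4 classes per semester, 3 papers at roughly $15 each, 10 licensees); the constant and function names are made up for illustration and none of the inputs are independently verified.

```python
# Figures as quoted in the thread; treat every one of them as an assumption.
SEMESTER_TUITION = 17_228   # in-state tuition per semester, as quoted above
CLASSES_PER_SEMESTER = 4    # assumption carried over from the parent comment
PAPERS_PER_CLASS = 3        # papers a student writes in one class
PRICE_PER_PAPER = 15        # roughly $12,595 / 829 papers
LICENSEES = 10              # hypothetical number of AI models licensing the papers


def license_uplift(tuition_per_class: float) -> float:
    """Return licensing revenue as a fraction of per-class tuition."""
    licensing_revenue = PAPERS_PER_CLASS * PRICE_PER_PAPER * LICENSEES
    return licensing_revenue / tuition_per_class


tuition_per_class = SEMESTER_TUITION / CLASSES_PER_SEMESTER
original = license_uplift(800)                # the original $800/class guess
revised = license_uplift(tuition_per_class)   # using the quoted tuition figure

print(f"at $800/class: {original:.1%}")
print(f"at ${tuition_per_class:,.0f}/class: {revised:.1%}")
```

Under these assumptions the uplift is roughly 56% at $800 per class and roughly 10% at about $4,300 per class, which matches both comments.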


Graduate tuition is usually even more expensive than undergrad. So I bet graduate students are paying ~15-20% more than this from my previous experience of graduate tuition.

This data dump consists of research papers, which are going to come primarily from graduate/doctoral students.


It says the papers are from juniors, seniors, and grad students, or something like that.


In-state tuition 16,736 USD, Out-of-state tuition 55,334 USD

Edit: sibling comment is more accurate and provided a link. Ignore these numbers, but note they are close.


It's kinda nice to see things like this causing us to rethink the value of our time and relations with each other. Even slaves just work for nothing. Students are the only group of people who pay to work.


The question I'm interested in is whether students at the University of Michigan are allowed to opt out of the sale of their own data, or to request a percentage of the profits from the University selling their work afterward.


Many second tier colleges are having a recruiting problem. I think a license share would be an appealing selling point to attract new students.


People really do not understand the value of their data. It’s incredible. They think in 10s and 20s, but the reality is 1,000s.

Wouldn’t it be nice if we owned the work we created instead of always having it assigned to the nearest extractive company or institution?

Big money understands. Just look at how agritech data is getting locked up by companies — the value of the data is astounding.

Much of the university stuff is free for research or non-profit use, which is great, but if you want to use it commercially you have to pay big bucks. And not everyone who uses it commercially pays, let’s be honest. Ethical behavior isn’t the pursuit of everyone in the management class.


If you do the math, it’s about $15/paper, not thousands. And the papers are really only valuable in aggregate.
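A quick sketch of that math, using only the prices and sizes quoted from the listing earlier in the thread. The variable names are illustrative, and the per-hour figure for the speech dataset is an extra calculation that nobody in the thread actually made.

```python
# Listing figures as quoted in the thread (not independently verified).
PAPERS_PRICE = 12_595   # Academic Papers Dataset, one-time license fee in USD
PAPER_COUNT = 829       # student papers in txt format
SPEECH_PRICE = 15_595   # Academic Speech Dataset, one-time license fee in USD
SPEECH_HOURS = 85       # hours of mp3 audio

price_per_paper = PAPERS_PRICE / PAPER_COUNT
price_per_audio_hour = SPEECH_PRICE / SPEECH_HOURS

print(f"per paper: ${price_per_paper:.2f}")
print(f"per audio hour: ${price_per_audio_hour:.2f}")
```

That comes to a little over $15 per paper and a bit over $180 per hour of audio for a single license.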


I wasn’t making reference to the value of the papers, but the larger theme of people misunderstanding the value of their data.


There's an exchange of value: the organization that can use your data provides some value to you, which is why you use it instead of just doing things on your own. You can actually just do things on your own and keep all the IP for yourself, even academic research in many low-capital-intensive fields, as long as you're actually doing it for the science and not for stupid credentials or stupid zero-sum career progression like most academics seem to.

I've often used university research commercially. I just read a published paper and apply it to my business. No permission, no payment, it's just wide open free for anyone.


It's a joke that they are selling it for so cheap.


The recordings in mp3 format, especially transcribed, could be quite valuable for testing.


I'm surprised they are paying for this. There's a Swedish company just requesting this from schools since it's public information so they have to provide it by law.


Student homework is public domain in Sweden?


Surprising, how does that square with GDPR?


US-based federal entities have a completely different set of rules (see: https://www.foia.gov/). And the educational system gets federal funds, so some of its data is public. There are likewise public records laws at the state level in many cases, which open up even more data to the public.


Effectively, GDPR only applies to EU nationals. US folks have no such rights, and anyone even remotely associated with an institution like U of M is forced to agree to whatever data terms and policies U of M wants.


> no such rights

Here is the US privacy act for individuals: https://www.justice.gov/opcl/privacy-act-1974

It contrasts with FOIA and public records acts, of course, so there is a wealth of rabbit holes to go down on exactly where lines get drawn, but saying "no such rights" is simply inaccurate. If you work with public institutions you get pulled in both directions.


What they should be selling is a sport franchise license. Disconnect sports from the university but keep the facade. Let another entity incur the costs that will be coming down the road but still get the other benefits, financial and otherwise. I'd love to say that this would bring down the cost of an education but I wasn't born yesterday.


I thought college sports are often profitable (or at least not that expensive) at least partially since not paying athletes saves them a lot of money


Most university athletic programs are unprofitable, if examined solely through the lens of outlays vs direct revenue. It's of course harder to measure the consequences on recruitment, alumni attachment to the school, etc.

https://www.insidehighered.com/blogs/just-explain-it-me/shou...

> The 2020 report found only 25 Division I programs had revenues exceeding expenses. No Division II or III program had revenues exceeding expenses. There are 1,102 Division I, II and III schools.

I was active in budget committees as a student, and while my alma mater was in Division I, it was always in the red. It's a really hard decision for the student services committee every year to raise fees or cut funding, and the pressure from the athletics program to cut their budget last was intense. We tried to keep fee increases at or under the "higher education price index", but that itself is a flawed measure as it consistently is greater than CPI.


The not-paying-college-students thing is changing a lot with the new name, image, and likeness rules.


It's not clear how the "Catalyst Research Alliance" is associated with University of Michigan. It's not hosted on umich.edu, there's no information on https://catalystresearch.io [0] about any agreements or partnership with UMich, and there's not even any information at all on any people running "Catalyst Research Alliance" -- which is normal for shady SaaS startups but very very unusual for academic research alliances. It's not clear that Catalyst Research Alliance obtained the rights to "re-sell" the database.

I believe the datasets are available for free here on UMich.edu[1], so it's not clear that "CRA" has a "right" to "license" these to anyone. CRA also "licenses" this freely available genomics database: https://ctdbase.org [2]

IANAL, but per LinkedIn[3], the CRA founder Doug Eisner is one (Boston University Law, 1992). Perhaps he knows something I don't about selling licenses to other people's intellectual property.

The MICASE license at umich.edu states:

MICASE Fair Use Statement[4]: MICASE is owned by the Regents of the University of Michigan, who hold the copyright. The database has been developed by the English Language Institute, and the web interface by Digital Library Production Services. The database is freely available at the MICASE website for study, teaching and research purposes, and copies of the transcripts may be distributed, as long as either this statement of availability or the citation given below appears in the text. However, if any portion of this material is to be used for commercial purposes, such as for textbooks or tests, permission must be obtained in advance and a license fee may be required.

MICUSP Fair Use Statement[5]: The Michigan Corpus of Upper-Level Student Papers (MICUSP) is owned by the Regents of the University of Michigan (UM), who hold the copyright. The corpus has been developed by researchers at the UM English Language Institute. The corpus files are freely available for study, research and teaching. However, if any portion of this material is to be used for commercial purposes, such as for textbooks or tests, permission must be obtained in advance and a license fee may be required.

The Comparative Toxicogenomics Database[6]: CTD™ is provided to enhance knowledge and encourage progress in the scientific community. It is to be used only for research and educational purposes. Medical treatment decisions should not be made based on the information in CTD. Any reproduction or use for commercial purpose is prohibited without the prior express written permission of the MDI Biological Laboratory and NC State University.

So the HN headline is slightly wrong. The University of Michigan is making this data available for free for non-profit or academic research purposes, and also offers custom commercial licensing for the datasets. It's possible that CRA is running a legitimate license reselling service with dubious value-add services on top, but it's definitely not clear what their relationships with these universities really are.

0: https://catalystresearch.io

1: https://lsa.umich.edu/eli/language-resources/micase-micusp.h...

2: https://ctdbase.org

3: https://www.linkedin.com/in/dougeisner/

4: https://lsa.umich.edu/content/dam/eli-assets/elidocuments/MI...

5: https://lsa.umich.edu/content/dam/eli-assets/elidocuments/MI...

6: https://ctdbase.org/about/legal.jsp


Great research! Hilarious that the dataset link you provided just happens to have the exact same number of texts available for download as are available in this "commercial" dataset:

> Showing 1 to 20 of 829 papers


> Hilarious that the dataset link you provided just happens to have the exact same number of texts available for download as are available in this "commercial" dataset

I don't understand - didn't the GP say it is the same database?


In case it wasn't obvious, I meant no sarcasm when I said great research. The GP did a fabulous job digging up context and researching where this data may have been sourced from. Yes, they did say it looked like the same dataset (I assume the GP has not actually purchased the commercial data set to confirm, but like me, saw that the number of texts is coincidentally exactly the same between the two)


I did very much appreciate your appreciation. Mostly though I was just a bit more thorough than others would be. If I was clever at all, it was only in identifying what exactly would be best to copy/paste. I keyed in on 'Michigan Corpus of Academic Spoken English' and 'Michigan Corpus of Upper-Level Student Papers' on this page[0] and the simplest copy/paste google searches led me directly to the UMich page for the datasets which share that exact name[1].

I did notice the rest of the details were identical, but honestly had already lost good-faith and stopped 'evaluating' once I saw the perfectly matching dataset names. I have very little faith in the tech / SaaS industry, and generally I am quick to assume people are grifters. Particularly for sites with no social proof established by reputation from a wide userbase and a convincing 'about us' with actual people's bios and real contact information.

I get irrationally angry about for-profit license violations of 'free-ish' intellectual property, because of how hard capitalists come down on people like Aaron Swartz, Kim Dotcom, and scores of teenagers using Napster or Bittorrent when I was young. So I plunge a bit deeper into them than is rational. I strongly identify with superficially irrational efforts like Naomi Wu's[2].

0: https://archive.is/QLqZT / https://llm-academic-speech-data.catalystresearch.io/

1: https://lsa.umich.edu/eli/language-resources/micase-micusp.h...

2: https://www.youtube.com/watch?v=Vj04MKykmnQ&vl=en


It always makes me sad to see a university selling out so hard. Profit profit profit! Learning? Knowledge? Education? Only if it leads to more profit for students… but especially if it means more profit for the university.


Did the students opt in?


Discourse on the web suggests students are not even aware of this. So probably not.


Where can I find some of the discourse?


I'm pretty sure there's a clause buried somewhere in something the students signed.


The American education system is absolutely insane: paying to get fucked!


EVERYTHING in America is designed to fuck you as hard as possible unless you can pay. Literally everything is about incentivizing you to give as much of your personal cash up as possible to lessen the shit being shoveled onto you.

This is what happens when you abandon people and enshrine "greed is good" as a National philosophy for 40 years.


Unfortunately, like seemingly every aspect of North American culture, this philosophy is being exported and accepted in other countries too, like Egypt as one example.

“Patterns of behavior that were highly regarded in a more stable society such as sticking to one’s word or promise, pride and personal integrity, are now less prized. Such values are less fit for a rapidly changing society where loyalty to old relationships, be they friends, spouses, places or principles appears as nothing more than an excessive sentimentality unbecoming in one who is on the make.” - Galal Amin, in Whatever Happened to the Egyptians


I wonder if students could put copyright clauses on their papers.


Copyright is always all rights reserved by default. No need to put a clause on a paper.


A license, perhaps?


Yes.


What is it with that university and questionable practices...


Can we have copyright recognized on prose works delivered orally in a public setting?


Typically it's the recording and transcript that's copyrighted.


How did they get ahold of the copyright?


They previously sold athlete data, no one should be surprised by this. https://www.nytimes.com/2016/09/11/sports/ncaafootball/weara...


Did not know that... Wow



