Charge a student $800 in tuition and fees to attend a class. Student writes 3 good papers in the class. License the three papers to 10 AI models for $45 each, for a total of $450. College increases revenue from that student by >50%. This setup is clearly unfair and feels predatory. Anyone who has been in a graduate lab that was working on a project actually worth money has seen this side of college before.
However, this new marketplace introduces an interesting line of questioning. Can undergraduate/graduate students generate enough novel training data to cover their tuition? Will the 21st-century scholar's most valuable output be high-quality material for LLMs to digest?
I'd love to see a metric for determining how valuable a document is to an LLM. How accurate is the data? How much new knowledge does it add? How relevant is it to users of the LLM? If we can value a document accurately, we can create a marketplace for individuals to license their creative output.
> Charge a student $800 in tuition and fees to attend a class.
Isn't that off by a factor of ~10? $800 x 4 classes/semester x 2 semesters = $6,400/year. I don't know how much U. of Michigan charges, but I expect it's closer to $64K than $6.4K.
I just checked and Michigan's in-state tuition is currently $17,228/semester[0]. Yeah, the cost of tuition is very wrong. Using this new number, which is just a more accurate estimate, the student's tuition for the class is about $4,300 (at four classes per semester), and ten licenses for an output of 3 papers would be a ~10% increase in revenue per student.
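As a rough check on the arithmetic in this subthread, here is a minimal sketch. The function name and the four-classes-per-semester split are assumptions for illustration; the $45 fee, the 10 models, and the $800 and $17,228 figures are the ones quoted above.

```python
def license_uplift(tuition_per_class: float, fee_per_model: float, n_models: int) -> float:
    """Licensing revenue for one student's papers, as a fraction of per-class tuition."""
    return (fee_per_model * n_models) / tuition_per_class


# Original guess from the thread: $800 per class, $45 per model, 10 models.
original = license_uplift(800, 45, 10)

# Corrected figure: $17,228 per semester, assumed to be spread over 4 classes.
per_class = 17_228 / 4
corrected = license_uplift(per_class, 45, 10)

print(f"per-class tuition: ${per_class:,.0f}")
print(f"uplift at $800/class: {original:.0%}")
print(f"uplift at corrected tuition: {corrected:.0%}")
```

With a different course load or license fee the percentages shift accordingly, so treat these as back-of-the-envelope numbers only.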
Graduate tuition is usually even more expensive than undergrad. Going by my previous experience of graduate tuition, I bet graduate students are paying ~15-20% more than this.
This data dump is made up of research papers, which are going to come primarily from graduate/doctoral students.
It's kinda nice to see things like this causing us to rethink the value of our time and relations with each other. Even slaves just work for nothing. Students are the only group of people who pay to work.
People really do not understand the value of their data. It’s incredible. They think in 10s and 20s, but the reality is 1,000s.
Wouldn’t it be nice if we owned the work we created instead of always having it assigned to the nearest extractive company or institution?
Big money understands. Just look at how agritech data is getting locked up by companies — the value of the data is astounding.
Much of the university stuff is free for research or non-profit use, which is great, but if you want to use it commercially you have to pay big bucks. And not everyone who uses it commercially pays, let’s be honest. Ethical behavior isn’t the pursuit of everyone in the management class.
There's an exchange of value: the organization that can use your data provides some value to you, which is why you use it instead of just doing things on your own. You can actually just do things on your own and keep all the IP for yourself, even academic research in many low-capital-intensive fields, as long as you're actually doing it for the science and not for stupid credentials or stupid zero-sum career progression like most academics seem to be.
I've often used university research commercially. I just read a published paper and apply it to my business. No permission, no payment, it's just wide open free for anyone.
I'm surprised they are paying for this. There's a Swedish company just requesting this from schools since it's public information so they have to provide it by law.
US-based federal entities have a completely different set of rules (see: https://www.foia.gov/). And the educational system gets federal funds, so some of its data is public. There are likewise public records laws at the state level in many cases, which open up even more data to the public.
Effectively, GDPR only applies to EU nationals. US folks have no such rights, and anyone even remotely associated with an institution like U of M is forced to agree to whatever data terms and policies U of M wants.
It contrasts with FOIA and public records acts, of course, so there is a wealth of rabbit holes to go down on exactly where lines get drawn, but saying "no such rights" is simply inaccurate. If you work with public institutions you get pulled in both directions.
What they should be selling is a sport franchise license. Disconnect sports from the university but keep the facade. Let another entity incur the costs that will be coming down the road but still get the other benefits, financial and otherwise. I'd love to say that this would bring down the cost of an education but I wasn't born yesterday.
Most university athletic programs are unprofitable, if examined solely through the lens of outlays vs direct revenue. It's of course harder to measure the consequences on recruitment, alumni attachment to the school, etc.
> The 2020 report found only 25 Division I programs had revenues exceeding expenses. No Division II or III program had revenues exceeding expenses. There are 1,102 Division I, II and III schools.
I was active in budget committees as a student, and while my alma mater was in Division I, it was always in the red. It's a really hard decision for the student services committee every year to raise fees or cut funding, and the pressure from the athletics program to cut their budget last was intense. We tried to keep fee increases at or under the "higher education price index", but that itself is a flawed measure as it consistently is greater than CPI.
It's not clear how the "Catalyst Research Alliance" is associated with University of Michigan. It's not hosted on umich.edu, there's no information on https://catalystresearch.io [0] about any agreements or partnership with UMich, and there's not even any information at all on any people running "Catalyst Research Alliance" -- which is normal for shady SaaS startups but very very unusual for academic research alliances. It's not clear that Catalyst Research Alliance obtained the rights to "re-sell" the database.
I believe the datasets are available for free here on UMich.edu[1], so it's not clear that "CRA" has a "right" to "license" these to anyone. CRA also "licenses" this freely available genomics database: https://ctdbase.org [2]
IANAL, but per LinkedIn[3], the CRA founder Doug Eisner is one (Boston University Law, 1992). Perhaps he knows something I don't about selling licenses to other people's intellectual property.
The MICASE license at umich.edu states:
MICASE Fair Use Statement[4]
MICASE is owned by the Regents of the University of Michigan, who hold the copyright. The database has been developed by the English Language Institute, and the web interface by Digital Library Production Services. The database is freely available at the MICASE website for study, teaching and research purposes, and copies of the transcripts may be distributed, as long as either this statement of availability or the citation given below appears in the text. However, if any portion of this material is to be used for commercial purposes, such as for textbooks or tests, permission must be obtained in advance and a license fee may be required.
MICUSP Fair Use Statement[5]
The Michigan Corpus of Upper-Level Student Papers (MICUSP) is owned by the Regents of the University of Michigan (UM), who hold the copyright. The corpus has been developed by researchers at the UM English Language Institute. The corpus files are freely available for study, research and teaching. However, if any portion of this material is to be used for commercial purposes, such as for textbooks or tests, permission must be obtained in advance and a license fee may be required.
The Comparative Toxicogenomics Database[6]
CTD™ is provided to enhance knowledge and encourage progress in the scientific community. It is to be used only for research and educational purposes. Medical treatment decisions should not be made based on the information in CTD. Any reproduction or use for commercial purpose is prohibited without the prior express written permission of the MDI Biological Laboratory and NC State University.
So the HN headline is slightly wrong. The University of Michigan is making this data available for free for non-profit or academic research purposes, and also offers custom commercial licensing for the datasets. It's possible that CRA is running a legitimate license reselling service with dubious value-add services on top, but it's definitely not clear what their relationships with these universities really are.
Great research! Hilarious that the dataset link you provided just happens to have the exact same number of texts available for download as are available in this "commercial" dataset:
> Hilarious that the dataset link you provided just happens to have the exact same number of texts available for download as are available in this "commercial" dataset
I don't understand - didn't the GP say it is the same database?
In case it wasn't obvious, I meant no sarcasm when I said great research. The GP did a fabulous job digging up context and researching where this data may have been sourced from. Yes, they did say it looked like the same dataset (I assume the GP has not actually purchased the commercial dataset to confirm, but like me, saw that the number of texts is coincidentally exactly the same between the two).
I did very much appreciate your appreciation. Mostly though I was just a bit more thorough than others would be. If I was clever at all, it was only in identifying what exactly would be best to copy/paste. I keyed in on 'Michigan Corpus of Academic Spoken English' and 'Michigan Corpus of Upper-Level Student Papers' on this page[0] and the simplest copy/paste google searches led me directly to the UMich page for the datasets which share that exact name[1].
I did notice the rest of the details were identical, but honestly I had already lost good faith and stopped 'evaluating' once I saw the perfectly matching dataset names. I have very little faith in the tech / SaaS industry, and generally I am quick to assume people are grifters, particularly for sites with no social proof established by reputation from a wide userbase and no convincing 'about us' with actual people's bios and real contact information.
I get irrationally angry about for-profit license violations of 'free-ish' intellectual property, because of how hard capitalists come down on people like Aaron Swartz, Kim Dotcom, and scores of teenagers using Napster or BitTorrent when I was young. So I plunge a bit deeper into them than is rational. I strongly identify with superficially irrational efforts like Naomi Wu's[2].
It always makes me sad to see a university selling out so hard. Profit profit profit! Learning? Knowledge? Education? Only if it leads to more profit for students… but especially if it means more profit for the university.
EVERYTHING in America is designed to fuck you as hard as possible unless you can pay. Literally everything is about incentivizing you to give as much of your personal cash up as possible to lessen the shit being shoveled onto you.
This is what happens when you abandon people and enshrine "greed is good" as a national philosophy for 40 years.
Unfortunately, like seemingly every aspect of North American culture, this philosophy is being exported and accepted in other countries too, like Egypt as one example.
“Patterns of behavior that were highly regarded in a more stable society such as sticking to one’s word or promise, pride and personal integrity, are now less prized. Such values are less fit for a rapidly changing society where loyalty to old relationships, be they friends, spouses, places or principles appears as nothing more than an excessive sentimentality unbecoming in one who is on the make.” - Galal Amin, in Whatever Happened to the Egyptians