Harvard Opens Up Its Massive Caselaw Access Project

from the good-to-see dept

Almost exactly three years ago, we wrote about the launch of an ambitious project by Harvard Law School to scan all federal and state court cases and get them online (for free) in a machine readable format (not just PDFs!), with open APIs for anyone to use. And, earlier this week, case.law officially launched, with 6.4 million cases, some going back as far as 1658. There are still some limitations — some placed on the project by its funding partner, Ravel, which was acquired by LexisNexis last year (though, the structure of the deal will mean some of these restrictions will likely decrease over time).

Also, the focus right now is really on providing this setup as a tool for others to build on, rather than as a straight up interface for anyone to use. As it stands, you can either access data via the site’s API, or by doing bulk downloads. Of course, the bulk downloads are, unfortunately, part of what’s limited by the Ravel/LexisNexis data. Bulk downloads are available for cases in Illinois and Arkansas, but that’s only because both of those states already make cases available online. Still, even with the Ravel/LexisNexis limitation, individual users can download up to 500 cases per day.
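To put that 500-cases-per-day cap in perspective, here is a quick back-of-the-envelope sketch. It uses only the two figures quoted above (6.4 million cases, 500 per day); the function name and the `users` parameter are made up for illustration and have nothing to do with the actual case.law API.

```python
import math

TOTAL_CASES = 6_400_000  # collection size quoted in the post
DAILY_LIMIT = 500        # per-user daily download cap quoted in the post


def days_to_fetch_all(total_cases: int, daily_limit: int, users: int = 1) -> int:
    """Whole days needed if `users` people each download `daily_limit` cases per day.

    Hypothetical helper: it assumes no overlap between users and ignores
    any restriction other than the daily cap.
    """
    if total_cases < 0 or daily_limit <= 0 or users <= 0:
        raise ValueError("total_cases must be >= 0; daily_limit and users must be > 0")
    return math.ceil(total_cases / (daily_limit * users))


solo_days = days_to_fetch_all(TOTAL_CASES, DAILY_LIMIT)
print(solo_days)                  # 12800 days for a single user
print(round(solo_days / 365, 1))  # roughly 35.1 years
```

In other words, one person working alone would need about 35 years to pull down everything, which is why the API and whatever gets built on top of it matter more than the individual download allowance.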

The real question is what will others build with the API. The site has launched with four sample applications that are all pretty cool.

  • H2O is a tool that law professors can use to easily create casebooks for students in various areas of law. Anything published on H2O gets a Creative Commons license and can then be shared widely. I wonder if professors like Eric Goldman, who offers an Internet Law Casebook, or James Grimmelmann, who has a different Internet Law Casebook, will eventually port them over to a platform like H2O.
  • A wordcloud app that currently shows the “most used words” in California cases in various years. Here, for example, are the word clouds in California cases from 1871… and 2012. See if you can tell which one’s which.
  • Caselaw Limericks that appears to randomly generate what it believes is a rhyming limerick from the case law. Here’s what I got:

Her son Julius is a confirmed thief.
He did not turn over a new leaf.
The vessel, not.
the parking lot.
Respondent concedes this in its brief.

    The quality overall is… a bit mixed. But it’s fun.
  • And, finally, in time for Halloween, Witchcraft in Law, which totals up cases that cite “witchcraft” by state.

Hopefully this inspires a lot more on the development side as well.

Companies: harvard, lexisnexis, ravel


Comments on “Harvard Opens Up Its Massive Caselaw Access Project”

Anonymous Coward says:

Re: Non-Free

6b: By submitting your User Content to the H2O Services, you also agree to allow H2O to license your Content under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 License

That only means that people other than the copyright holder are not allowed to use the data in a book if they obtain it under that license. The contributor is free to license or sell their own works as part of a commercial enterprise. Similarly, anybody with a commercial enterprise in mind is free to approach the copyright holder to obtain a license that permits commercial use; they just have to live with and compete with the Creative Commons version.

Thad (profile) says:

Re: Re: Non-Free

Yes, but the "noncommercial" CC licenses are considered non-free, by the definition of free culture licenses. See the NonCommercial interpretation page on the CC Wiki:

NC licenses do not qualify as “open licenses” under the Open Definition, and works licensed under an NC license are not considered Free Cultural Works. This may be important if you want others to further distribute your work on Wikipedia, Wikimedia Commons, or other platforms requiring a license that meets the Open Definition or the Definition of Free Cultural Works.

Anonymous Coward says:

Re: Re: Re: Non-Free

This may be important if you want others to further distribute your work on Wikipedia, Wikimedia Commons, or other platforms requiring a license that meets the Open Definition or the Definition of Free Cultural Works.

If that is what the person wants, they distribute the work via one of those platforms under a suitable free license, or if they have already done so, they can submit to the project under an NC license.

Nothing in the rules stops a copyright owner distributing a work under several licenses.

Anonymous Anonymous Coward (profile) says:

Re: Huh?

Obviously you have never had it mixed a la minute (which means there is less than a minute between adding the dressing to the cabbage and whatever other veggies you like in your cole slaw, mixing it and serving it). Crisp vs marinated and soggy.

Thing is, eventually all of whatever Harvard harvests will be re-harvested and then become freely available and unencumbered. Though the fancy apps might not be included. Seems to me some folks were doing this with Pacer, or some other unreasonably encumbered system.

500 downloads per day would only need the cooperation of 13,000 people for one day to capture the entire database. There certainly could be permutations of people and days. To think that anyone might be able to control this public information beyond the download (incorporating it in the apps is different) would be incredible.

keithzg (profile) says:

Re: Re: Re-publishing and archiving

In terms of PACER, I think the main effort has been https://free.law/recap/ ?

And yeah, I was thinking, someone should definitely coordinate this. I’d certainly run a CRON job on one of my systems to pull down another 500 downloads per day, orchestrated to avoid duplication of effort by some central server like how bitcoin mining pools work.

Anonymous Coward says:

But how will censorship be properly implemented?

Court records contain much information that, at least judging by Wikipedia administrator standards, is unfit for public view. The identities of rape and sexual assault accusers would be the most obvious examples of forbidden knowledge, even when, as in the case of Julian Assange, names that have been repeatedly printed in newspapers all over the world are meticulously scrubbed off Wikipedia the instant they (re)appear. To a slightly lesser extent, the same goes for “ordinary” personal information about most anyone (i.e., “doxing”) obtained through publicly available online sources, including court records.

Despite lacking any actual political power, this online “encyclopedia” could be viewed as a kind of democracy in action, a way to determine what kind of information could be considered fit for public consumption and what is not. And judging by Wikipedia’s standards, much of the information in court records is considered private and thus must not be seen by the public (even when it can already be easily found on the internet).

So the question is, will these court documents get reviewed and scrubbed of personally identifying (and potentially embarrassing) information, and perhaps even corrected to meet modern day standards of social etiquette (like using the “correct” pronouns), or will this be a kind of massive data leak sure to upset everyone from traditional privacy advocates to modern social-justice activists?

Anonymous Anonymous Coward (profile) says:

Re: But how will censorship be properly implemented?

There is a difference between what a private corporation, Wikipedia, publishes, and what are public documents, no matter how hard to find. Wikipedia could be sued many, many times, and whether there is a case or not they would have to defend themselves, even if the information was considered public.

Public documents, on the other hand, whoever posted them, are not actually actionable, though there are some in the EU who might differ with that.

Maybe if Wikipedia opened a set of public documents pages and then linked to that, it might spare them some of the legal angst that would come their way if they didn’t. Then again, maybe not.

I will be looking forward to hearing here on Techdirt about the lawsuits from folks in the EU against Harvard for the publication of these documents, even though those lawsuits should go nowhere.

Christenson says:

Next App: Which cops perjured themselves??

That is, there are any number of weasel words to say that a cop’s testimony is questionable. Seems to me a new app could mine the corpus for the names of all the cops, then look for when a judge determined that the cop’s testimony was “not credible”, and spit out the name of the cop and citations.

Fermina Fato (user link) says:

Dig Up the Dirt on Case. Law & Case Law Project

What can I say about these cavalier assholes? They published a 20 year old expunged case, which I am sure they paid someone to get, or perhaps got from that other dirtbag, Leagle.com, whom I know paid someone to get. CAP refuses to take it down, even tho it hasn’t appeared in the Reporter for over 20 years, & contrary to Leagle’s assertion that the case remains in the Appellate record, which the 4th DCA confirmed in writing that it does not. The internet is rife with complaints about Leagle, but we’re here about CAP, who even after being provided our State Laws, MA state laws, Federal Laws, Case law & the MA CORI, refuse to take the case down & only offered to blur the first name. The ACLU & CCResource Project have other things to say.
