New web app for calibration training

In October 2016, we wrote:

we are contracting [a developer] to build a simple online application for credence calibration training: training the user to accurately determine how confident they should be in an opinion, and to express this confidence in a consistent and quantified way.

That online application is now available:

Play “Calibrate Your Judgment”

Note that you must sign in with a GuidedTrack, Facebook, or Google account, so that the application can track your performance over time.

We expect many users will find this program to be the most useful free online calibration training currently available.

That said, we think there are several ways in which a calibration app could be more engaging and useful than ours, if someone were to invest substantially more development effort than we did. Some reflections on the challenges we encountered, and some lessons we learned, are available in this Google doc.

Spencer Greenberg, the lead developer of Calibrate Your Judgment, has released a paper describing the scoring rules used in the app, here.
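
For readers who haven't encountered scoring rules before, the basic idea is that your reported confidence is scored so that honest, well-calibrated reports maximize your expected score. The sketch below uses the standard quadratic (Brier) score as a generic illustration; it is not necessarily the rule the app or the paper uses.

```python
# Generic illustration of a proper scoring rule (the quadratic/Brier
# score). This is a textbook example, not necessarily the rule used by
# Calibrate Your Judgment or derived in Greenberg's paper.

def brier_score(prob: float, outcome: bool) -> float:
    """Squared error between the reported probability and the 0/1 outcome.
    Lower is better; reporting your true belief minimizes the expected score."""
    return (prob - (1.0 if outcome else 0.0)) ** 2

# If your true hit rate on a class of questions is 70%, reporting 70%
# beats both underconfident (50%) and overconfident (99%) reports:
for reported in (0.5, 0.7, 0.99):
    expected = 0.7 * brier_score(reported, True) + 0.3 * brier_score(reported, False)
    print(f"report {reported:.2f} -> expected score {expected:.4f}")
```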

Update April 23: In response to feedback, we have now improved the set of questions used for the app’s confidence intervals module, by removing hundreds of ill-formed or confusingly worded questions. We hope this leads to a better and more useful experience for users.

Comments

Could the app be open-sourced? I don’t expect OpenPhil to provide ongoing support for it or anything, but even just posting a tarball of the source would allow people to make their own improvements / updates.

(For example, the app chews through a surprisingly large amount of my CPU given that it’s mostly just menus, and I’d be interested in finding out why / reducing that effect.)

Hi Ben,

Thanks for your interest. The calibration app is built in GuidedTrack, which is not open source. However, we might (or might not) be able to share the code for the specific GuidedTrack program that is the calibration app — I’ll look into that.

Hi Ben,

The source code for the calibration app is now available on GitHub here. As you guessed, however, we can’t provide ongoing support.

Are there going to be any improvements made to the app? Some of the questions are oddly worded or don’t make sense, and I don’t see a way of giving feedback about them.

I have next to no knowledge of some of the things asked about in the questions, and so I end up providing a large confidence interval. This is the case for most of the sports questions. An example is “How many times Jeunesse Esch participated in the European Cup between 1955 and 1980?” Putting aside the questionable grammar, I have no idea who Jeunesse Esch is. I don’t even know what sport people play in the European Cup (soccer, probably?). So for me this question reduces to “How many times did someone compete in a sports championship over a 25-year period?” A few questions that require detailed domain-specific knowledge would be OK, but I feel that there are too many.

An example of an absurd (obviously machine-generated) question: “In what year was Blacks launched?” The answer is 1861. Apparently it was asking when the American Civil War began.

Hi Patrick,

Yes, there are many questions in the database of questions we licensed from another provider that aren’t in the “goldilocks zone” of difficulty for most users, in the sense that the user knows a little bit about the question but isn’t already ~100% confident of the correct answer. This is flagged as something that could be improved about future calibration apps in the companion document I wrote, but it would require a large amount of manual work to improve. Personally, I try to answer these questions anyway so that I can (I hope) become better-calibrated about questions I know nothing about. But you can also use the “skip” feature explained in the introduction to the “confidence intervals” module (i.e. just type “skip” in the answer field).

As for ill-formed questions in the licensed database, identifying and removing those would also require a fair bit of manual work, but I will reconsider whether to find someone to do it.

I like the concept, and the operation of the app seems OK, but the question set really does need some work.

I’m not sure of the value of attempting rather than skipping the questions I know nothing about - learning soccer (sorry, football, for those who insist on that name) trivia and the history of US brand names is not a high priority for me.

Among the obscure questions, I was served up the following:

“How many English players joined the club in the 1980s?”

I guess (because there have been so many other Liverpool Football Club questions) that the club must be Liverpool, but it could be the Hellfire Club for all I know…

I’m skipping more than half the questions in the “Confidence Intervals” set, and I’m someone with good general knowledge who often does well in trivia competitions.

Hi Richard,

Thanks for reporting your experience.

As mentioned in a reply to Patrick above, we agree there are many questions in the database of questions we licensed from another provider that are ill-formed or aren’t in the “goldilocks zone” of difficulty for most users, and we are reconsidering whether to put in the manual work required to improve the question set.

Misunderstood questions can completely screw up the score. It asked me when The Three Musketeers was released and I answered 1750:1870, because I assumed it meant the book. But I guess it meant the film from 1993, so it gave me -21:-15 points (I don’t remember exactly). This probably neutralized the 0:2 points I got for each of the fourteen other questions that I answered.

I don’t mind giving large confidence intervals – that also happens in real life – but there should be a way to fix mistakes like the one above. Maybe a small budget of “don’t score this” flags?
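
For illustration, here’s the textbook Winkler-style interval score, which charges a width penalty plus a steep charge for misses. I don’t know the app’s actual formula, but something with this flavor would explain why one misread question can wipe out many careful answers:

```python
# Textbook interval score for a central (1 - alpha) confidence interval
# (Winkler-style; lower is better). A guess at the flavor of rule the
# app might use, not its actual formula.

def interval_score(lower: float, upper: float, truth: float, alpha: float = 0.1) -> float:
    """Width penalty, plus a steep charge when the truth falls outside."""
    score = upper - lower                          # narrower intervals score better...
    if truth < lower:
        score += (2.0 / alpha) * (lower - truth)   # ...but misses are heavily penalized
    elif truth > upper:
        score += (2.0 / alpha) * (truth - upper)
    return score

# My interval, scored against the 1844 novel vs. the 1993 film:
print(interval_score(1750, 1870, 1844))  # covers the novel: 120.0
print(interval_score(1750, 1870, 1993))  # misses the film: 2580.0
```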

By the way, this page and the book it is related to have a lot of high-quality calibration questions: https://www.howtomeasureanything.com/3rd-edition/ Perhaps you can make a licence deal.

Hi Richard,

Thanks for the feedback.

I don’t think we’re likely to invest in new feature development, but as mentioned above we are reconsidering whether to put in the manual work required to improve the question set.

Hello Luke,

Do you have any insights about the transferability of calibration training across domains? Specifically, if one trains using this web app, how likely are the benefits to carry over to domains not covered in the training dataset?

Hi Mario,

Unfortunately, little is known about this at this time. My guess is that transfer is small to moderate, depending on the domain, but for a few people transfer might be large. Much of whatever transfer occurs may come simply from getting an intuitive sense of what different probabilities “mean,” and observing how easy it is to be wrong at least ~10% of the time merely by misunderstanding the question or making a simple mistake, and both of those lessons might improve one’s calibration in a wide variety of domains.

In other words, my guess is that it’s important to get calibration training, in some way, in domains you care about more than trivia, but it’s often difficult to set things up to get rapid, precise feedback in more important domains, and so practicing on trivia questions can be a good way to start.

Thank you. This helps.

The app has some questions that are wrong or just syntax errors, such as:

- how old ‘antique’ refers to (the app thinks 1,000 years, but the usual definition is 100, and guessing correctly scores -57 points)
- what *country* is adidas from (it should be what *year*)

Thanks, Alok. We are currently in the process of weeding out poorly-formed questions from the database.

I’ve been getting this message every time I try to submit a response: “No internet connection. Cannot save your responses.”

But the page itself loads fine, so I don’t think the problem is on my end: I have a wired connection, and other sites work normally.

This happens in both Chrome and Firefox, and I keep seeing the same questions.

Thanks, Patrick. Could you please send your browser and OS details, and a screenshot if possible, to info@openphilanthropy.org?

I have the same issue. It’s been preventing me from using the app ever since it appeared, and it means the ~10 hours I spent on it can no longer be rendered as a graph (the error seems to crash the whole thing).

Using Windows 10 and Chrome 75.0.3770.100.

I’m also sending this to the email specified.

I was surprised that you don’t provide a way to report mistakes (such as typos or ambiguities) in questions. Not only do most websites provide such a feature, but I would particularly expect an organization concerned with clear thinking to invite feedback so it can learn from its mistakes.
