About The Ratings
Once upon a time, we had a truly novel and clinically insane algorithm for calculating WhatNotToSing.com performance ratings. It involved a basic starting algorithm built on simple statistics and data modeling, followed by manual adjustments using 'comparison points' -- e.g., how two performances were (sometimes contradictorily) viewed by two well-established sources. In the days when American Idol was a worldwide phenomenon, there was no shortage of "well-established sources" across the Internet.
Times change. AI viewership declined, and so did the public forums where people posted and shared their opinions on each night's performances. Finding enough web sources became so challenging that we broke down and instituted a more direct line of communication with fans of Idol: reviews of each competition-night episode from an established but open set of reviewers. This became known as our "Review Crew", because every musical artist worth their salt has at least one reductive, rhyming name for their social media followers, and we felt we were just doing our part.
The Crew has been worth its weight in rhodium to us, not only as a source of consistent, timely, high-quality reviews, but also because it lets us track more individual rating reviews (as opposed to ranking reviews, which we'll explain later) than we could even at AI's peak of popularity. Over time this allowed us to simplify our algorithm greatly -- we do not, for example, do much of anything these days based on 'comparison points', and most (but not all) of the calculations and adjustments now fit neatly into an Excel spreadsheet, albeit a complex one.
The one drawback is that the integrity of modern WNTS ratings is heavily dependent on maintaining continuity with our most established reviewers. Or, in simpler terms, if Group A of 50 reviewers grades one episode, and Group B of 50 brand-new reviewers grades the next one, we're toast. We can still calculate relative scores for the second episode, but any correlation of its ratings to the first show's ratings would be completely coincidental. A 50 the second night might be equivalent to a 30 the first night, or a 70, or against all odds maybe even a 50, but we'd have no way of knowing. Ah, but really it's no big deal -- honestly, it would only be a problem if something completely ridiculous happened. Like, say, owing to a slew of issues arising from a pandemic, we took three full seasons off. But that'll never happen, so why worry about it?
Ahem. Anyway, we've updated this page (and greatly simplified it) with the current algorithm we use. The system works well, provided there are no breaks in continuity. We've retained the much simpler executive summary of the performance ratings on our Intro to WNTS 101 page, though it's a bit outdated today, and there's more on our Frequently Asked Questions page. We'll get around to updating those pages someday.
Though the content and performance ratings on our site are copyrighted by us, our rating process described below (which is based entirely on basic statistical principles and practices) is freely available for you to use and modify, subject to the terms of the GNU Lesser General Public License. It works well, but it's labor-intensive and we think there's still room for improvement. We'd be happy to hear from you if you can suggest ways to make it better.
Prologue: What's It All About, Alfie?
Before we discuss how we calculate the approval ratings, let's take a moment to discuss what they really are....
WhatNotToSing.com approval ratings measure how well (or not so well) a performance was received by the Internet fans of American Idol. A common misconception is that approval ratings are synonymous with how "good" or "bad" the singer was that night. In reality, singing quality is just one factor (albeit the most significant one) that affects the final number. Others include song choice; presentation; the judges' remarks (back in the days when they weren't universally fawning); the contestant's personality, composure, and performance history; and how well other contestants performed that night in comparison.
We'd love to isolate and rate separately all the factors that go into a performance. But, we don't know how to do that, at least not objectively. Thus we rate the performances as a whole and allow the Idolsphere to interpret why a particular performance received a particular number. This is why we avoid using terms like "better" and "worse" on the site when comparing performances – we use phrases like "higher-rated" and "less-liked" instead.
The goal of Project WNTS is to provide a consistent, objective comparison basis between performances, episodes, and seasons. If we do this successfully, fans and future Idol contestants can make intelligent decisions based on hard facts, instead of silly bromides (like "Never sing Aretha Franklin!") that don't stand up to analysis.
As for what separates a "good" vs. "bad" performance, that's entirely up to you. If you believe that Summertime was the worst performance in AI history and one with a single-digit approval rating was the best, our official response is: "whatever." There are a few performances that America rated in the 30s and 40s that we happen to think were terrific, and occasionally one in the 90s whose popularity we still don't understand. No matter. America decides, we report.
Ratings and Voting
This is important: Do not mistake WhatNotToSing.com approval ratings for a voters' poll. Ratings are a measure of how well viewers liked or disliked a performance. We make no attempt whatsoever to predict how they'll actually cast their ballots.
American Idol voting patterns are extremely complex, as anyone who's watched the show for any length of time already understands. For better and for worse, performance quality is just one piece of the puzzle. What matters most is how much a contestant motivates America – particularly their fanbase of loyal supporters – to cast a vote for them before the final commercial break. (Perversely, sometimes the best way to do this is to perform spectacularly badly, though that's not a well you want to go to too often.)
Fortunately, there is a large cadre of "free agent" voters out there who cast their ballots primarily for the evening's best performances. Singing well will thus almost – almost – always see you safely through to the next week. Even more encouragingly, there is a very strong correlation between approval ratings and long-term survival on the show. But sometimes the non-performance forces are so strong that there is simply nothing a contestant can do to overcome them. Historically, one performance from Season Two illustrates this phenomenon best: Band Of Gold.
We now return you to your regularly scheduled article.
Chapter One: The Idolsphere
WhatNotToSing.com performance ratings were once based entirely on publicly-posted opinions we found on the World Wide Web: blogs, newspaper articles, and forums; from message boards, roundtables, chat room logs, online polls and feeds of every type imaginable, most of which were not solely dedicated to American Idol. Some have dubbed this virtual world of American Idol fans "The Idolsphere". Today, however, the ratings are heavily based on Review Crew ballots, though we still poll several public sites and music critics' articles, and we are always on the lookout for more.
What we don't use, and never have: our own opinions, the judges' and mentors' critiques, the official AI website (not that they have a forum anymore), fansites or social media groups aligned with a particular candidate, and the occasional person or source that we just plain don't want to be seen visiting. That last one is used very sparingly; the only truly blacklisted site we've ever had was a forum back in the day with a dedicated Idol chat page that had some pretty good reviews and seemed innocent enough...until we discovered that it was actually part of a board for a neo-Nazi organization. You can't make this stuff up.
We used to have a rule that all opinions counted towards approval ratings had to be published or received before the episode's eliminations were announced, to avoid having voters' emotions seep into their critiques. Today, of course, votes are tallied and eliminations announced before the episode is finished, so the old "24-Hour Rule" is pretty much worthless. We do emphasize to reviewers, however, that they should keep it real each and every week, even if their favorite contestant just got shockingly booted.
Finally, ratings for Season One were calculated retroactively in 2008 because there were not enough surviving web sources to calculate them any other way. The show never really took off on social platforms until the next season. We thought this would be a one-time fill-in-the-blanks effort, but we took time off during and after the pandemic, so retroactive ratings are on the agenda once again.
Chapter Two: Rankings, Ratings, and Critiques
Performance reviews on the Web come in many flavors. The more descriptive and intelligently written they are, the more we love and revere them (and the greater weight they ultimately have on the overall WhatNotToSing.com ratings.) Here's a sampling of what we use, from least sophisticated to most.
Ranked Reviews
On message boards and chat rooms, simple rankings are most popular. Fans express their opinion of the evening's performances by placing them in order from best to worst, e.g.:
1. Chris
2. Taylor
3. Katherine
4. Elliot
5. Ace
6. Paris
7. Kellie
These are useful, but they're not terribly descriptive. Was the gap between Chris and Taylor as big as the gap between Taylor and Katherine? Was Chris far better than Kellie, or were they all bunched in the "muddled middle"? With basic rankings, there's no way to tell - all we know is the order.
More sophisticated reviewers go a step further by indicating where they felt there were clear breaks in the performance level, e.g.:
1. Chris
2. Taylor
(gap)
3. Katherine
4. Elliot
5. Ace
6. Paris
(big gap)
7. Kellie
This is better; now we know the reviewer felt that the seven performers fell into three strata, and that Chris and Taylor were a clear cut above the rest (and Kellie a distant cut below.) But there are still questions. Were Chris and Taylor really that good, or were they just the best of a mediocre episode? Was Kellie that bad, or just average on a night in which everyone else brought their A-game? Unless the reviewer includes some descriptive text (even if it's as little as "this episode stunk!") we can't even hazard a guess. We need more contextual information. Which brings us to....
Rated Reviews
These are what we value most and what we strongly encourage all Review Crew members to submit (though we accept ranked reviews with gratitude, too). Here, reviewers not only provide ordinals for the evening, but they also attempt to assess each performance qualitatively. For example:
Chris - 10 out of 10
Taylor - 9.5
Katherine - 8
Elliot - 8
Ace - 7.5
Paris - 6.5
Kellie - 3
Now we're getting somewhere. According to this reviewer, Chris wasn't just OK on an otherwise bad episode. He was very good in the general scheme of things – presumably, having been assigned a maximum 10 of 10 rating, as good as it gets.
Better still, if this reviewer grades all episodes using the same scale, we now have an idea how each performance fares relative to performances from other episodes. We'll discuss this more later on.
Can we infer also that Kellie had a very bad night? Not quite. The problem here is that every reviewer's scale is relative. Perhaps a '3' in this person's book is very bad, or perhaps it's poor but not awful. If you think we're splitting hairs, you haven't visited too many Idol blogs. It's not uncommon to read reviews along these lines:
Kellie: OMG that was the Worst. Singing. Ever!!!!
Was there a SINGLE NOTE in tune??
My ears are still bleeding! I thought I was going to throw up when she missed that high
note towards the end. Awful, awful, awful - if she starts singing again I'm going
to puncture my eardrums and crawl out onto my fire escape until it's over!!!!
5.5 stars out of 10.
Uh, excuse us, but...5.5 stars??! What would a 1-star performance sound like? Would it cause a mass extinction? Anyway, this (made-up) example just underscores the fact that we can't treat rating values as absolute and simply average the scores. We have to convert them as best we can to a common, meaningful scale - mathematicians call this process "normalization", if you're interested.
Chapter Three: Putting It All Together
OK, we've collected all these opinions. How do we turn them into numerical approval ratings on a scale of 0 to 100? Here's the approach we use. If you've found the reading to be pretty easy so far, we'd better warn you that the article is about to take a sharp turn for the technical. If you have math anxiety, just skip this chapter.
Step 1 - Ordinals
For each episode, we start by calculating the "ordinal" score of each performance by each reviewer, rated or ranked. If there are 10 performances, the one that the reviewer felt was the best of the night is awarded 9 points, because there were nine performances it "beat". The second-best gets 8 points, and so on to the worst performance, which scores zero. We do this for all reviews, then sum the results.
This step is actually far more complex than we just described, because we also adjust for factors such as ties, missed performances, "(gap)"s in a reviewer's list, multiple-song weeks (i.e., the Final 5 onwards), and a few other things. By the end of this step, we have a consensus ranking of the performances and a rough idea of how they rate in relation to one another.
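For the code-inclined, here's a minimal Python sketch of the basic ordinal step -- none of those adjustments included. The reviewer lists and names are invented for illustration; this isn't our actual tooling.

```python
from collections import defaultdict

def ordinal_points(ranking):
    """One reviewer's list, best to worst: the top performance earns n-1
    points (it "beat" n-1 others), the next earns n-2, the last earns 0."""
    n = len(ranking)
    return {name: (n - 1) - position for position, name in enumerate(ranking)}

def consensus_ordinals(all_rankings):
    """Sum each performance's ordinal points across every review."""
    totals = defaultdict(int)
    for ranking in all_rankings:
        for name, points in ordinal_points(ranking).items():
            totals[name] += points
    # Highest total = the consensus best of the night.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

reviews = [
    ["Chris", "Taylor", "Katherine", "Elliot", "Ace", "Paris", "Kellie"],
    ["Taylor", "Chris", "Elliot", "Katherine", "Paris", "Ace", "Kellie"],
]
print(consensus_ordinals(reviews))
```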
Step 2 - Ratings
Next, we set aside all reviews that contained rankings only; there's nothing else we can glean from them. For the remaining reviews, we start the process over again, but this time we average the ratings of each performance. Because every reviewer uses a different rating scale (1 to 10, or 0 to 5 stars, or A through F), we convert them as best as we can to our 0 to 100 scale.
We should mention here that we treat some rated reviews more equally than others. The more ballots we've received from a person, the more confidence we have in what their 30, or 4.5 stars, or B+ means in the grand scheme of things. The weightings aren't linear -- closer to ∜n, in fact, where n is the number of episodes a person has rated -- but they do tend to make things considerably more accurate.
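If you're curious what that looks like in practice, here's a rough sketch assuming a simple linear conversion from each reviewer's scale to 0-100 and the fourth-root-of-n weight described above. The scale bounds and ballots are made up, and the real conversions involve more judgment than this.

```python
def to_wnts_scale(score, scale_min, scale_max):
    """Linearly map a score from a reviewer's own scale onto 0-100."""
    return 100.0 * (score - scale_min) / (scale_max - scale_min)

def weighted_performance_rating(ballots):
    """ballots: list of (score, scale_min, scale_max, episodes_rated)."""
    numerator = denominator = 0.0
    for score, lo, hi, episodes_rated in ballots:
        weight = episodes_rated ** 0.25          # roughly the fourth root of n
        numerator += weight * to_wnts_scale(score, lo, hi)
        denominator += weight
    return numerator / denominator

# Three hypothetical ballots for one performance: a 10-point scale, a 5-star
# scale, and a 0-100 scale, from reviewers with very different track records.
ballots = [(7.5, 0, 10, 80), (3.5, 0, 5, 12), (82, 0, 100, 3)]
print(round(weighted_performance_rating(ballots), 1))
```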
Step 3 - Normalization
Here's where things get tricky. We have to "normalize" the rated results so that they lie on a scale that is consistent across all episodes and all seasons. Our goal, in other words, is for a performance with an approval rating of 50 in the Season "X" Finale to be equivalent to one that scored 50 in the Season "Y" Semifinals.
If you've ever submitted a ballot via the Review Crew, even if it was 10 years ago, you have your very own row in our master database. Congratulations! We know your historical average rating and standard deviation, along with everyone else's. We also know, of course, what the WNTS average episode rating and s.d. happen to be. Normalization is the process of taking individual ballots and mapping them to the same scale -- in this case, a unit distribution where 0 is dead average, one s.d. above average is 1, two s.d.'s above is 2, one s.d. below is -1, etc.
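In code, the per-reviewer normalization looks something like this sketch (the reviewer statistics are invented):

```python
def normalize_ballot(score, reviewer_mean, reviewer_sd):
    """Re-express one raw score on the unit scale described above:
    0 = this reviewer's historical average, +1 = one s.d. above it, etc."""
    return (score - reviewer_mean) / reviewer_sd

# A generous reviewer (average 68, s.d. 10) and a tough one (average 45,
# s.d. 20) both hand out a raw 75 -- two very different statements:
print(round(normalize_ballot(75, 68, 10), 2))   # 0.7: a bit above their norm
print(round(normalize_ballot(75, 45, 20), 2))   # 1.5: a rave, by their standards
```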
What's 'standard deviation'? It's a very common statistical measure of how much variance there is in a data set. Say we have two performances with approval ratings of 50, each based on six reviewers' collective opinions. For Performance A, all six reviewers scored it as a 50. For Performance B, three scored it at 80 and three at 20, which averages to 50. The difference is in the standard deviation (often abbreviated as the Greek letter sigma σ): A's is 0, B's is nearly 33. The higher the value, the more differences of opinion there were among the Idolsphere. (For WhatNotToSing.com approval ratings, the average σ is about 18.)
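Here's that Performance A / Performance B example worked out in a few lines, using the sample standard deviation, which is what gives the "nearly 33" figure:

```python
from statistics import mean, stdev

perf_a = [50, 50, 50, 50, 50, 50]
perf_b = [80, 80, 80, 20, 20, 20]

print(mean(perf_a), stdev(perf_a))              # 50 0.0
print(mean(perf_b), round(stdev(perf_b), 1))    # 50 32.9
```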
We start by calculating the episode rating, which is the easiest part: after normalizing each ballot, we merely compute the average of every single rating of every performance on the night. This will usually fall between -1 and 1. We convert that to our familiar 0-100 scale, and presto: that's the episode rating, and pretty much nothing that happens downstream can change it.
Next we do the same for the performance that finished first in the ordinals. That'll almost always be above 0, usually in the neighborhood of 2 to 3. And, we do it for the last-place performance as well. That gives us our rating range, and those get mapped to 0-100 as well.
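We haven't spelled out the exact conversion from the unit scale back to 0-100, so treat this as a purely illustrative sketch: it assumes a linear map where a normalized 0 lands at 50 and each standard deviation is worth a fixed number of points. The constant is ours, chosen for the example, not a WNTS formula.

```python
POINTS_PER_SD = 16.0   # illustrative only -- not taken from our spreadsheet

def to_approval(z):
    """Map a normalized (unit-scale) value onto the 0-100 approval scale."""
    return 50.0 + POINTS_PER_SD * z

episode_z = 0.35               # average of every normalized rating on the night
top_z, bottom_z = 2.4, -2.1    # the first- and last-place performances

print(to_approval(episode_z))                       # episode rating: ~55.6
print(to_approval(bottom_z), to_approval(top_z))    # rating range: ~16.4 to ~88.4
```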
If the distribution of the ordinals is more or less linear, we can pretty much stop here. We'll "float" the second-highest to second-lowest performances in rough proportion to their ordinal percentages, always maintaining the all-important episode average of course, and call it a night. We like nights like that.
Sometimes, however, the ordinals make clear that there was a ginormous gap somewhere in the set list. For example, there was the famous AI7 Final 12 show, where four contestants delivered performances that earned near-universal acclaim (including a rare three-way tie.) The issue there was that the fifth-rated performance that night was clearly a million miles behind the first four, and on top of that, the lowest-rated performance was a train wreck for the ages. In cases like that, life is harder. Wherever there is a big gap, we'll go through the process of calculating a "direct" rating for the performances on either side of the gap, then "float" the intermediate performances, again ensuring that the episode average comes out exactly where we pre-determined it.
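Here's one way the "float" step might look in code, based on the description above: anchor the top and bottom performances at their direct ratings, place everything else in proportion to its ordinal total, then nudge the interior ratings so the episode average lands exactly where it was fixed beforehand. The numbers are invented, and the real spreadsheet handles gap nights with more care than this.

```python
def float_intermediates(ordinals, top_rating, bottom_rating, episode_rating):
    """ordinals: dict of performance -> ordinal total (best = highest)."""
    names = sorted(ordinals, key=ordinals.get, reverse=True)
    hi, lo = ordinals[names[0]], ordinals[names[-1]]
    # Place every performance between the two anchors, proportional to ordinals.
    ratings = {
        name: bottom_rating
              + (top_rating - bottom_rating) * (ordinals[name] - lo) / (hi - lo)
        for name in names
    }
    # Shift the interior performances so the episode mean is preserved,
    # leaving the anchored top and bottom ratings untouched.
    interior = names[1:-1]
    if interior:
        error = episode_rating - sum(ratings.values()) / len(ratings)
        for name in interior:
            ratings[name] += error * len(ratings) / len(interior)
    return ratings

print(float_intermediates(
    {"Chris": 18, "Taylor": 15, "Katherine": 9, "Elliot": 8,
     "Ace": 7, "Paris": 5, "Kellie": 1},
    top_rating=88, bottom_rating=16, episode_rating=56))
```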
Our spreadsheet does most of the work, but not all, especially on "gap nights". Those are the times when the overnight ratings come with the sort of consumer advisories you usually only hear on pharmaceutical commercials. In short, we'll warn you if we think the early numbers are dodgy. Incidentally, we DO have an algorithm that would automate this entire process and produce (we think) more accurate ratings based on direct calculations and some linear regression, but it's too much for Excel and Visual Basic. Someday we'll code it up and release it on the website, where Review Crew ballots can be entered online and the code will do all the work for us. Maybe.
Step 4 - Fine Tuning
We make a few fine-tuning adjustments here and there. These days they're very rare, but back in Idol's heyday we'd occasionally find that a site had become too biased for or against a particular contestant. Good times, good times.
At the end of the season, we make one final normalization to account for the fact that standard deviations of ratings have been creeping up over the years, the result of AI often belatedly allowing artists of more varied genres onto the show. Sooner or later they'll put a proper Hip-Hop artist into the Top 24, though we've been saying that for going on two decades now. Anyway, now that contestants who do (among other things) Electronica, Singer-Songwriter, Gospel/Christian, Emo, various flavors of Modern Pop, and Whatever The Hell Billie Eilish Is are now welcome on the show, s.d.'s of individual performances have gone up, which is perfectly understandable. Not every song these days is everyone's cup of tea.
To partly correct for this, in our final normalization we "stretch" the distribution of all ratings in a season so that it roughly forms the same-shaped bell curve we saw in prior seasons. These days, this usually means that the top-rated performances go up a few points and the bottom-rated ones go down. It's not perfect, but we're dealing with a show that has been on the air for nearly a quarter-century. At that point, the stars on the website for that season turn gold, and those are the numbers that posterity will see.
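In code, that stretch amounts to rescaling every rating in the season about the season mean until its spread matches a target taken from earlier seasons. Here's a sketch with an illustrative target, not our actual constant:

```python
from statistics import mean, pstdev

def stretch_season(ratings, target_sd):
    """Rescale a season's ratings about their mean so the spread matches
    target_sd; values above the mean rise, values below it fall."""
    m, s = mean(ratings), pstdev(ratings)
    return [m + (r - m) * (target_sd / s) for r in ratings]

season = [22, 35, 41, 48, 50, 55, 63, 71, 84]      # made-up season ratings
print([round(r, 1) for r in stretch_season(season, target_sd=20.0)])
```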
Epilogue: Final Thoughts
To make life a little simpler for our readers, we often refer to approval ratings as ranging from 1 to 5 stars, in 20-point intervals. There's nothing significant about these intervals, mathematically or otherwise. They're just a convenient way to split up the scale.
Stars | Numeric Rating
---|---
1 star | 0 to 19
2 stars | 20 to 39
3 stars | 40 to 59
4 stars | 60 to 79
5 stars | 80 to 100
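If you'd rather have the table as code, the star band is just the rating's 20-point bin, clamped to 1 through 5:

```python
def stars(rating):
    """Map a 0-100 approval rating to its 1-to-5 star band (20-point bins)."""
    return min(5, max(1, int(rating // 20) + 1))

print(stars(0), stars(19), stars(20), stars(59), stars(80), stars(100))  # 1 1 2 3 5 5
```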
Is it possible for a performance to wind up with a rating below 0 or above 100 after all adjustments are considered? Theoretically, yes. We'd guess the actual limits are roughly -3 and 103. The chance of a performance falling out of the 0-to-100 scale in reality, however, is virtually nil.
Thanks for having read this far! If you have further questions on the rating system or suggestions on how to improve it, we'd love to hear from you.
-- The Ratings Board of WNTS.com