Small typo




  • Apparently, the DC schools use some model from Mathematica as part of teacher evaluations. It's supposed to use standardized test scores to figure out how much teachers are affecting scores based on past performance. Or something. Anywho...apparently, the latest calculations had an error in the code, leading to bad results. I couldn't find any details beyond what Politico is reporting, something about a "missing suffix in the programming code." I can understand why Mathematica wouldn't want to let the details of the bug out into the wild, but it would be interesting.

    Looks like some teachers got inflated ratings and others rated lower than they should have been. After fixing it, a few teachers are getting their $15,000 bonuses for good performance, though no one will have to pay back an erroneous bonus. I don't think anyone was fired, but apparently some teachers left for other districts or even quit teaching altogether after getting bad results.


  • Discourse touched me in a no-no place

    @boomzilla said:

    I don't think anyone was fired

    @FT(first para of)TFA said:
    A single missing suffix among thousands of lines of programming code led a public school teacher in Washington, D.C., to be erroneously fired for incompetence, three teachers to miss out on $15,000 bonuses and 40 others to receive inaccurate job evaluations.



  • @PJH said:

    @boomzilla said:
    I don't think anyone was fired

    @FT(first para of)TFA said:
    A single missing suffix among thousands of lines of programming code led a public school teacher in Washington, D.C., to be erroneously fired for incompetence, three teachers to miss out on $15,000 bonuses and 40 others to receive inaccurate job evaluations.

    Oops. Not sure why I said that. I read that and another article that said the same thing. I think I overinterpreted the stuff about people leaving without being fired.



  • While evaluating performance purely based on some metric is always bad, I don't think there is a huge WTF here, although they blow it up like everything was entirely wrong. It looks to me like there was some rounding error (e.g. wrong rounding method) or small percentage bias, based on:

    The recalculations produced “very small differences” in individual teachers’ scores, Devaney said
    Half were rated higher than they should have been and half were rated lower, Kamras said.

    If that is true then it's just a normal human error of the kind that happens everywhere. The firm looks to have proper QA procedures and did the right thing when it discovered the error... What else can you do? Sure, it's unfortunate for the edge cases but then again they are edge cases for a reason.



  • @dtech said:

    While evaluating performance purely based on some metric is always bad, I don't think there is a huge WTF here, although they blow it up like everything was entirely wrong. It looks to me like there was some rounding error (e.g. wrong rounding method) or small percentage bias, based on:

    The recalculations produced “very small differences” in individual teachers’ scores, Devaney said
    Half were rated higher than they should have been and half were rated lower, Kamras said.

    Yeah, although the guy who got fired or the guy who almost got screwed out of a $15,000 bonus probably feels like this wasn't a small error. Since this is the DC school system we're talking about, there are undoubtedly many more WTFs lurking right around the corner. Then there's the fact that the metric itself is iffy:

    @TFA said:

    When New York City calculated value-added ratings last year, city officials acknowledged that a teacher rated at the 50th percentile of her peers might actually have been as low as the 23rd — or as high as the 77th, a huge margin of error that persisted even when the city used three years of student test data to smooth out bumps.

    I think that there's value in rewarding effective teachers and not rewarding duds, but this particular metric is problematic at best.



  • @boomzilla said:

    @TFA said:
    When New York City calculated value-added ratings last year, city officials acknowledged that a teacher rated at the 50th percentile of her peers might actually have been as low as the 23rd — or as high as the 77th, a huge margin of error that persisted even when the city used three years of student test data to smooth out bumps.

    I think that there's value in rewarding effective teachers and not rewarding duds, but this particular metric is problematic at best.

    I guess that to actually isolate the teacher's performance, you'd need to normalize the data around the students' abilities?

    You'd have to magically know how well the student would have done if the teacher had done nothing.

    In other words, we need a control group, a generation of children left entirely to fend for themselves.



  • @Buttembly Coder said:

    @boomzilla said:
    @TFA said:
    When New York City calculated value-added ratings last year, city officials acknowledged that a teacher rated at the 50th percentile of her peers might actually have been as low as the 23rd — or as high as the 77th, a huge margin of error that persisted even when the city used three years of student test data to smooth out bumps.

    I think that there's value in rewarding effective teachers and not rewarding duds, but this particular metric is problematic at best.

    I guess that to actually isolate the teacher's performance, you'd need to normalize the data around the students' abilities?

    You'd have to magically know how well the student would have done if the teacher had done nothing.

    In other words, we need a control group, a generation of children left entirely to fend for themselves.

    But the control group would need exactly the same distribution as the actual classes. Which means they would need to be from an alternate universe created at the beginning of the school year from a copy of this one. That can get expensive.



  • @Ben L. said:

    @Buttembly Coder said:
    @boomzilla said:
    @TFA said:
    When New York City calculated value-added ratings last year, city officials acknowledged that a teacher rated at the 50th percentile of her peers might actually have been as low as the 23rd — or as high as the 77th, a huge margin of error that persisted even when the city used three years of student test data to smooth out bumps.

    I think that there's value in rewarding effective teachers and not rewarding duds, but this particular metric is problematic at best.

    I guess that to actually isolate the teacher's performance, you'd need to normalize the data around the students' abilities?

    You'd have to magically know how well the student would have done if the teacher had done nothing.

    In other words, we need a control group, a generation of children left entirely to fend for themselves.

    But the control group would need exactly the same distribution as the actual classes. Which means they would need to be from an alternate universe created at the beginning of the school year from a copy of this one. That can get expensive.

     


    The alternative to that would be, rather than creating a control group with the same distribution as the actual classes, control the abilities of the students in the actual classes to a known amount - forcing the system to a known state rather than testing its state.

    I see no reason that this couldn't be done efficiently and cheaply.



  • Let me tell you a small secret about standardized tests: They're not worth their salt.

    I mean, what would you say if a programmer got paid based on his LOC?

    Seriously, though, there are quite a lot of problems with measuring human performance, and using the numbers you get from those tests for anything other than a very rough indicator of whether there's a problem will cause trouble.

    Just look at the father of it all: The IQ. It's a nice number - and doesn't tell you anything at all on its own. Want to know if the guy is able to deal with humans? IQ can't tell you that. Want to know if he can do math? Well, IQ can't tell you that either. Languages? Nope. Sports? Hell, no.
    All this little number tells you is: This guy (or gal) may learn and comprehend stuff faster - if you hit the conditions under which the guy is able to learn.

    I'm a teacher myself and I absolutely hate the reverence people have for our German magic number: The Abitur Grade. They're lumping all the grades from the past two years and all the grades from the last exams into one magic number - and expect it to tell us something about the pupil himself.

    For that reason also, I absolutely hate having to grade my pupils on this moronic 0 to 15 points scale. I mean, how am I supposed to determine whether this effort was worth 8 or 9 points? (I'm not talking about any kind of written tests here!) And then they tell me to simply grade my pupils after every lesson! Yeah, right, as if grading them often with a flawed instrument will yield so much better results. That's like using a broken scale and hoping that if you measure often enough, it will eventually average out right.

    Gah.



  • @Rhywden said:

    And then they tell me to simply grade my pupils after every lesson! Yeah, right, as if grading them often with a flawed instrument will yield so much better results. That's like using a broken scale and hoping that if you measure often enough, it will eventually average out right.

    Isn't that exactly what you try and do with a measurement that is accurate but has poor precision?  I mean I've no idea if the scale you are complaining about would fit that descriptor (probably not), but lots of measurements with something flawed can get things to average out.
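
To make the accuracy-versus-precision distinction concrete, here is a minimal sketch. Everything in it is an assumption for illustration and not from the thread or the article: the true value, the bias, the noise levels, and the helper name `mean_of_readings`. It simulates two instruments. One is unbiased but noisy (accurate, poor precision). The other is tight but consistently off (precise, but "broken" in the sense of the scale analogy).

```python
import random
import statistics


def mean_of_readings(true_value, bias, noise_sd, n, seed=0):
    """Average n simulated readings from a hypothetical instrument.

    Each reading is the true value plus a fixed bias plus Gaussian noise.
    All parameters are illustrative assumptions, not real data.
    """
    rng = random.Random(seed)
    readings = [true_value + bias + rng.gauss(0, noise_sd) for _ in range(n)]
    return statistics.fmean(readings)


TRUE_VALUE = 70.0

# Accurate but imprecise: no systematic bias, lots of random noise.
unbiased_mean = mean_of_readings(TRUE_VALUE, bias=0.0, noise_sd=5.0, n=10_000)

# Precise but inaccurate: little noise, but every reading is shifted by 3.
biased_mean = mean_of_readings(TRUE_VALUE, bias=3.0, noise_sd=0.5, n=10_000)

print(f"unbiased, noisy instrument: mean = {unbiased_mean:.2f}")
print(f"biased, tight instrument:   mean = {biased_mean:.2f}")
```

Under these assumptions the first average settles close to the true value, because random noise shrinks roughly with the square root of the number of readings. The second settles close to the true value plus the bias, no matter how many readings are taken. Repeated measurement helps with the first kind of flaw and does nothing for the second.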



  • @locallunatic said:

    @Rhywden said:

    And then they tell me to simply grade my pupils after every lesson! Yeah, right, as if grading them often with a flawed instrument will yield so much better results. That's like using a broken scale and hoping that if you measure often enough, it will eventually average out right.

    Isn't that exactly what you try and do with a measurement that is accurate but has poor precision?  I mean I've no idea if the scale you are complaining about would fit that descriptor (probably not), but lots of measurements with something flawed can get things to average out.

    Yes.  However, it's not what you do when you're a novel editor, and your boss has told you to calculate the number of words your authors have used by weighing printouts of the book, and subtracting the weight of the pages, then dividing the remaining weight by the weight of the ink used in one standard word.  And your instrument to weigh the printouts is an ancient, rusty balance.  Not sure if any of that is actually a relevant metaphor, but just saying.

    Oh, and, btw, before you say that method could work... think "pictures."  Unless, maybe, you think they are worth a thousand words...



  • @locallunatic said:

    Isn't that exactly what you try and do with a measurement that is accurate but has poor precision?
     

    Standardized tests are not an accurate measurement of.......... well, that leads to the central problem of measuring education, nobody agrees on WTF is the wanted result - now, you go and measure it.

    Anyway, if standardized tests were accurate (I'm sure they accurately measure something, just not something useful), that kind of metric would be the perfect instrument for measuring teacher performance. Answering Ben L, it does have a control group - the control are all the other classes that use the same tests - and it's the best control group available.

     



  • @Mcoder said:

    @locallunatic said:

    Isn't that exactly what you try and do with a measurement that is accurate but has poor precision?
     

    Standardized tests are not an accurate measurement of.......... well, that leads to the central problem of measuring education, nobody agrees on WTF is the wanted result - now, you go and measure it.

    Anyway, if standardized tests were accurate (I'm sure they accurately measure something, just not something useful), that kind of metric would be the perfect instrument for measuring teacher performance. Answering Ben L, it does have a control group - the control are all the other classes that use the same tests - and it's the best control group available.

     


    But the control groups aren't correctly done. There are too many independent variables for the experiment to mean anything.



  • @Ben L. said:

    But the control group would need exactly the same distribution as the actual classes. Which means they would need to be from an alternate universe created at the beginning of the school year from a copy of this one. That can get expensive.
    Only on Windows. On Linux, fork() is pretty cheap.



  •  I just write a new cosmos in Excel.



  • @locallunatic said:

    @Rhywden said:

    And then they tell me to simply grade my pupils after every lesson! Yeah, right, as if grading them often with a flawed instrument will yield so much better results. That's like using a broken scale and hoping that if you measure often enough, it will eventually average out right.

    Isn't that exactly what you try and do with a measurement that is accurate but has poor precision?  I mean I've no idea if the scale you are complaining about would fit that descriptor (probably not), but lots of measurements with something flawed can get things to average out.

    We have a saying in science: "Wer oft misst, misst Mist." Pretty hard to translate, but it amounts to: "If you measure shit often, you measure shit often."



  • @Rhywden said:

    "Wer oft misst, misst Mist."
     

    But there's no shit in the first part.

    And surely you don't say that getting lots of data is always Mist?



  • @dhromed said:

    @Rhywden said:

    "Wer oft misst, misst Mist."
     

    But there's no shit in the first part.

    And surely you don't say that getting lots of data is always Mist?

    It's a play on words and the way they're pronounced because "Mist" and "misst" sound exactly the same.

    And this saying doesn't mean that lots of data is always bad. It's just a reminder that measuring stuff often purely in the hopes of increasing accuracy is ... misguided.

    I mean, for results from interpreting data to be meaningful you have to ensure three criteria to be fulfilled as closely as possible:

    • Validity
    • Reliability
    • Objectivity

    Measuring human performance pretty often fails all three of those criteria:

    • Objectivity: Numerous sources of error right there. Besides the teacher playing favourites, you also have stuff like the Halo effect, for example.
    • Reliability: If you repeat an exam for a second time, will you get the same result? Pretty much a no. Not to mention the problem of edge cases or rounding errors - at which point do you deem an answer to be worth a point if it doesn't match your predefined answer perfectly? And, no, we can't always do multiple choice. Because that would run into the last criterion:
    • Validity: Do we actually measure the stuff we want to measure? We'll always have crosstalk from other disciplines - if you can't communicate your ideas you'll have problems in science, even when languages and science are deemed separate disciplines. I also mentioned another example: IQ. You have to be pretty careful when interpreting that number. Or even more ambiguous stuff like "creativity" - how do you measure that? Not to mention that most tests also test how well the pupil does tests - and yes, there are people who know a lot but are absolutely frightened of doing a test, panicking even...

    The example sources of error I named for each criterion are not complete, by the way. They're just that: examples. Further up in this thread, there's stuff like a comparison or a control group. There actually are tools in Psychology to prevent some of those errors - but those won't work in a school setting due to either being impractical or too time-consuming.

    Due to my required studies in science into the topic of error measurement and my own interest in the accuracy of grading, I think I've got a pretty good idea of what goes and what does not. And my conclusion is that measurements of human performance are usually only usable for a very narrow area - if they're done right. If they aren't, you might as well not have bothered in the first place. Because either you'd just have done better with rolling dice or you only get results which amount to "Might be good. Might be average. Might be bad."



  • The teacher who was fired for an “ineffective” rating in fact should have been ranked “minimally effective,” Kamras said.
    In my opinion there's nearly no difference between those two, so he probably would've been fired sooner or later anyway. Just like a student getting 49% instead of 50%, there's no real difference there; he's a failure either way.



  • @spamcourt said:

    The teacher who was fired for an “ineffective” rating in fact should have been ranked “minimally effective,” Kamras said.
    In my opinion there's nearly no difference between those two, so he probably would've been fired sooner or later anyway. Just like a student getting 49% instead of 50%, there's no real difference there; he's a failure either way.

    According to our rules, 50% is a passing grade, so you're pretty much wrong in the legal sense. On the other hand, your sentiment of 49% and 50% not being really different shows the underlying problem, or better: stupidity of such ranking systems.



  • Is the Halo problem when people get sick just in time for big game releases? Has that phenomenon been studied?



  • @Rhywden said:

    Let me tell you a small secret about standardized tests: They're not worth their salt.

    Speaking as someone who has achieved an unusually high score on both an IQ test and the SAT (for those who don't know, a standardized test given to high school students in a lot of the US), let me tell you: Standardized tests are a pretty accurate measure of how good someone is at taking that kind of standardized test.

    I don't especially believe that's a useful thing to measure, but...

    Incidentally, based on my own highly anecdotal evidence, practicing and training to achieve a better score on the SAT (particularly the English portion) reduces one's measured IQ.



  • @henke37 said:

    Is the Halo problem when people get sick just in time for big game releases? Has that phenomenon been studied?

    I know you're joking but it's probably a problem some of you have run into before, namely whenever you complain about someone who is prominent in meetings in front of your boss and not-so-prominent when it comes to actual work done while the boss is not around.
    That's the Halo effect in a nutshell.

    For instance, if you ever find yourself in front of a jury, make it a point to be sharp-dressed and invest in some professional cosmetics plus a visit to a hair salon. You'll most likely reduce your sentence and the chance of being convicted at all.

