One WTF and one "I might be the WTF" WTF - graphing trends with flakey data!



  • Ok, so! Today at work I met this group that's making normal dashboard-y visualizations, and they're incorporating social data. One of their social data sources (for sentiment) I happen to know from experience has about a 60% accuracy rate. Which is crap, crap enough that I'd never think of showing it to a client. But these guys are.

    Here's a basic summary of the discussion we had:

    Me: "Hey what's the source for your X sentiment data?"
    Him: "We've built a bayesian filter for it, and we've seeded it by rating a couple thousand X items for it."
    Me: "Oh yeah I've tried that in the past on a lark-- but I found I couldn't do better than 60% accuracy, so I looked into some third parties who were doing it and none of them were doing better than 70%-ish. I'd never put something like that in front of a client and call it "data". Out of curiosity, what's the accuracy rate you're getting?"
    Him: "Uh..."
    His boss, taking over: "Well, the important thing is we give the visualization to the client and they can see the trends, the trends are there even if the accuracy isn't all that great."

    WTF #1: their developer had never bothered to measure (or had measured, but wasn't willing to share) his accuracy rate. Both options are bad.

    WTF #2: I think his boss's statement about being able to see the trends even with low-accuracy data is crap, because if he had bothered to draw the error-bars on the chart, they'd virtually always be crossing the sentiment center-line. (For what it's worth, my boss and one of my co-workers agreed with me that a trendline isn't valid if the error-bars cross the centerline.)

    Take a look at this image to demonstrate what I was trying and failing to verbally describe to them:

    So here's my question: it's been a long, long, long time since I took a statistics course. Am I right, or is "other team's boss" right?



  • in b4 the inevitable "you might be the WTF" and the barrage of useless "advice" for fixing the problem...

    Seriously, though, it sounds like they're just trying to pass off bad data by using pretty graphs. The other boss all but admitted that.



  • You're more right, but really what everyone should be doing with the data is comparing it to a suitable null hypothesis (which in this case might be "No one gives a rat's arse either way").
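
    To make that concrete, here's a rough sketch (mine, not anything their system actually does) of testing an observed positive share against a "no one gives a rat's arse" 50/50 null, using the normal approximation to the binomial. All the numbers are invented:

```python
import math

def z_test_vs_indifference(positives, total, null_p=0.5):
    """Two-sided z-test of an observed positive share against a
    'no one cares either way' null, via the normal approximation
    to the binomial."""
    p_hat = positives / total
    se = math.sqrt(null_p * (1 - null_p) / total)
    z = (p_hat - null_p) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# e.g. 540 positives out of 1000 rated items
z, p = z_test_vs_indifference(540, 1000)
```

    If p comes out large, the "trend" may be nothing more than noise around indifference.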

    I'm not sure whether to suggest lending them a copy of a book such as How to Lie with Statistics or whether that would give them ideas.



  • @pjt33 said:

    I'm not sure whether to suggest lending them a copy of a book such as How to Lie with Statistics or whether that would give them ideas.
    Sounds entirely like they might already be in possession of a copy...



  • You're right, in principle.

    What do you mean by "accuracy of 60%" though? If that's for an individual "item", and you're combining N items as a sample of the general population sentiment, your uncertainty is only 40%/sqrt(N), IIRC. If you already did that, and those are the actual data points and error bars, you're completely right. You should notice this in the data, as it will be very noisy (low correlation between adjacent points of the trendline). If you don't see that, their data is better than you think. But that's still no excuse for them to be totally unaware of the accuracy of their data.
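
    A quick Monte Carlo sketch of that intuition (my own toy, not their system): apply a 60%-accurate classifier to each item, and the spread of the pooled estimate shrinks roughly like 1/sqrt(N):

```python
import random
import statistics

def observed_positive_rate(true_p, accuracy, n, rng):
    """Fraction of n items the classifier calls positive, when each
    item is truly positive with probability true_p and the classifier
    reports the true label with probability `accuracy` (else flips it)."""
    calls = 0
    for _ in range(n):
        truth = rng.random() < true_p
        correct = rng.random() < accuracy
        calls += truth if correct else (not truth)
    return calls / n

# Spread of the aggregate estimate across repeated samples, per N
for n in (100, 400, 1600):
    runs = [observed_positive_rate(0.7, 0.6, n, random.Random(i))
            for i in range(200)]
    print(n, round(statistics.stdev(runs), 4))
```

    Note the caveat, though: pooling kills the variance but not the bias. A classifier that's wrong 40% of the time drags the observed rate toward 50/50 no matter how many items you aggregate.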



  • @blakeyrat said:


    His boss, taking over: "Well, the important thing is we give the visualization to the client and they can see the trends, the trends are there even if the accuracy isn't all that great."

    So here's my question: it's been a long, long, long time since I took a statistics course. Am I right, or is "other team's boss" right?

    You're right. Hell, you don't need to be a statistician to see that inaccurate conclusions will be drawn from inaccurate facts. Just look at this website...

    Pedantic-dickweedly speaking, the boss isn't wrong: they can give the visualisation to the client, and the client can see trends. But that boss is going to look pretty stupid when the client asks the same questions about accuracy and interpretation.

    It sounds to me that the boss is trying to pass off inaccurate data as an indication of reality, which means that the client could be making terrible decisions based upon incorrect assumptions drawn from the diagrams presented. I think - especially in the US - there may be a legal case arising if that situation occurs. I know many UK adverts have been spanked for misrepresenting facts using creative diagrams.

     



  • @blakeyrat said:

    So here's my question: it's been a long, long, long time since I took a statistics course. Am I right, or is "other team's boss" right?
     

    Is that all the data? The lines are very coarsely segmented, so it seems like it's a tiny subset.

    If that is indeed all the data, then any random line in the pink area is just as valid as the blue one. If there's much more to the left and right of the area, then the odds of the green one being more correct than the blue one drop significantly.



  • @blakeyrat said:

    So here's my question: it's been a long, long, long time since I took a statistics course. Am I right, or is "other team's boss" right?

    OK, so... not a true stats guy, but I play one on TV...

    <smoke><mirrors> There's a reason the "error-bars" are drawn around the blue line. It's in some way the "most likely". You can't rule out other values (with some probability) but (if the usual assumptions hold) the blue one is in some way "better" </mirrors></smoke>

    Can't say who's "right", since we don't know the context: is this the only decision-making input? What are the consequences of being wrong?

    How are you measuring the "accuracy" of the method?

     




  • It's hard to say, without knowing more about the details of the data and the calculations and the interpretations. As to the error bars, it's probably worse than you think. My guess (again, without knowing the details) is that the error bars are the bars on a model parameter (e.g., the average of "sentiment"), and not for individual measurements of sentiment. That might not be a big deal in this case.

    But as an example, think of all of the average temperature charts you see. Then someone plots an individual temperature measurement and freaks out because it goes way beyond the error bars for the average. The basic takeaway is that we're generally more certain about some statistic than we should be.

    I don't really know what they're measuring and calling "sentiment" or how you'd be able to quantify 60% accuracy. I'd guess that's looking at individual statements and comparing the machine's assessment vs human? From an integrity standpoint, I think as long as you're open about what's being measured and displayed here, this could be reasonable. They have to understand that this is a very fuzzy measurement.

    If I were using this data, I'd be concerned whenever it wasn't far above the positive sentiment line. I think that in the beginning, I'd be trying to look into what's going on vs what the chart is telling me. It might only be useful to alert you to some catastrophe or confirm a major coup. As with any decision support system, it's important to remember that the human is still making the decisions. The automation is just helping him accumulate and analyze data. And for something like this, it sounds like it's more useful as an alarm for something gone out of control than as a precise management tool.



  • If the client sees value in it, who cares... maybe the client has clients who expect/request this input data. Maybe the data is crap now, but the accuracy can be improved later - the sooner the client goes through this iterative step, the sooner they will learn.

    If we have modern cars that work from one manufacturer, why do other manufacturers still produce the occasional bad design? Should they just be told "no - we have done this and it was crap"?

     



  • @briverymouse said:

    What do you mean by "accuracy of 60%" though? If that's for an individual "item", and you're combining N items as a sample of the general population sentiment, your uncertainty is only 40%/sqrt(N), IIRC. If you already did that, and those are the actual data points and error bars, you're completely right. You should notice this in the data, as it will be very noisy (low correlation between adjacent points of the trendline). If you don't see that, their data is better than you think. But that's still no excuse for them to be totally unaware of the accuracy of their data.

    That's true if they are talking about the standard error of the population mean (i.e. how confident you are that the line represents the actual mean sentiment). If that were the case, the "error bar" width would change for each time measurement, because it's unlikely that every time period had exactly the same number of sentiment items (I'm guessing they're doing a combination of review searching and social media data mining?). The band drawn looks like a standard deviation, which only tells you the likely range of occurrence for any individual sentiment value about the mean, and which would remain constant regardless of sample size.

    Blakey - assuming they are talking about standard deviations (since the band's width doesn't change) and not standard errors, your line would not be correct. All the band tells you is the spread of individual values about the mean for this given data set, not how accurately the mean itself was estimated. You would need the standard error to know how accurate the mean is.

    I hate citing wikipedia, but Standard deviation vs standard error
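
    The difference in one toy snippet (numbers invented):

```python
import math
import statistics

# Hypothetical daily aggregate sentiment scores
daily_scores = [0.12, -0.05, 0.30, 0.08, -0.14, 0.22, 0.01, 0.18]

sd = statistics.stdev(daily_scores)        # spread of individual values
sem = sd / math.sqrt(len(daily_scores))    # uncertainty of the mean itself

# A band drawn at +/- sd stays the same width however much data you
# collect; a band at +/- sem narrows as the sample grows.
```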



  • @Helix said:

     If the client sees value in it, who cares....

    I care... because:

    • the client is seeing data that is inaccurate
    • the client is getting conned by my graphical presentation
    • the client is seeing a misrepresentation of the true picture
    • the client is seeing value where value may not exist
    If you want to keep them as a client, don't attempt to pull the wool over their eyes.



  • JEBUS YOU PEOPLE ARE UP EARLY ON A FRIDAY

    Ok some responses:

    1) The drawing is something I popped off in Paint.NET in like 2 minutes. It's not data, it's me doodling with my mouse.

    2) The data:
    a) Each data item is rated as "positive" or "negative" on a scale from -1 to 1
    b) Data samples are from a service where the text samples are necessarily short, tons of people use sarcasm and abbreviations, and many messages are just copy-and-pastes of friends' messages. (Think: Twitter.) This is like a perfect storm of "stuff that makes Bayesian classification not very useful"
    c) I measured accuracy by rating a sample manually, then asking the program what it would have rated it. Each time I rated a sample, the DB kept track of its "guess" and my rating, and 60% of the time we agreed on whether the sample was positive. I don't believe the team I was talking to measured their error rate at all. (To be specific, I think I just checked how the most recent 100 "guesses" lined up with my most recent 100 ratings.)
    d) The way the graph is constructed, you take the total number of samples over a time period (I wager a day in this case) and total up each sample's score to come up with an aggregate score. This aggregate necessarily never deviates too far from the "0" centerline.
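
    For what it's worth, (c) and (d) boil down to something like this in Python (my paraphrase; the normalisation by item count is my assumption about why the line hugs the centerline):

```python
def accuracy(guesses, ratings):
    """Fraction of items where the classifier's guess lands on the
    same side of zero as the human rating (both scored in [-1, 1])."""
    agree = sum((g > 0) == (r > 0) for g, r in zip(guesses, ratings))
    return agree / len(guesses)

def daily_aggregate(scores):
    """Total of one day's per-item scores, divided by the item count
    so the aggregate stays in [-1, 1] around the zero centerline."""
    return sum(scores) / len(scores)
```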

    Maybe the way I'm visualizing the error bars is wrong. Like I said, it's been a long time since stats class.



  • @Cassidy said:

    @Helix said:

     If the client sees value in it, who cares....

    I care... because:

    • the client is seeing data that is inaccurate
    • the client is getting conned by my graphical presentation
    • the client is seeing a misrepresentation of the true picture
    • the client is seeing value where value may not exist
    If you want to keep them as a client, don't attempt to pull the wool over their eyes.

    Wow I agree with Cassidy for once. Yes, this is exactly how I feel too.

    They actually used the "well the client wanted this data" argument also, but at that point I was sick of debating it. The correct answer is to explain to the client why, no, no they don't really want that "data".

    I mean, it's fine for display in the building lobby as a fancy graph to impress visitors, but in this case it was mixed-in with a bunch of actual facts. What they're doing is clearly, to me, misleading to the client and unethical.



  • So, would a graph of data with an error margin be more accurately represented with the error margin (the pink area) represented as a gradient darkest around the trendline and fading off vertically? So that there's a lower probability that the green line is actually the case, but it's still quite possible...



  • @sprained said:

    So, would a graph of data with an error margin be more accurately represented with the error margin (the pink area) represented as a gradient darkest around the trendline and fading off vertically? So that there's a lower probability that the green line is actually the case, but it's still quite possible...

    It depends on what you are trying to communicate. If the goal is to describe how likely individual data points are given the mean, then that could be a valid way to display the information, but you would need to know the Probability Density Function. If your goal is to describe how confident you are that the number you've stated is the mean, then no.
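
    A sketch of the gradient idea, assuming normally distributed error (the sigma is invented): shading is just the normal density rescaled so the trendline itself is fully opaque:

```python
import math

def band_opacity(offset, sigma):
    """Relative shading for a point `offset` above/below the trendline:
    1.0 on the line, fading smoothly toward 0 at the fringes."""
    return math.exp(-(offset ** 2) / (2 * sigma ** 2))

# With sigma = 0.2, a line two sigma off the trendline is drawn much
# fainter, but not at zero: "less likely but still quite possible"
```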



  • Hey Blakey, I just figured out how you can substantially increase your salary. Replace "sentiment data" with something like "50-day moving average" and you're a stock analyst. The data analysis is probably just as valid.


    Of course, if you have scruples over misrepresenting customer sentiment to a client, you might have a problem with using shady data to influence where people invest their money.



  •  The programs are distributed in the hope that they will be useful, but WITHOUT ANY WARRANTY

     

    i accept the terms in the license agreement



  • @sprained said:

    So, would a graph of data with an error margin be more accurately represented with the error margin (the pink area) represented as a gradient darkest around the trendline and fading off vertically? So that there's a lower probability that the green line is actually the case, but it's still quite possible...

    Yeah, I'm really not sure. I need to go back to school I guess.



  • You're correct about the green line being a possibility given the margin of error, but I recall that by using some really fancy math using standard deviation combined with the number of variables used to factor the trend line, you can also figure out the "gradient" of the margin (different from the standard deviation bell-curve, since it composites all of the factors' standard deviations), such that the outer fringes of the band are, say, 5% likely while the inner areas are far more likely. If you determine that and find a very defined gradient, the green line is indeed "less" likely but still not impossible. Don't quote me on that, though. It's also been a while since I had to worry about this sort of thing in school.

    I think TRWTF is your boss didn't inform the client of this issue from the get-go. I'm sure you guys, being familiar with this stuff, knew beforehand that you can't get very accurate statistics from social data sources, so giving a client some expectation of "good" data during the contract signing is at best stupid, and at worst dishonest and possibly fraudulent. If you guys continue this charade, you risk the client canceling the contract altogether and possibly taking some legal action. Of course, if you hadn't known about this issue before, it's definitely best to inform the client right now instead of letting them discover it later. They'd likely be more willing to accept the issue if you bring it forward rather than lying to them about it and risking a massive eruption later.

    As you know, Internet traffic is a very difficult metric, just by its nature. With spiders, caching, faulty user agents, click spammers, people who insist on doubleclicking links, people who dump their cookies every day, proxies, and a myriad of other issues, it's damn near impossible to obtain good statistics. In the past I've been asked to compile traffic statistics broken down to locations, browser usage, repeat traffic, etc., and I've always told them that I can do that, but they must understand that the traffic won't be nearly 100% accurate. Heck, I've gotten local ads pertaining to a location in an entirely different state because they don't always get my location right. Once I set their expectations, they're usually less belligerent when I show them the stats.



  • @RHuckster said:

    I think TRWTF is your boss didn't inform the client of this issue from the get-go.

    I think TRWTF is that you didn't read the goddamned post. It's not my team building this, it's a completely different team in a completely different city, that we just happened to do a meet-and-greet with since we're working on kind of similar-a-bit stuff.

    Don't imply a team I'm on would ever give something like this to a client.



  • @blakeyrat said:

    So here's my question: it's been a long, long, long time since I took a statistics course. Am I right, or is "other team's boss" right?

    "Both are right", "either one is right (and the other wrong)", and "both are wrong" are all potentially correct answers. Without knowing the CAUSE of the error and the statistical distribution, it is impossible to determine which.

    That being the case, I completely agree with "I would never give this to a client"... not because it is "wrong", but rather because it is meaningless without information about the root cause(s) of the error bands.

    [And yes, I do this type of statistics, not on "social" data but on "industrial" data, and finding out the answers to these questions can range from fairly easy to downright impossible, with "painfully difficult" being the most common.]



  •  as per cpu wizard, could be rectangle function or root mean square etc.....



  • @Helix said:

     as per cpu wizard, could be rectangle function or root mean square etc.....

    Guys these responses are not useful. Just don't type them.



  • @blakeyrat said:

    @Helix said:

     as per cpu wizard, could be rectangle function or root mean square etc.....

    Guys these responses ~~are not useful~~ are not what BlakeyRat wants to hear, since they don't directly support his point of view. Just don't type them.

    'Nuff Said.



  • @TheCPUWizard said:

    @blakeyrat said:
    @Helix said:

     as per cpu wizard, could be rectangle function or root mean square etc.....

    Guys these responses ~~are not useful~~ are not what BlakeyRat wants to hear, since they don't directly support his point of view. Just don't type them.

    'Nuff Said.

    I'm still trying to figure out who's easier to troll: Blakey or SpectateSwamp.



  • @RHuckster said:

    @TheCPUWizard said:
    @blakeyrat said:
    @Helix said:

     as per cpu wizard, could be rectangle function or root mean square etc.....

    Guys these responses ~~are not useful~~ are not what BlakeyRat wants to hear, since they don't directly support his point of view. Just don't type them.

    'Nuff Said.

    I'm still trying to figure out who's easier to troll: Blakey or SpectateSwamp.

    What definition of the word "troll" are you using in this context?



  • @blakeyrat said:

    @Helix said:

     as per cpu wizard, could be rectangle function or root mean square etc.....

    Guys these responses are not useful. Just don't type them.

     

    http://lmgtfy.com/?q=Uncertainty

     



  • Yes, thank you, I understand the concept of uncertainty exists. That doesn't mean a long-winded post that sums up to "I'm uncertain" is a useful thing to post on this forum.



  •  If you're talking about sentiment analysis, the funny part is that 70% is actually extremely good - even if you compare two humans, they only agree about 70% of the time, so a system that agrees with a particular human at about a 70% rate is as good as a human.



  • @Cat said:

    If you're talking about sentiment analysis, the funny part is that 70% is actually extremely good - even if you compare two humans, they only agree about 70% of the time, so a system that agrees with a particular human at about a 70% rate is as good as a human.

    Yeah, you're spot-on. One of the challenges with data sources like Twitter is that, devoid of context, a human can't do much better than a Bayesian classifier. If you hire 20 guys to just look at tweets and rate them, then unless they know the people sending them (they don't) and have the context the original tweet appeared in (not just the conversation, but the surrounding tweets), they have no clue a high proportion of the time.

    Yet Another Reason I would never show "data" like that to a client.



  • What "units" are being used to rate things from -1 to 1? Is it a scale of like/dislike concerning whatever they tweeted about?



  • @lettucemode said:

    What "units" are being used to rate things from -1 to 1? Is it a scale of like/dislike concerning whatever they tweeted about?

    I can't speak for their implementation.

    In my implementation, the human rated the tweet as either 100% positive or 100% negative. The algorithm assigned a more fractional value, like .68 positive, but for checking accuracy I just looked at which side of zero it was on. (Actually, I think I set low scores, like anything below 0.25 in magnitude, to zero so they were "neutral", but it's been a few years so I don't remember for sure.)
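
    Roughly this, if memory serves (the 0.25 cutoff is the part I'm least sure about):

```python
def label_from_score(score, neutral_band=0.25):
    """Map a classifier score in [-1, 1] to a discrete label, zeroing
    out low-magnitude scores as "neutral"."""
    if abs(score) < neutral_band:
        return "neutral"
    return "positive" if score > 0 else "negative"
```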

