The Automated Curse Generator



  • It was 1999, and our new online marketing venture was finally off the ground and making a profit using an off-the-shelf conglomeration of bits and pieces of various content management, affiliate program, and ad servers. We'd hit all

    of the goals for our first funding tranche, and the next step was to use those millions of dollars to grow the staff from 12 to 50, half of which were software developers working directly for me.

    The project was an $8 million, nine-month development effort to build, from the ground up, the best 21st-century marketing/e-commerce/community/ad network/reporting system mousetrap possible. Leading a team of 20 people was a big

    step up, so I buckled down, reading management theory books, re-reading The Mythical Man-Month, learning the ins and outs of MS Project, Rational Rose, and RequisitePro, investing in UML and process training, and carefully poring over

    resumes to find the best candidates.

    Having assembled, trained, and indoctrinated my team in current best practice formal software development process, we went to work. We held stakeholder interviews, pored over requirements, developed use case models, charted process

    flow, designed domain entity models, built our development plan. We developed cleanly separated business logic, persistence, and user experience tiers. We followed formal test-driven development. We held weekly group code reviews. And

    slowly but surely we carefully moved forward with development.

    No, the WTF is not this overly formalized, non-Agile, upfront design, big architecture, strictly controlled waterfall development model. This was what was required to ensure that we succeeded, and helped us to hit all of the

    milestones along the way towards our goal. We of course had our share of back-and-forth with business goals vs. quality vs. scalability vs. time constraints, but overall I have never before or since been on a project which ran as

    smoothly as this one.

    Until the last week before release.

    One of the key components was our custom-built ad server, which used a single unique ID for each ad placement, handling switching of creatives all on the backend. At the time this was still uncommon, with many affiliate programs and

    ad servers hardcoding the creative images themselves directly into the HTML ad serving code. Being able to manage updates of creatives, optimize banner rotation, and take down unwanted ads on the back end was one of the major advantages

    as seen by the business users.

    "Hey Brian, quick question." It was Barry, VP of Marketing. The entire company was his brainchild, and he was CEO in all but name. "Some clients are complaining that these IDs are kind of ugly, always

    'F0DB57A3C10EE7D28277' or some other unpronounceable jumble."

    Now, these IDs were uniquely generated each time a new ad placement was created. With dozens of clients and hundreds of affiliates signed up representing thousands of web sites and tens of thousands of pages, we needed to generate

    them automatically. To avoid possible fraud, they needed to be non-sequential but unique. And they were purely used on the backend--nobody ever needed to read them out loud, so far as I could see.

    "These are only shown to users as URL parameters, no different than the session ID," I protested. More diplomatically, I asked, "What's the reason for making them easily readable out loud?"

    "Well," he admitted, "one of our biggest advertisers has legacy accounting systems which their IS department can't or won't integrate with our online reports. When talking by phone with people in different offices, they have to read

    the IDs to each other to be able to identify which accounts they are talking about."

    After thinking a moment, I realized that this was the perfect place to apply an algorithm I had learned about recently. "Markov chains!" I blurted. "We can use statistical textual analysis to generate unique random words built up from natural phonemic combinations. They won't be real words, but they will match expected English patterns, and people will be able to pronounce and read them completely naturally."

    Intrigued, Barry assented, reminding me that the release at the end of the week still had to be met.

    But I was already off, thinking through the design in my head. I grabbed my star developer Shipra, and over the next two days she and I built a corpus analyser to build the necessary statistical models, and the generator to randomly

    string them together and output the "pronounceable IDs". It was a great success. Everyone crowded around to see the server spitting out fake words like "enspattle", "flargleblum", "unclorifical",

    and "macrodestic".

    Barry was ecstatic too, "This is great! That client has been threatening to drop us because reading off those codes is slowing down their operations. They're our most well-known anchor client, so if they go, others will drop with

    them. This is exactly what we need to keep them. Let's demo it right away."

    The two of us drove up to their offices in the city, and Barry proudly told them that I, standing next to him, was the genius who came up with a way to make readable codes and increase their workers' productivity. He opened up a

    browser to the demo word generator page and clicked "New random word".

    "garglepussy" immediately popped up on the screen.

    After a silent five seconds while the client stared in horror, Barry said "Well, it is random after all. Brian, you can filter that, right?" "Sure, I'll put a bad-word list together," I said, groaning that I hadn't thought of it

    before. We ran through it a few more times, getting nice normal-sounding words like "blutterful", "trimbolid", and "anavastic". We finally left with the client happy and satisfied.

    During the drive back, Barry said, "I've been thinking about it and it's too dangerous to just have a bad-word filter. We'll never be able to think up every possible offensive-sounding combination. Can you make them sound like a

    foreign language instead of English? That way if it does come up with some curse words most people won't even realize it."

    It was a good idea--we already had a corpus analyser, and could plug in just about any text we felt like. "Sure, I'll show you some samples this afternoon," I told him.

    Shipra and I spent the next few hours running the corpus on just about any foreign text we could find. We plugged in "Lorem ipsum dolor" to get some fake Latin, she pasted in some of her personal emails transliterated from Hindi, the

    German libretto of "Die Zauberflöte", some Balzac novels cribbed off of the French Gutenberg Project, the text from some Italian airplane manufacturer's web site, and "Don Quijote".

    Barry stopped by and we tried the samples out on him, one by one.
    Latin: "Everybody pronounces it wrong, differently."
    Hindi: "Too many weird vowels, and it makes me want to slip into an Apu accent."
    German: "All those consonants and throaty sounds are too hard."
    French: "Are you kidding? Most of the letters at the end of words are silent."
    Italian: "Better, but it took me two years to learn to say 'gnocchi' right."
    Spanish: "Easy vowels, simple sounds, best yet! But some of their staff are Hispanic. Too dangerous."

    We all sat there for a few minutes trying to think of something else when Barry cried out, "I've got it!" and ran out of the room. He came back with a Japanese study book. "I'm planning to expand overseas after this release, and

    bought this book to study with since it writes everything in the English alphabet. It's perfect--there's simple vowels, only a few consonants, and no funny sounds to trip you up. Even if people pronounce a little differently it's still easy to figure out. And nobody knows what it means so it can't be offensive!"

    The next day, one day before release, having finished typing in page after page of meaningless Japanese, we were off once more to demo to the client. Barry carefully clicked "New random word."

    "koremachiko", "sabashimasu", "tobetokaga", "mitsukaremo". The client carefully read each example, and after a few minutes leaned back, chuckling, "That's perfect, Barry. They're a

    little funny, but can be read distinctly, and no chance of offense. Thanks for doing this for us, we're onboard for the launch."

    Barry was ecstatic.

    By the next morning, everyone in the office was in a great mood, too. Launch had gone smoothly, and our realtime reporting showed all of our website activity, ad serving, and commerce transactions ramping up. Everything was working

    perfectly.

    We were in the middle of a celebratory all-hands wine & cheese party in the conference room when Barry got a call and stepped out of the room. Several minutes later he came storming back into the room, waving an email he had printed

    out and yelling, "Brian, Brian, you useless screwup! What the hell is wrong with you? How do you explain this? Read it, OUT LOUD!" He shoved the paper at me, and I took it, apprehensively.

    "fukushita", "moreshite", "fukumiharuda", "youfatsu", "tokaduki", and "fukyusuka", I read, collapsing inwardly and visibly shaking as I read down the list,

    imagining the customers' staff members reading them to each other all day. Some people in the room tittered, earning sharp looks from Barry.

    "They dropped our contract!" Brian shrieked, "Half our revenue is gone! You've killed our company!"

    I started to protest that dropping the bad-word filter and using Japanese were both his idea, but I could see that this would do nothing to abate his fury. The next day I asked for, and was happily granted, a two-week vacation, which I

    used to start looking for a new job.

    Five years later at an industry mixer, I exchanged cards with a developer and saw that he worked in Barry's department at my old company. I told him I had worked there years before and was glad to see the company had recovered and

    was going strong.

    "You worked there too?" he asked, looking at my name tag. He suddenly got a strange look on his face. "Wait, are you THAT Brian, who first developed the business platform?"

    Cautiously, I replied in the affirmative.

    He broke out into a huge smile. "You're famous. You know, we're still using it. We call it The Automated Curse Generator."



  •  Too long; did read.

    +1 style

    +1 funny

    -1 payoff



  • @dhromed said:

     Too long;

     

    Just because teh internetz invented "tl;dr" you shouldn't use it when the text has more than 2 paragraphs.

    Actually you should be ashamed to use it when you're over 21 years old.

    Good story.



  • -0.5 Long but I needed a coffee break article

    +1 well written

    +1 funny

    Front page worthy, thanks for sharing.



  • Definitely front page worthy, with minimal editing.

    I think there was a tpyo or two in there, but I'm too lazy to go search for them again.



  • I think it already has LESS typos than the average front page article.



  • 10/10, Front page.



  • +1 for front page, this one beats many of the latest front page articles.



  • I love Markov chains!

    I fed my company's code of conduct to a second-order Markov chain for fun:

    • Can I participate in the newspaper? No.
    • Ensure that the Company depends on all of our competitors.
    • Our people have been, and will continue to question the rationale for these decisions
    • The goal of any country is improper.
    • To [my company], that means continually working to enhance the quality of our competitors’.


    Also I have another database with the collected Harry Potter novels. Did someone say awful machine-generated fanfiction?

    Harry felt that Davies was watching him run straight through the curtains of his shark's. "What chldish dream is this?" But she seemed to be reading one another's darkening grounds. Harry slipped upstairs to get up here, and he was staring at Harry's hair and beard, upon the castle courtyard, evidently arguing. Meanwhile Harry had not told Ron and Hermione, ordered three butterbeers from beneath the Cloak Harry and Voldemort were sharing had changed her usual knack of dressing like a Muggle, but it is time for you to life and dreams, Potter. Several times to Dumbledore, said a cold, snide voice, and Harry gripped Draco's With fumbling fingers.



  • Wow. That's a great story, and told well.



  • read it, loved it



    +1 FrontPage



  • @Welbog said:

    Harry gripped Draco's With fumbling fingers
    Sounds kinky.



  • Loved your story, but I started thinking about your approach with Markov chains, especially since it generates sequences that are not possible in Japanese (du, fa in your examples).

    Wouldn't it be better to use a simple BNF grammar to generate the random words? I have done something like it and it's much easier than typing in "page after page of Japanese". Something like this:

    word = s s s | s s s s | s s s s s
    s = cv | cv 'n'
    cv = c v
    c = 'k' | 's' | 't' | 'n' | 'h' | 'm' | 'y' | 'r' | 'w' | 'g' | 'z' | 'd' | 'b' | 'p'
    v = 'a' | 'i' | 'u' | 'e' | 'o'
    

    This also produces invalid Japanese, but it can be extended and also followed by replacing invalid combinations with valid ones. But I'm unsure how many unique combinations this method can produce.
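
    A minimal Python sketch of a generator for the grammar above. The function name `random_word` and the 50/50 chance of the trailing 'n' are assumptions made for illustration, not part of the original suggestion.

```python
import secrets

# Alphabets taken from the grammar: 14 consonants and 5 vowels.
CONSONANTS = "kstnhmyrwgzdbp"
VOWELS = "aiueo"

def random_word(rng=secrets.SystemRandom()):
    """Build one word of 3 to 5 syllables; each syllable is c v, optionally followed by 'n'."""
    syllables = []
    for _ in range(rng.choice((3, 4, 5))):
        syllable = rng.choice(CONSONANTS) + rng.choice(VOWELS)
        if rng.random() < 0.5:  # assumed probability for the optional 'n'
            syllable += "n"
        syllables.append(syllable)
    return "".join(syllables)

print(random_word())
```

    Picking the syllable count uniformly means short words come up more often than they would under a uniform draw over all possible words, and nothing here guarantees uniqueness; a real system would presumably still check each candidate against the IDs already issued.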



  • @SlyEcho said:

    it generates sequences that are not possible in Japanese (du, fa in your examples).
    These are not entirely unlikely, given that the source was an introductory textbook for English speakers.  There are some systems of romaji that allow du in place of zu.  And fa is not uncommon in introductory texts, especially for loan words (e.g., "sofaa").



  • @Flatline said:

    I think it already has LESS typos than the average front page article.
    That's because it hasn't gone through the TDWTF Anonymizer 2000.  TDWTFA2K inserts random typos and errors to defend against traps like the one Elon Musk attempted at Tesla.  It's for your protection.



  • @SlyEcho said:

    Loved your story, but I started thinking about your approach with Markov chains, especially since it generates sequences that are not possible in Japanese (du, fa in your examples).

    Japanese also has Katakana which is used to pronounce English words and names.
    Which contains fa (ファ) and du (ドゥ).



  • @SlyEcho said:

    Loved your story, but I started thinking about your approach with Markov chains, especially since it generates sequences that are not possible in Japanese (du, fa in your examples).

    That depends on the granularity. If you use bigrams as a basis, it will never generate du or fa. But the whole point was to generate pronounceable non-words, not to generate a Japanese novel, so it's not an issue.
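
    To make the granularity point concrete, here is a rough sketch of a first-order (bigram) character chain in Python. Such a chain can only emit adjacent letter pairs that occurred somewhere in its training text, so if "du" or "fa" never appear in the corpus they cannot appear in the output. The toy corpus, the '^' and '$' markers, and the function names are all assumptions for illustration; this is not the generator from the story.

```python
import random
from collections import defaultdict

def build_model(words):
    """Map each character to the list of characters seen right after it.
    '^' marks the start of a word and '$' the end (assumed marker symbols)."""
    model = defaultdict(list)
    for word in words:
        chars = ["^"] + list(word) + ["$"]
        for a, b in zip(chars, chars[1:]):
            model[a].append(b)
    return model

def generate(model, rng, max_len=12):
    """Walk the chain from '^' until '$' is drawn or max_len letters are emitted."""
    out, state = [], "^"
    while len(out) < max_len:
        state = rng.choice(model[state])
        if state == "$":
            break
        out.append(state)
    return "".join(out)

corpus = ["sakura", "tokidoki", "arigato", "shimasu", "mitsukeru"]  # tiny illustrative corpus
model = build_model(corpus)
rng = random.Random(0)
samples = [generate(model, rng) for _ in range(5)]
print(samples)
```

    Because successors are stored with repetition, frequent pairs are drawn more often, which is the "likelihood" part of the approach discussed in this post.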

    But, indeed, there was no reason to use a Markov process. If you want to reliably generate strings, any kind of grammar suffices. Normally you use Markov chains to relate the output (or input) to some probability, but in this case it is not desirable to generate strings based on likelihood, since they needed unique identifiers. BTW, your grammar is a finite state automaton, just like a standard Markov chain. It generates (3 + 4 + 5) * 2 * 14 * 5 = 1680 unique strings.

    What I didn't get at all is why the OP left over this. Ok, he didn't test it, and someone said "fuck" out loud (even though in Japanese it sounds like fook, right?) To think that someone cut business because of this alone is absurd. To quit your job because you build what was expected, is absurd too. So I guess not the whole story has been told. 



  • @TGV said:

    What I didn't get at all is why the OP left over this. Ok, he didn't test it, and someone said "fuck" out loud (even though in Japanese it sounds like fook, right?) To think that someone cut business because of this alone is absurd. To quit your job because you build what was expected, is absurd too. So I guess not the whole story has been told. 
    When the biggest client leaves, somebody's head goes up on the block.  I think the goal was to find another job before getting the axe.



  • @bstorer said:

    When the biggest client leaves, somebody's head goes up on the block.  I think the goal was to find another job before getting the axe.
    He committed professional harakiri because his honour was tainted.



  • @XIU said:

    Japanese also has Katakana which is used to pronounce English words and names.
    Which contains fa (ファ) and du (ドゥ).

    Yes, but weren't they trying to avoid English words?

    @TGV said:

    BTW, your grammar is a finite state automaton, just like a standard Markov chain. It generates (3 + 4 + 5) * 2 * 14 * 5 = 1680 unique strings.

    1680 is not a lot, is it? If you want to replace 128-bit GUIDs, you're going to get some looong strings.

    @TGV said:

    What I didn't get at all is why the OP left over this. Ok, he didn't test it, and someone said "fuck" out loud (even though in Japanese it sounds like fook, right?) To think that someone cut business because of this alone is absurd. To quit your job because you build what was expected, is absurd too. So I guess not the whole story has been told.

    Maybe he was stressed out from the pre-launch crunch. Also, 1998 was when the dot-com bubble started so there were probably other high-paying jobs available.

    What I don't get is why they didn't anticipate getting dirty words, in English and in other languages. Any random string is bound to contain recognizable patterns of any language.



  • Err... so, nobody thought of just using numbers?

    164-980-531-541-768-323-285-049 is pretty random, and pretty readable (so long as you keep the dashes), and pretty scalable.
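
    A small Python sketch of that idea, assuming 8 groups of 3 digits as in the example above. The function name `numeric_id` is hypothetical.

```python
import secrets

def numeric_id(groups=8, digits=3):
    """Return random decimal digits in dash-separated groups, e.g. 8 groups of 3."""
    return "-".join(
        "".join(secrets.choice("0123456789") for _ in range(digits))
        for _ in range(groups)
    )

print(numeric_id())
```

    Twenty-four decimal digits give 10^24 possible values, a little under 2^80, which is close to the 20-hex-digit IDs quoted in the story. Uniqueness would still have to be enforced separately, for example by checking each new ID against those already stored.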

    Also:

    @takatori said:

    "They dropped our contract!" Brian shrieked, "Half our revenue is gone! You've killed our company!"

    "You worked there too?" he asked, looking at my name tag. He suddenly got a strange look on his face. "Wait, are you THAT Brian, who first developed the business platform?"

    So which one was Brian, again?

     



  • @takatori said:

    "fukushita", "moreshite", "fukumiharuda", "youfatsu", "tokaduki", and "fukyusuka"

    Most of these words sound harmless when pronounced correctly. Of course, that's not the case for all possible words, e.g. "sanowabichi" or "fakyubichisu".



  • @derula said:

    @takatori said:
    "fukushita", "moreshite", "fukumiharuda", "youfatsu", "tokaduki", and "fukyusuka"

    Most of these words sound harmless when pronounced correctly. Of course, that's not the case for all possible words, e.g. "sanowabichi" or "fakyubichisu".


    I always got a laugh after the 2004 tsunami when journalists on TV were trying to pronounce Phuket. They all looked like naughty schoolboys who had been caught saying a swear word. Talk about unprofessional.



  • Great story, best in a while - belongs on the front page.



  • @TGV said:

    BTW, your grammar is a finite state automaton, just like a standard Markov chain. It generates (3 + 4 + 5) * 2 * 14 * 5 = 1680 unique strings.

     

    It generates (2 * 14 * 5) ^3 + (2 * 14 * 5) ^4 + (2 * 14 * 5) ^5  = 54 169 304 000 unique strings.

     v can have 5 different values.

    c can have 14 different values.

    So cv can have 5 * 14 = 70 different values.

    And s can have 70 + 70 = 140 different values.

    s s s can have 140 * 140 * 140 = 140^3 different values (all those s's are independent from each other).

    s s s s and s s s s s can have 140^4 and 140^5 different values.
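
    The count can be checked with a few lines of Python. The last part is an added aside rather than something from this post: it estimates how many syllables a word would need to cover the 128-bit GUID range raised earlier in the thread.

```python
consonants = 14
vowels = 5

cv = consonants * vowels   # 70 consonant-vowel pairs
s = cv * 2                 # 140 syllables: cv alone, or cv followed by 'n'

# Words are 3, 4, or 5 independent syllables long.
total = s**3 + s**4 + s**5
print(f"{total:,}")        # 54,169,304,000

# Smallest syllable count whose word space is at least 2**128.
syllables = 1
while s**syllables < 2**128:
    syllables += 1
print(syllables)           # 18
```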



  •  Insert one digit every 3 letters, ban all 3-letter bad words. Problem solved?



  • @spamcourt said:

    @TGV said:

    BTW, your grammar is a finite state automaton, just like a standard Markov chain. It generates (3 + 4 + 5) * 2 * 14 * 5 = 1680 unique strings.

    s s s can have 140 * 140 * 140 = 140^3 different values (all those s's are independent from each other).

    My bad. 



  • @SlyEcho said:

    Loved your story, but I started thinking about your approach with Markov chains, especially since it generates sequences that are not possible in Japanese (du, fa in your examples).

    Point taken. Ironically I actually moved to Japan and became fluent enough to work in a Japanese company, so have no excuse. o_o

    This happened more than a few years ago, so I don't remember all of the exact words involved. Let's chalk the remainder up to poetic license. Several of the real ones are absolutely burned into my memory. But all of the English examples in this story are actually from a CAPTCHA generator I wrote last month--the trigger for remembering this old story.

    @SlyEcho said:

    Wouldn't it be better to use a simple BNF grammar to generate the random words?

    That CAPTCHA actually does use something like the BNF grammar you suggest--good call!

    There's a table of starting, continuation, and terminating letter sequences. Also, there's a post-processing filter to make it try a new combination if anything on the N/G wordlist shows up embedded in the result. A different approach this time to the same problem. And, I should say, in many ways a better one.
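
    Roughly, that approach could be sketched in Python as below. The tables, the one-entry blocklist, and the function names are made-up placeholders for illustration, not the actual CAPTCHA code.

```python
import random

# Placeholder tables of starting, continuation, and terminating letter sequences.
STARTS = ["b", "bl", "tr", "an", "m", "fl"]
MIDDLES = ["utt", "imb", "av", "acr", "or"]
ENDS = ["erful", "olid", "astic", "estic", "ical"]
BLOCKLIST = {"butt"}  # stand-in for a real N/G wordlist

def contains_blocked(word, blocklist=BLOCKLIST):
    """True if any blocklisted string appears anywhere inside the word."""
    return any(bad in word for bad in blocklist)

def safe_word(rng, max_tries=100):
    """Draw start + middle + end, retrying while the result embeds a blocked string."""
    for _ in range(max_tries):
        word = rng.choice(STARTS) + rng.choice(MIDDLES) + rng.choice(ENDS)
        if not contains_blocked(word):
            return word
    raise RuntimeError("no acceptable word found")

print(safe_word(random.Random()))
```

    The substring check catches blocked strings that straddle the joins between table entries, but like any wordlist it only catches what someone thought to list.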

    +1 paying attention to details
    +1 clever programming idea

    And though uncommon, "fa/ファ" and "du/づ" are possible, though mostly in loanwords: Computer files are "ファイル/fairu"; I often hear "+α/プラスアルファ/purasuarufa" (plus alpha) in business contexts; and "続ける/つづける" is a fairly common example.

    The gojuon table is incomplete in terms of phonology, showing only the base kana without voicing marks. http://en.wikipedia.org/wiki/Hiragana has a more complete table. http://en.wikipedia.org/wiki/Japanese_phonology might be interesting.



  • @WhiskeyJack said:

    @takatori said:

    "They dropped our contract!" Brian shrieked, "Half our revenue is gone! You've killed our company!"

    "You worked there too?" he asked, looking at my name tag. He suddenly got a strange look on his face. "Wait, are you THAT Brian, who first developed the business platform?"

    So which one was Brian, again?

    Haha, oops. The shrieker was Barry.

