Vocaloid file format (VSQ) - MIDI done wrong



  • I recently discovered the most WTF file format ever: Vocaloid's VSQ file format.

    To give you a glimpse of how it works, here is a section of such a file, as decoded by a Perl module (MIDI::Opus, so the events you see map directly to MIDI events):

                                                                                
    MIDI::Opus->new({                                                                
      'format' => 1,                                                                 
      'ticks'  => 480,                                                               
      'tracks' => [   # 2 tracks...                                                  
                                                                                     
        # Track #0 ...                                                               
        MIDI::Track->new({                                                           
          'type' => 'MTrk',                                                          
          'events' => [  # 4 events.                                                 
            ['track_name', 0, 'Master Track'],                                       
            ['set_tempo', 0, 500000],                                                
            ['time_signature', 0, 4, 2, 24, 8],                                      
            ['set_tempo', 7765, 674157],                                             
          ]                                                                          
        }),                                                                          
    

    Okay, up to here everything is sane: the typical, mostly empty track 0 that
    only contains a name and global metadata (tempo, etc.).

                                                                                
        # Track #1 ...                                                               
        MIDI::Track->new({                                                           
          'type' => 'MTrk',                                                          
          'events' => [  # 25902 events.                                             
            ['track_name', 0, 'Voice1'],                                             
            ['text_event', 0, "DM:0000:[Common]\x0aVersion=DSB301\x0aName=Voice1\x0aColor=181,162,123\x0aDynamicsMode=1\x0aPlayMode=1\x0a[Master]\x0aPreMeasure=4\x0a[Mixer]\x0aMasterFed"],
            ['text_event', 0, "DM:0001:er=0\x0aMasterPanpot=0\x0aMasterMute=0\x0aOutputMode=0\x0aTracks=1\x0aFeder0=0\x0aPanpot0=0\x0aMute0=0\x0aSolo0=0\x0a[EventList]\x0a0=ID#0000\x0a7680=ID"],
    ...                                                                              
    

    What is THIS? An encoded INI file? YES! And all the text events that make up
    the INI file have a delta time of zero and sit at the start of the track.

    The INI file is split into text events of fixed length, each tagged with a
    "DM:nnnn:" sequence prefix, and not e.g. into lines. WTF?
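
    To get the INI text back, the chunks have to be glued together again. A minimal sketch in Python, assuming the events are available as (type, delta, payload...) tuples like in the dump; the helper name and the trimmed sample data are hypothetical, and only the "DM:nnnn:" prefix is taken from the file above:

```python
import re

# Assumption: every INI chunk starts with "DM:<sequence number>:" as in the dump.
CHUNK_PREFIX = re.compile(r"^DM:(\d+):")

def reassemble_ini(events):
    """Join the payloads of all DM-prefixed text events in sequence order."""
    chunks = {}
    for kind, _delta, *data in events:
        if kind != "text_event" or not data:
            continue
        match = CHUNK_PREFIX.match(data[0])
        if match:
            # Strip the prefix and file the chunk under its sequence number.
            chunks[int(match.group(1))] = data[0][match.end():]
    return "".join(chunks[index] for index in sorted(chunks))

# Trimmed, hypothetical sample in the shape of the decoded track.
events = [
    ("track_name", 0, "Voice1"),
    ("text_event", 0, "DM:0000:[Common]\nVersion=DSB301\n[Mixer]\nMasterFed"),
    ("text_event", 0, "DM:0001:er=0\nMasterPanpot=0\n"),
    ("control_change", 0, 0, 99, 96),
]
ini_text = reassemble_ini(events)
print(ini_text)
```

    Note how "MasterFeder" only becomes a whole word again after joining, because the chunk boundary cuts right through it.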

                                                                                
            ['control_change', 0, 0, 99, 96],                                        
            ['control_change', 0, 0, 98, 0],                                         
            ['control_change', 0, 0, 6, 0],                                          
            ['control_change', 0, 0, 38, 0],                                         
            ['control_change', 0, 0, 98, 1],                                         
            ['control_change', 0, 0, 6, 0],                                          
            ['control_change', 0, 0, 38, 0],                                         
            ['control_change', 0, 0, 98, 2],                                         
            ['control_change', 0, 0, 6, 0],                                          
            ['control_change', 0, 0, 99, 83],                                        
            ['control_change', 0, 0, 98, 2],                                         
            ['control_change', 0, 0, 6, 1],                                          
            ['control_change', 5760, 0, 99, 96],                                     
    ...                                                                              
    

    That's right. All there is are control changes. Everywhere. No notes, no
    lyrics, no nothing. Now the format would be even more WTF if these events
    somehow encoded the phonemes and the note pitches... but AT LEAST they don't
    do that.

    But if you look closely, those control changes look a bit redundant... you see
    multiple control changes for the same controllers at the same time! But this
    is just a standard MIDI WTF: controllers 99 and 98 select a Non-Registered
    Parameter Number (NRPN) by MSB and LSB (the registered variant, RPN, uses 101
    and 100), and 6 and 38 write the data MSB and LSB into it. These are just the
    parameters you can set in the application, so this part can be considered
    sane. We do know, however, that the notes and lyrics are NOT encoded with
    these!
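
    The select-then-write pattern can be decoded with a tiny state machine. This is only a sketch: controllers 99/98 (NRPN select) and 6/38 (Data Entry) are standard MIDI, but the function name, the output layout and ignoring the channel are my own simplifications, and what the individual parameters mean in Vocaloid is not decoded here:

```python
NRPN_MSB, NRPN_LSB = 99, 98   # select the parameter (standard MIDI NRPN)
DATA_MSB, DATA_LSB = 6, 38    # write the value (standard MIDI Data Entry)

def decode_nrpn(events):
    """Collect one record per Data Entry MSB write, with its optional LSB."""
    writes = []
    now = 0                   # absolute time, summed from the delta times
    msb = lsb = 0             # currently selected parameter
    for kind, delta, *data in events:
        now += delta
        if kind != "control_change":
            continue
        _channel, controller, value = data
        if controller == NRPN_MSB:
            msb = value
        elif controller == NRPN_LSB:
            lsb = value
        elif controller == DATA_MSB:
            writes.append({"time": now, "param": (msb, lsb), "msb": value, "lsb": None})
        elif controller == DATA_LSB and writes:
            writes[-1]["lsb"] = value
    return writes

# The control changes from the dump above.
dump = [
    ("control_change", 0, 0, 99, 96), ("control_change", 0, 0, 98, 0),
    ("control_change", 0, 0, 6, 0), ("control_change", 0, 0, 38, 0),
    ("control_change", 0, 0, 98, 1), ("control_change", 0, 0, 6, 0),
    ("control_change", 0, 0, 38, 0), ("control_change", 0, 0, 98, 2),
    ("control_change", 0, 0, 6, 0), ("control_change", 0, 0, 99, 83),
    ("control_change", 0, 0, 98, 2), ("control_change", 0, 0, 6, 1),
    ("control_change", 5760, 0, 99, 96),
]
writes = decode_nrpn(dump)
for write in writes:
    print(write)
```

    So the "redundant" events are really four parameter writes, all at time 0.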

                                                                                
            ['text_event', 186720, ''],                                              
          ]                                                                          
        }),                                                                          
    ····                                                                             
      ]                                                                              
    });                                                                              
    

    And this part is normal again.

    Ok... and what's in the INI file? Why, the notes of course! Along with timing info!

    Let's see:

                                                                                
    [Common]                                                                         
    Version=DSB301                                                                   
    Name=Voice1                                                                      
    Color=181,162,123                                                                
    DynamicsMode=1                                                                   
    PlayMode=1                                                                       
    [Master]                                                                         
    PreMeasure=4                                                                     
    [Mixer]                                                                          
    MasterFeder=0                                                                    
    MasterPanpot=0                                                                   
    MasterMute=0                                                                     
    OutputMode=0                                                                     
    Tracks=1                                                                         
    Feder0=0                                                                         
    Panpot0=0                                                                        
    Mute0=0                                                                          
    Solo0=0                                                                          
    

    Okay, "Feder" is probably a misspelled "Fader". Comma-separated values in an
    INI file are a bit odd, but for a color it's nothing too weird.

                                                                                
    [EventList]                                                                      
    0=ID#0000                                                                        
    7680=ID#0001                                                                     
    7920=ID#0002                                                                     
    8040=ID#0003                                                                     
    8160=ID#0004                                                                     
    ...                                                                              
    

    Apparently, this is a mapping from timestamp to event. This BTW means there
    can only be one event at a given timestamp if this is to be a standard INI
    file! Apparently, Vocaloid files always fulfill that, though. What is with this
    ID# stuff?

                                                                                
    [ID#0000]                                                                        
    Type=Singer                                                                      
    IconHandle=h#0000                                                                
    ...                                                                              
    

    Ah, so the ID# stuff is the name of the section to look in. Okay, this is an
    event, somewhat like a MIDI patch change event. h#0000 is of course also the
    name of an INI section, but we're going for the notes here.
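
    Following that indirection is easy enough with a stock INI parser. A rough sketch using Python's configparser on a trimmed copy of the sections shown in this post (the function name is made up, and I am assuming the real files contain nothing that trips the parser):

```python
import configparser

# Trimmed copy of sections quoted in this post.
SAMPLE = """\
[EventList]
0=ID#0000
7920=ID#0002
[ID#0000]
Type=Singer
IconHandle=h#0000
[ID#0002]
Type=Anote
Length=120
Note#=64
LyricHandle=h#0002
"""

def load_events(ini_text):
    """Resolve [EventList] into a time-sorted list of (tick, fields) pairs."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str          # keep key case such as "Note#"
    parser.read_string(ini_text)
    timeline = []
    for tick, section in parser["EventList"].items():
        timeline.append((int(tick), dict(parser[section])))
    return sorted(timeline, key=lambda item: item[0])

timeline = load_events(SAMPLE)
for tick, fields in timeline:
    print(tick, fields)
```

    With the default strict mode, configparser raises DuplicateOptionError if the same key appears twice in a section, which is exactly the one-event-per-timestamp limit mentioned above.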

                                                                                
    [ID#0002]                                                                        
    Type=Anote                                                                       
    Length=120                                                                       
    Note#=64                                                                         
    Dynamics=64                                                                      
    PMBendDepth=8                                                                    
    PMBendLength=0                                                                   
    PMbPortamentoUse=0                                                               
    DEMdecGainRate=50                                                                
    DEMaccent=50                                                                     
    LyricHandle=h#0002                                                               
    ...                                                                              
    

    And this is how a note is stored. Both note-on and note-off in one, fine,
    that's okay. It contains the note pitch, the length and all sorts of other
    nice info - but not the lyrics. These - again - are in another castle, I
    mean, INI section:

                                                                                
    [h#0002]                                                                         
    L0="a","a",0.000000,0,0                                                          
    ...                                                                              
    [h#0005]                                                                         
    L0="chu","tS M",0.000000,64,0,0                                                  
    

    WTF? WTF? WTF?

    So the lyrics handle is just an indirection to a single INI value that is itself a comma-separated list with quoted strings. Why aren't these five named values, you ask? No idea!

    The first entry is the lyrics, the second one the phonemes in a weird ASCII encoding, and the rest are some parameters.
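
    Since the value looks like a CSV row, a CSV reader can take it apart. A small sketch; the field names are my own labels, the trailing numbers stay unnamed because I don't know what they mean, and splitting the phonemes on spaces is an assumption:

```python
import csv

def parse_lyric_value(raw):
    """Split one L0 value into lyric, phonemes and the unnamed trailing numbers."""
    fields = next(csv.reader([raw]))
    lyric, phonemes, *params = fields
    return {"lyric": lyric, "phonemes": phonemes.split(), "params": params}

parsed = parse_lyric_value('"chu","tS M",0.000000,64,0,0')
print(parsed)
```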

    So to conclude: a VSQ file is a MIDI file with just timed controller events, along with an embedded INI file at the beginning of the data track. The actual song (notes, lyrics) is encoded in this INI file, and NOT the MIDI data...

    So the final question is: what were they smoking?

    And as for the "competition"... the UTAU format (UST) is a lot less WTFy. It is just an INI file, no MIDI involved there. To summarize the UTAU WTFs quickly without showing a file: each event is a lyric event, and has a length. There are no delta times - the events simply follow one another back to back, from start to end. Overlaps are right out. How to do a rest with this? Simple! Just set the lyric text to a single uppercase "R". The other WTF is that when a file contains multiple tracks, INI section names are repeated, but apparently UTAU users don't do that anyway...
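
    To illustrate the "lengths only" model: start times fall out of a running sum, and rests are just entries whose lyric is "R". This sketch works on plain (lyric, length) pairs and does not parse a real UST file; the sample notes and the function name are made up:

```python
from itertools import accumulate

def absolute_starts(notes):
    """Turn back-to-back (lyric, length) pairs into (start, length, lyric), dropping rests."""
    lengths = [length for _lyric, length in notes]
    starts = [0, *accumulate(lengths)][:-1]   # each note starts where the previous one ends
    return [
        (start, length, lyric)
        for start, (lyric, length) in zip(starts, notes)
        if lyric != "R"                        # "R" marks a rest
    ]

placed = absolute_starts([("a", 480), ("R", 240), ("chu", 480)])
print(placed)
```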



  • @OperatorBastardusInfernalis said:

    Why aren't these five named values,
    you ask? No idea!

    Efficiency, of course.

    (Also, WTF is up with the source of your post? Does your space bar stick or something?)



  • @pkmnfrk said:

    (Also, WTF is up with the source of your post? Does your space bar stick or something?)

    My guess: he wrote his message to a text file to keep Community Server from eating it, and then copy-pasted it from an 80-column terminal which does not do linefeed optimization for the copied text.



  • And I thought TRWTF is the page element with the id "ctl00_ctl00_bcr_bcr_ctl00_PostList_ctl02_ctl23_ctl00_AllTags"...

    But yes, your guess was right, urxvt messed this one up.



  • @OperatorBastardusInfernalis said:

    And I thought TRWTF is the page element with the id "ctl00_ctl00_bcr_bcr_ctl00_PostList_ctl02_ctl23_ctl00_AllTags"...

    Oh, that's just WebForms, which as everyone knows is a mature, respectable, professional framework for developing for the web.



  • Putting an ini file in a MIDI track is great WTF'ery, I agree. And to put the time dependent data in the INI file with separate time stamps is the icing on the cake. Perhaps their reason to use a MIDI track is that you can drag and drop them in most DAWs from the plugin onto a track, and that will allow the user to copy/paste their work, export it, import it in other projects, etc. That would be defensible.

    About the weird ASCII coding: that's probably a mapping like this one: http://alt-usage-english.org/ipa/ascii_ipa_combined.shtml with the syllable split over the coda. The M is a bit odd though. Perhaps it can't transcribe "a chu".

    Bless you.



  • Well... using a serialization format, be it INI, XML or JSON, to store structured information in a single MIDI event makes sense, and for e.g. associating lyrics with phonemes it would even be a good idea in Vocaloid's case.

    And yes, the reason for using the MIDI format as a base is indeed interoperability with DAWs. Vocaloid BTW also has an "export to MIDI" option, which however does exactly the same as "Save as", except that it offers the .mid file extension instead of .vsq. Otherwise, the format is exactly the same... which is another WTF. Vocaloid 3 BTW "fixes" this by offering VSQ, MID (again two identical outputs, different extension) but also VSQX (an XML-based VSQ format, no MIDI at all, which I have not yet looked into). Vocaloid also comes with a VST plugin that can understand the MIDI events from the VSQ/MID file streamed by a DAW.

    However, DAW interoperability is still quite limited by the WTFery they did in the format. E.g. you can't even shift around stuff in your DAW, like, insert a rest somewhere, because that only adjusts the timestamps of the controller events, but not of the contained INI file. In fact, you can't even SEE the notes in your DAW...

    If they had done it correctly, the format would encode:

    - meta events (controllers, or text events with special encoding, be it INI, XML, JSON, whatever) to set singer etc. at the beginning of the song, similar to MIDI's patch_change event

    - for each note, a note-on event, as well as a structured text event containing lyrics, phonemes, and all other important parameters; also, a note-off event

    - for realtime rendering of the note in a DAW, the length of the note and the next phoneme may be needed in advance. Therefore, I suggest that the note-on event's text event (stored at the same timestamp as the note-on) also contains the next phoneme and the length of the note, maybe also of the rest following the note. The note-off event then can be fully ignored by the VST plugin, but should still be generated for interoperability.

    So:

    'text_event', 0, 'VSQ:
      [Singer]
      Voice=Miku
    '
    'text_event', 480, 'VSQ:
      [Anote]
      Channel=3
      Lyrics=Do
      Phonemes=d o
      [LookAhead]
      Duration=220
      Rest=20
      NextPhonemes=p\ a
      NextPitch=65
      NextVelocity=127
    '
    'note_on', 0, 3, 60, 100
    'note_off', 220, 3, 60, 0
    'text_event', 20, 'VSQ:
      [Anote]
      Channel=3
      Lyrics=Fa
      Phonemes=p\ a
      [LookAhead]
      Duration=240
      Rest=0
      NextPhonemes=
      NextPitch=
      NextVelocity=
    '
    'note_on', 0, 3, 65, 127
    'note_off', 240, 3, 65, 0
    

    BTW, this LookAhead section would then only be needed for MID export (for DAWs to use with the VST plugin); for "offline" use of the file (editing, offline rendering), it can be generated from the following events at runtime.
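
    Generating it at runtime is just a matter of peeking at the next note. A sketch of that idea in Python, with the key names taken from my example above; the note dictionaries and the function name are invented for illustration:

```python
def add_lookahead(notes):
    """Attach a LookAhead dict to each note; notes must be sorted by start time."""
    result = []
    for index, note in enumerate(notes):
        following = notes[index + 1] if index + 1 < len(notes) else None
        lookahead = {
            "Duration": note["duration"],
            # Gap between the end of this note and the start of the next one.
            "Rest": following["start"] - (note["start"] + note["duration"]) if following else 0,
            "NextPhonemes": following["phonemes"] if following else "",
            "NextPitch": following["pitch"] if following else "",
            "NextVelocity": following["velocity"] if following else "",
        }
        result.append({**note, "lookahead": lookahead})
    return result

# The two notes from the example above, with absolute start times.
song = [
    {"start": 480, "duration": 220, "pitch": 60, "velocity": 100, "phonemes": "d o"},
    {"start": 720, "duration": 240, "pitch": 65, "velocity": 127, "phonemes": "p\\ a"},
]
annotated = add_lookahead(song)
for note in annotated:
    print(note["lookahead"])
```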

    The ASCII encoding of phonemes is not a WTF, in my opinion. Yes, some things are weird, e.g. that M is used for an [u]-like sound, and that 4 is used for [r] according to documentation, [l] in reality... and what makes this worse is that 4 is used because that's one of the seven breathing sound phonemes (1 to 6 as well as *), and [r] somewhat sounds like breathing... too bad [l] doesn't sound like breathing at all :P And p\ is used for f, which is derived from Japanese, because p and f/h sounds use the same hiragana, just with different "accents".


  • @MiffTheFox said:

    @OperatorBastardusInfernalis said:
    And I thought TRWTF is the page element with the id "ctl00_ctl00_bcr_bcr_ctl00_PostList_ctl02_ctl23_ctl00_AllTags"...

    Oh, that's just WebForms, which as everyone knows is a mature, respectable, professional framework for developing for the web.

    It's the default behaviour but like in any mature framework, you can override this and provide a custom "id generation" algorithm. For a client complaining about ids taking up too much space in the generated HTML, I had to do this. In my implementation, the parent control would generate shorter ids for its children by overriding one of the base control methods, so you would end up with "c0_c0_bcr_bcr_c0_PostList_c2_c23_c0_AllTags" instead of the above. Much better!
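
    Just to show the string mapping (this is NOT the WebForms override itself, only a Python illustration of what the shortened ids look like; the function name is made up):

```python
import re

# Auto-generated segments look like "ctl" followed by a zero-padded number.
SEGMENT = re.compile(r"ctl0*(\d+)")

def shorten_id(client_id):
    """Rewrite ctlNN segments as cN and leave named segments untouched."""
    parts = []
    for part in client_id.split("_"):
        match = SEGMENT.fullmatch(part)
        parts.append(f"c{match.group(1)}" if match else part)
    return "_".join(parts)

short = shorten_id("ctl00_ctl00_bcr_bcr_ctl00_PostList_ctl02_ctl23_ctl00_AllTags")
print(short)
```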

     



  • @bjolling said:

    For a client complaining about ids taking up too much space in the generated HTML,

    Why would a client even be looking at that? I'd respond with a lecture about the meaning of the phrase "implementation detail" and tell them to fuck off unless they have actual benchmarks proving the HTML file size is an issue.

    I don't deal with clients often.



  • I was actually referring to the CONTENT of the element - all tag names ever used here on the forums...


  • @blakeyrat said:

    @bjolling said:
    For a client complaining about ids taking up too much space in the generated HTML,

    Why would a client even be looking at that? I'd respond with a lecture about the meaning of the phrase "implementation detail" and tell them to fuck off unless they have actual benchmarks proving the HTML file size is an issue.

    I don't deal with clients often.

    It's probably worthy of a new thread but I'm lazy. Page size was indeed a problem and I was asked to squeeze every last bit out of it by optimizing JavaScript, reducing ViewState wherever possible and some other stuff that I forgot. Reducing the size of the ids was one of the quick wins. In the end it turned out that the client wanted me to do all this because they thought their (reverse) proxy couldn't handle HTTP compression because it only supported HTTP 1.0. So instead of me doing all this work, we told the client to upgrade their reverse proxy and fuck off :)


    This probably explains it better than I can: http://en.wikipedia.org/wiki/HTTP_compression#Problems_preventing_the_use_of_HTTP_compression

     

     

