Cancelling out speaker audio from microphone input

Gąska

I want to filter out the sound from my speakers in the microphone input so I can, for example, have an online meeting without a headset, and without furiously toggling the mute button. My very brief and not at all thorough research in the form of googling shows that there's no ready-made software capable of that. So I plan on coding it myself.

I have zero experience with audio processing, but IIRC some other forumers have. So I'm asking for advice. My idea is to set up a virtual audio input that would capture the speaker output (on Windows, so the Stereo Mix input device) and the input from the real microphone, find the speaker audio in the mic line to calculate volume, transform the Stereo Mix line according to said volume and to calibration data (to account for imperfections in speaker and microphone), subtract the transformed Stereo Mix from the mic line and output it.

The questions I have:

How realistic is this whole thing? Can this kind of processing be done in real time? A little latency is fine (on the order of 200-500ms).
On a scale of 1 to 10, given cheap speakers and a cheap microphone, how accurate can the calibration step be?
Is subtraction the right thing to do, or is there a better method? The problem is that the desired input will be mostly human speech, but the undesired noise from speakers will also be mostly speech.
Any recommendations for real-time audio processing libraries?
Any recommended readings?
Audacity has two kinda sorta related features. One is noise removal, which takes a small sample of noise and removes it from the whole audio track. The other is vocal reduction and isolation, which separates voice from music, but it works on a single track without additional data besides a few sliders for parameters. And of course neither is real time. Is it worth looking into how they implemented it or will it be a waste of time?
Anything else I should be aware of?
Overall, how much pain should I expect? (I already know.)

Gąska

This video gives me hope.

https://www.youtube.com/watch?v=vNvJKBg3yds

Again, not real time, and also it's super-high quality with perfect sync - pretty much the opposite of what I'd have to work with.

Tsaukpaetra

@Gąska said in Cancelling out speaker audio from microphone input:

How realistic is this whole thing? Can this kind of processing be done in real time? A little latency is fine (on the order of 200-500ms).

You're guaranteed at least 100ms latency on the recording side by default, because Windows sucks, in my experience. You need to go way out of your way to get low-latency recording, and most people don't realize how delayed things are. For an example of this, simply turn on the "listen to this device" option in the sound control panel. That's supposed to be straight pass-through.

@Gąska said in Cancelling out speaker audio from microphone input:

On a scale of 1 to 10, given cheap speakers and a cheap microphone, how accurate can the calibration step be?

Eh. Depends on your algorithm and how accurate you want it. Higher accuracy means more processing and thus greater delay in the final output.

@Gąska said in Cancelling out speaker audio from microphone input:

Is subtraction the right thing to do, or is there a better method? The problem is that the desired input will be mostly human speech, but the undesired noise from speakers will also be mostly speech.

You're likely to find out how bad the artifacting is. You probably need something more like (as I would describe it) a seeded adaptive dynamic noise cancellation filter.

@Gąska said in Cancelling out speaker audio from microphone input:

Any recommendations for real-time audio processing libraries?

Have you tried RTX Voice?

@Gąska said in Cancelling out speaker audio from microphone input:

Audacity has two kinda sorta related features. One is noise removal, which takes a small sample of noise and removes it from the whole audio track. The other is vocal reduction and isolation, which separates voice from music, but it works on a single track without additional data besides a few sliders for parameters. And of course neither is real time. Is it worth looking into how they implemented it or will it be a waste of time?

One effectively filters out on the whole based on a noise profile, and the other probably does similar but more tuned. Technically it could be done in real time (with buffering) once you've established your filter parameters (assuming the noise doesn't ever change), but that's only the beginning of what you want to be doing. Learning how it's implemented will help introduce you to the algorithms used to generate filter logic, so, maybe?

@Gąska said in Cancelling out speaker audio from microphone input:

Any recommended readings?

A quick Google led me to this free resource:

Chapter 1 – Digital Sound & Music

Site is apparently a bit broken but should get you started learning the concepts you'll be programming toward.

@Gąska said in Cancelling out speaker audio from microphone input:

Overall, how much pain should I expect? (I already know.)

You better be ready for ....

Gąska

@Tsaukpaetra said in Cancelling out speaker audio from microphone input:

@Gąska said in Cancelling out speaker audio from microphone input:

Any recommendations for real-time audio processing libraries?

Have you tried RTX Voice?

The GPU shortage still hasn't ended, so no.

aitap

My very brief and not at all thorough research in the form of googling shows that there's no ready-made software capable of that.

I think that the Windows 7 sound card drivers in my Lenovo X220 have something like that built-in. I think I've seen it in sound settings elsewhere, too.

I remember reading about how it's done (I don't think I saved any links, sorry). Later I found out it's been implemented in PulseAudio and Mumble, most likely via libspeexdsp. Maybe that could save you some work.

loopback0

@Gąska said in Cancelling out speaker audio from microphone input:

@Tsaukpaetra said in Cancelling out speaker audio from microphone input:

@Gąska said in Cancelling out speaker audio from microphone input:

Any recommendations for real-time audio processing libraries?

Have you tried RTX Voice?

The GPU shortage still hasn't ended, so no.

It doesn't need an RTX card, it claims to work with any GeForce card.

There are also alternatives like Krisp which do similar without needing an Nvidia card, although Krisp is only free for a limited time each week.

edit: also several of the meeting apps have noise suppression built in which works well IME for this

cvi

@Gąska said in Cancelling out speaker audio from microphone input:

Overall, how much pain should I expect? (I already know.)

Probably this. To me, subtracting signals out like you describe is one of those things that seems simple in theory, but difficult in reality. Tons of tweaking. Difficult to make reliable.

My suggestion would be to start "offline" with two recordings, the original and one that you've recorded through the mic, speaking over it. See what you can do with those, that way you can mostly avoid dealing with various audio APIs -- raw PCM data is relatively easy to work with.

If you have an imperfect sync / miss the timing by very little, you'll introduce a new signal (depends on how much you're off, but it'll range from cracks and pops to an echo). Getting (just) the amplitude wrong is probably less bad, it'll just mean you have a weaker version of the original (possibly with opposite phase, if you subtracted out too much). But, yeah, getting 90% of the time right, might still mean that 10% of the time, people on the other end will be ripping their headphones off.

That said, I'm actually curious how far you could get with that approach.

In terms of classical signal processing, cross-correlation might let you figure out the delay (and amplitude?). I'd guess more signal = better results = longer delay.
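As a rough illustration of the cross-correlation idea, here is a minimal offline sketch on raw sample lists. The function name `estimate_delay_and_gain`, the `max_delay` search window and the synthetic test signal are all made up for the example; it is a brute-force version, not what any particular library does, and it only looks for positive correlation, so an inverted signal would be missed.

```python
import random


def estimate_delay_and_gain(reference, recorded, max_delay):
    """Return (delay_in_samples, gain) that best aligns reference to recorded.

    Brute-force cross-correlation: for each candidate lag, take the dot
    product of the reference with the recording shifted by that lag and
    keep the lag with the largest value.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_delay + 1):
        n = min(len(reference), len(recorded) - lag)
        score = sum(reference[i] * recorded[i + lag] for i in range(n))
        if score > best_score:
            best_lag, best_score = lag, score
    # Least-squares gain at the best lag: correlation divided by reference energy.
    n = min(len(reference), len(recorded) - best_lag)
    energy = sum(reference[i] ** 2 for i in range(n))
    gain = best_score / energy if energy > 0 else 0.0
    return best_lag, gain


# Synthetic check: the "mic" hears the reference 37 samples late at 40% volume.
rng = random.Random(0)
reference = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
recorded = [0.0] * 37 + [0.4 * s for s in reference]
delay, gain = estimate_delay_and_gain(reference, recorded, max_delay=100)
print(delay, round(gain, 3))
```

On real recordings the peak will be smeared by the room and the speaker response, so the estimate gets better with longer stretches of signal, which is the "more signal = longer delay" trade-off mentioned above.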

You might be able to figure out a transfer function between speaker output and microphone input in frequency space. In theory, one could model (measure? heh) that with an impulse response. In practice ... not quite that easy (limitations to what kind of impulses you can generate). Sweeping through frequencies for calibration is probably easier.
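One way to picture the frequency-sweep calibration: play a steady tone at each test frequency, record it, and compare the two signals at that frequency. The sketch below does this for a single tone by projecting both signals onto a complex sinusoid (a single-bin DFT). The name `tone_response` and the synthetic "recording" are assumptions for illustration; it assumes the two signals are already time-aligned and cover a whole number of cycles.

```python
import math


def tone_response(played, recorded, freq_hz, sample_rate):
    """Estimate the gain and phase shift (radians) at one test frequency.

    Projects both signals onto a complex sinusoid at freq_hz and returns
    the magnitude and angle of the ratio recorded / played.
    """
    def project(signal):
        re = im = 0.0
        for n, s in enumerate(signal):
            angle = 2.0 * math.pi * freq_hz * n / sample_rate
            re += s * math.cos(angle)
            im -= s * math.sin(angle)
        return complex(re, im)

    ratio = project(recorded) / project(played)
    return abs(ratio), math.atan2(ratio.imag, ratio.real)


# Synthetic check: a 500 Hz tone comes back at 30% level, lagging by 0.5 rad.
sample_rate, freq = 8000, 500.0
played = [math.sin(2.0 * math.pi * freq * n / sample_rate) for n in range(1600)]
recorded = [0.3 * math.sin(2.0 * math.pi * freq * n / sample_rate - 0.5)
            for n in range(1600)]
gain, phase = tone_response(played, recorded, freq, sample_rate)
print(round(gain, 3), round(phase, 3))
```

Repeating this over a range of frequencies gives a crude table of gain and phase per frequency, which is roughly the transfer function described above.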

Edit: That said, there are probably specialized methods for specifically isolating voice. I'm sure there's something online. Not sure how well those work with e.g. vocals in the music.

GOG

Okay. This is a bit up my alley.

For your purposes there are essentially three approaches to the problem.

Gating - you set up mic input to cut out all sounds that do not exceed a certain threshold. It's possible that the meeting software you use already has that built in (Discord does, for example). If you're going that way, you want the speaker volume as low as is reasonable, your microphone as close to your mouth as is reasonable, and the microphone position such that the speakers are in the area of least response (for a typical cardioid microphone, this will be behind it). This is what I would try first.
Bandwidth filtering - you can get rid of some background noise outside of the normal speech bandwidth by using high and lowpass filters around the frequency range of your speech. For this sort of application, it's probably not gonna do much, but combined with gating it may help reduce some extraneous room sound.
Phase cancellation - where you take the signal going to the speakers, flip it 180º out of phase, and combine it with the signal coming in through the mic. Combining two identical waveforms that are 180º out of phase will have them cancel out into silence. That's how active noise control works.
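Of the three, gating is the easiest to sketch in code. The version below is a bare-bones hard gate working on fixed-size blocks; `noise_gate`, `threshold` and `block_size` are illustrative names and values, and a usable gate would also need attack and release smoothing to avoid clicks at block edges.

```python
import math


def noise_gate(samples, threshold, block_size=256):
    """Silence every block whose RMS level falls below threshold."""
    out = []
    for start in range(0, len(samples), block_size):
        block = samples[start:start + block_size]
        rms = math.sqrt(sum(s * s for s in block) / len(block))
        # Pass loud blocks through unchanged, replace quiet ones with zeros.
        out.extend(block if rms >= threshold else [0.0] * len(block))
    return out


# Synthetic check: quiet speaker spill followed by a louder voice.
quiet = [0.01 * math.sin(2.0 * math.pi * 200.0 * n / 8000.0) for n in range(512)]
loud = [0.5 * math.sin(2.0 * math.pi * 200.0 * n / 8000.0) for n in range(512)]
gated = noise_gate(quiet + loud, threshold=0.05)
print(max(abs(s) for s in gated[:512]), gated[512:] == loud)
```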

The only problem with phase cancellation is that it works only as well as the noise waveforms match. When trying to cancel out speaker signals, you will need to delay your cancellation source to match the signal coming through your mic chain, you will need to adjust the amplitude, and you will probably need to tweak the EQ as well, to match what your speakers and mic are doing. It's worth keeping in mind that you can only get an approximation, in any case, because your mic won't only be picking up the direct sound from the speakers, but also the room sound.

It is absolutely doable, but it will be a pain. For a quick test, you can grab a pink noise file off the net, chuck it into Audacity, and play it on your setup whilst simultaneously recording your mic input. Next, compare the two waveforms. Finally, move your test signal file in the timeline to match peaks with your recorded input, invert phase and see what that gets you. This will be your baseline easy reduction, nothing but delay/invert. While you're at it, you can do the same thing but also record yourself speaking over it, to hear how noise reduction affects the tone of your voice (and it will).
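The same delay/invert baseline can be tried on raw samples instead of in Audacity. This is only a sketch under idealized assumptions: the names (`cancel`, `true_delay`, `true_gain`) are invented, the "speaker" signal is white noise, the "voice" is a plain sine, and there is no room sound or EQ mismatch. It also shows what happens when the alignment is off by a single sample.

```python
import math
import random


def cancel(mic, reference, delay, gain):
    """Subtract a delayed, scaled copy of reference from mic."""
    out = []
    for n, m in enumerate(mic):
        k = n - delay
        echo = gain * reference[k] if 0 <= k < len(reference) else 0.0
        out.append(m - echo)  # same as adding the phase-inverted copy
    return out


def rms(signal):
    return math.sqrt(sum(s * s for s in signal) / len(signal))


rng = random.Random(1)
reference = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
voice = [0.3 * math.sin(2.0 * math.pi * 220.0 * n / 8000.0) for n in range(4000)]
true_delay, true_gain = 37, 0.4
mic = [v + (true_gain * reference[n - true_delay] if n >= true_delay else 0.0)
       for n, v in enumerate(voice)]

aligned = cancel(mic, reference, true_delay, true_gain)
off_by_one = cancel(mic, reference, true_delay + 1, true_gain)

# How much speaker signal is left on top of the voice in each case.
residual_aligned = rms([a - v for a, v in zip(aligned, voice)])
residual_off = rms([a - v for a, v in zip(off_by_one, voice)])
leak_untreated = rms([m - v for m, v in zip(mic, voice)])
print(residual_aligned, residual_off, leak_untreated)
```

With white noise, a one-sample error leaves more speaker signal in the output than doing nothing at all. Real speech changes more slowly from sample to sample, so it should be more forgiving of small offsets, but the example shows why the alignment matters so much.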

GOG

Double posting for bump. There is a fourth way: ducking.

The way ducking works is that the gain on one signal is controlled by the amplitude of a second signal. In this case, you would have gain reduction set up on the microphone when the signal on your speaker output exceeds some threshold (i.e. when someone else is speaking). Yes, this means they get to talk over you.
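A minimal block-based sketch of that ducking logic, with made-up names and values (`duck`, `threshold`, `reduced_gain`, `block_size`). It switches the gain abruptly per block; a real ducker would ramp the gain to avoid audible steps.

```python
import math


def duck(mic, speaker, threshold, reduced_gain=0.1, block_size=256):
    """Attenuate mic blocks while the speaker signal is above threshold."""
    out = []
    for start in range(0, len(mic), block_size):
        mic_block = mic[start:start + block_size]
        spk_block = speaker[start:start + block_size]
        if spk_block:
            level = math.sqrt(sum(s * s for s in spk_block) / len(spk_block))
        else:
            level = 0.0  # no speaker data for this block, treat as silence
        gain = reduced_gain if level > threshold else 1.0
        out.extend(gain * s for s in mic_block)
    return out


# Synthetic check: the far end is silent for 512 samples, then starts talking.
mic = [0.5] * 1024
speaker = [0.0] * 512 + [0.3] * 512
ducked = duck(mic, speaker, threshold=0.05)
print(ducked[0], ducked[-1])
```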

Having said all of the above, however (and mentioned it to the wife, who also mucks about with audio), the absolute best thing you can do is simply: use a headset (or just headphones and a mic).

Any sort of spillover mitigation you do will only be partially successful, will degrade the quality of your signal (i.e. what you're saying) and will probably be annoying to listen to on the other end. Since you're gonna be doing this live, you don't have the benefit of tweaking it on the fly and you won't know what the results are, since you won't hear them on your end.

Prolly not worth it, mate.

Gąska

@GOG said in Cancelling out speaker audio from microphone input:

Double posting for bump. There is a fourth way: ducking.

Oh, I already planned on ducking up a lot

Having said all of the above, however (and mentioned it to the wife, who also mucks about with audio), the absolute best thing you can do is simply: use a headset (or just headphones and a mic).

I've got an ear infection and a doctor's recommendation of avoiding headphones of any kind. That was my primary motivation. I know I'll get back to health long before I even have a prototype, but still - wouldn't it be cool not to have to use a headset?

But anyway thanks for all the tips.

Tsaukpaetra

@Gąska said in Cancelling out speaker audio from microphone input:

headphones of any kind

I've had pretty good success with bone induction types. Maybe they could work for you too?

I don't have an ear infection, just they get all inflamed when wearing headphones or earbuds.

Gąska

@Tsaukpaetra

Conduction, not induction. You don't want any induction inside your ear.
Still skin contact.

GOG

@Gąska said in Cancelling out speaker audio from microphone input:

I've got an ear infection and a doctor's recommendation of avoiding headphones of any kind.

Oh, gotcha. That does change things considerably.

Honestly, the best (meaning, least annoying to everyone else involved) would probably be to have the mic positioned with speakers in the dead zone (for cardioid response that's directly behind, hyper-cardioid behind and to the sides) and push-to-talk (possibly with a bit of gating as well). That way you won't be broadcasting unless you intend to, and there won't be any weird comb filtering that's bound to happen if you try noise cancellation (since you'll never get the cancellation wave perfect, there are bound to be some nasty frequencies that get amplified every now and again and annoy the hell out of anyone listening).

Gribnit

@Gąska as indicated already latency is the problem. Noise-cancelling circuits in headphones are based on fast local buffers. You will want a hardware solution, sadly but most likely, although a true real-time machine could maybe also handle this. Arduino or similar platform with some special DSP parts seems indicated.

ixvedeusi

The term you're looking for is echo cancellation; and judging by the mixed experiences I've had with it I'm thinking it's more difficult to get right than it might seem.

Apart from the delay there's the problem that the path from the DAC through the speakers and the microphone back to the ADC will distort the signal, so just subtracting the audio output from the mic input will probably not produce the desired results even if you get the overall delay right.
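Echo cancellers usually deal with that distortion by learning the speaker-to-mic path with an adaptive filter rather than subtracting a fixed copy. The sketch below uses a normalized LMS (NLMS) filter, which is one common textbook choice and not something this thread prescribes; the function name, `taps`, `step` and the two-tap "room" are assumptions for the example. It also assumes nobody is speaking into the mic while it adapts; real cancellers pause adaptation when both sides talk at once.

```python
import math
import random


def nlms_echo_cancel(mic, reference, taps=32, step=0.5, eps=1e-8):
    """Return mic with an adaptive estimate of the speaker echo removed."""
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    out = []
    for m, r in zip(mic, reference):
        history.insert(0, r)
        history.pop()
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        error = m - echo_estimate  # what is left after removing the estimate
        power = sum(x * x for x in history) + eps
        scale = step * error / power
        # Nudge each weight toward whatever would have reduced this error.
        weights = [w + scale * x for w, x in zip(weights, history)]
        out.append(error)
    return out


def rms(signal):
    return math.sqrt(sum(s * s for s in signal) / len(signal))


# Synthetic check: the mic hears two delayed, attenuated copies of the speaker.
rng = random.Random(2)
reference = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
mic = [0.5 * (reference[n - 5] if n >= 5 else 0.0)
       + 0.2 * (reference[n - 9] if n >= 9 else 0.0)
       for n in range(4000)]
cleaned = nlms_echo_cancel(mic, reference)
print(rms(mic[-500:]), rms(cleaned[-500:]))
```

In this idealized setup the residual shrinks to almost nothing once the filter has converged. With a real room, a longer echo tail and someone talking over it, the result will be a lot less clean.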

acrow

@Gąska Unless you're planning to make this into a hobby, it'll be a lot of pain for no real gain.

My advice? Get a directional microphone, and a pre-amp with auto-squelch.