Regular Expression to look for [b]tags[/b]

Sunday_Ironfoot

I'm trying to write a regular expression that looks for special tags such as [ b ]bold[/ b ] (spaces added to prevent formatting) and replaces it with <b>bold</b>. Specifically I'm trying to find a way to match anything that isn't a specific character sequence, so that the match starts with [ b ], ends with [ /b ] and contains anything that isn't [ b ].

So far I have...

Regex reBoldTags = new Regex(@"\[b\].+\[\/b\]", RegexOptions.IgnoreCase);

...problem is the . (dot) matches absolutely everything, including other [ b ] tags, but I want it to match anything that isn't another [ b ] tag. How do I use negation on specific character sequences? Cheers!
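
For what it's worth, negating a whole character sequence (rather than a single character) is usually done with a negative lookahead, (?!...). A minimal sketch, assuming the real tags are [b] and [/b] without the spaces; the class and field names are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class NegationDemo
{
    // (?:(?!\[b\]).)+? consumes one character at a time, but only while
    // the text at that position does not start another [b] tag.
    public static readonly Regex BoldNoNestedOpen = new Regex(
        @"\[b\]((?:(?!\[b\]).)+?)\[/b\]", RegexOptions.IgnoreCase);

    public static void Main()
    {
        string input = "x [b]one [b]two[/b] y";
        Match m = BoldNoNestedOpen.Match(input);
        Console.WriteLine(m.Value);           // [b]two[/b]
        Console.WriteLine(m.Groups[1].Value); // two
    }
}
```

Here the first [b] is skipped because another [b] turns up before any [/b], so the match is the inner [b]two[/b].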

Sunday_Ironfoot

W00t!!! I think I've solved it!! It turns out I have to use lazy evaluation to match as few characters as possible. So rather than match .+ I match .*? e.g.

Regex reBoldTags = new Regex(@"\[b\].*?\[\/b\]", RegexOptions.IgnoreCase);

Take the following character sequence: "this is in [ b ]bold[ /b ] and is in [ b ]bold[ /b ] also."

The regex in the previous post would have matched "[ b ]bold[ /b ] and is in [ b ]bold[ /b ]", whereas lazy evaluation matches "[ b ]bold[ /b ]" and "[ b ]bold[ /b ]" separately. From there it's just a matter of stripping the tags off the ends and adding <b> tags instead. I'm trying to prevent malicious users sticking [ b ] and [ i ] tags in willy-nilly and thus screwing up the formatting of the rest of the page instead of keeping it to their comment/blog post. Here's the code in full...

public static string Bold(string text)
{
	Regex reBoldTags = new Regex(@"\[b\].*?\[\/b\]", RegexOptions.IgnoreCase);
	foreach (Match match in reBoldTags.Matches(text))
	{
		string oldValue = match.Value;
		// Strip the opening and closing tags off the ends of the match.
		string value = oldValue.Substring("[b]".Length);
		value = value.Substring(0, value.Length - "[/b]".Length);

		text = text.Replace(oldValue, "<b>" + value + "</b>");
	}

	return text;
}

Now all I need is someone to turn this into a WTF! :-)

* The spaces in [ b ] etc. in my posts are only there because this form editor thing keeps eating the tags; the code itself needs the tags without spaces. How does one insert code snippets into this thing?
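
To see the greedy/lazy difference described above in isolation, here is a small sketch. The GreedyVsLazy class and the First helper are just illustrative names, and the tags are written without the spaces:

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // Returns the text of the first match of pattern in input.
    public static string First(string pattern, string input)
    {
        return Regex.Match(input, pattern, RegexOptions.IgnoreCase).Value;
    }

    public static void Main()
    {
        string input = "this is in [b]bold[/b] and is in [b]bold[/b] also.";

        // Greedy: .+ runs to the end of the line and backs up to the LAST [/b].
        Console.WriteLine(First(@"\[b\].+\[/b\]", input));
        // prints: [b]bold[/b] and is in [b]bold[/b]

        // Lazy: .+? stops at the FIRST [/b] it can reach.
        Console.WriteLine(First(@"\[b\].+?\[/b\]", input));
        // prints: [b]bold[/b]
    }
}
```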

Goplat

There's already a WTF there. If the language let the characters in a string be changed you could change [ ] to < > in-place and it would only be an O(n) algorithm, but since you have to make a new copy on each match it potentially becomes O(n^2).

Sunday_Ironfoot

Not sure I follow?

Goplat

Every time you do the "text = text.Replace(...)" it makes a copy of the entire text. This is very slow if there are a lot of bolds; if someone sends a message consisting of 100000 [b][/b] in a row it will take 274 seconds to process - a major vulnerability for DoS attacks.

I'm not a .NET guy but here's my shot at making a faster version. It uses StringBuilder so that the string only gets copied twice, and it takes 0.13 seconds on the aforementioned string.

public static string Bold2(string text)
{
    StringBuilder newText = new StringBuilder(text.Length);
    int lastEnd = 0;
    Regex reBoldTags = new Regex(@"\[b\].*?\[\/b\]", RegexOptions.IgnoreCase);
    for (Match match = reBoldTags.Match(text); match.Success; match = match.NextMatch())
    {
        // Copy the untouched text between the previous match and this one.
        newText.Append(text.Substring(lastEnd, match.Index - lastEnd));
        string value = match.Value;
        // Strip the leading [b] and trailing [/b] from the match.
        value = value.Substring("[b]".Length, value.Length - "[b][/b]".Length);
        newText.Append("<b>");
        newText.Append(value);
        newText.Append("</b>");
        lastEnd = match.Index + match.Length;
    }

    // Copy whatever follows the last match.
    newText.Append(text.Substring(lastEnd));
    return newText.ToString();
}

djork

@Goplat said:

public static string Bold2(string text)
{
    StringBuilder newText = new StringBuilder(text.Length);
    int lastEnd = 0;
    Regex reBoldTags = new Regex(@"\[b\].*?\[\/b\]", RegexOptions.IgnoreCase);
    for (Match match = reBoldTags.Match(text); match.Success; match = match.NextMatch())
    {
        // Copy the untouched text between the previous match and this one.
        newText.Append(text.Substring(lastEnd, match.Index - lastEnd));
        string value = match.Value;
        // Strip the leading [b] and trailing [/b] from the match.
        value = value.Substring("[b]".Length, value.Length - "[b][/b]".Length);
        newText.Append("<b>");
        newText.Append(value);
        newText.Append("</b>");
        lastEnd = match.Index + match.Length;
    }

    // Copy whatever follows the last match.
    newText.Append(text.Substring(lastEnd));
    return newText.ToString();
}

That's an awful lot of code. How about a variation on this:

Regex.Replace(inputString, @"\[b\](.*?)\[/b\]", "<b>$1</b>");

Use the case-insensitive option and make it global, and voila!

djork

Mmmkay, real .NET C# version here:

static string ReplaceBBCode(string input)
{
    return Regex.Replace(input, "\\[b\\](.*?)\\[\\/b\\]", "<b>$1</b>", RegexOptions.IgnoreCase | RegexOptions.Multiline);
}

Tested and working.
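
One thing that may be worth double-checking: in .NET, RegexOptions.Multiline only changes what ^ and $ match, while RegexOptions.Singleline is the option that lets . match a newline. So if bold text is supposed to be able to span lines, something like the sketch below might be closer to the intent. NewlineDemo and its Replace wrapper are illustrative names, not part of the code above:

```csharp
using System;
using System.Text.RegularExpressions;

public static class NewlineDemo
{
    // Same replacement as above, with the options passed in for comparison.
    public static string Replace(string input, RegexOptions options)
    {
        return Regex.Replace(input, @"\[b\](.*?)\[/b\]", "<b>$1</b>", options);
    }

    public static void Main()
    {
        string input = "[b]two\nlines[/b]";

        // Multiline: . still stops at \n, so nothing matches and the input comes back unchanged.
        Console.WriteLine(Replace(input, RegexOptions.IgnoreCase | RegexOptions.Multiline));

        // Singleline: . also matches \n, so the tag pair is replaced.
        Console.WriteLine(Replace(input, RegexOptions.IgnoreCase | RegexOptions.Singleline));
    }
}
```

For tags that open and close on the same line, both option sets give the same result.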

Sunday_Ironfoot

[quote user="djork"]Mmmkay, real .NET C# version here:

static string ReplaceBBCode(string input)
{
    return Regex.Replace(input, "\\[b\\](.*?)\\[\\/b\\]", "<b>$1</b>", RegexOptions.IgnoreCase | RegexOptions.Multiline);
}

Tested and working.[/quote]

I could kiss you!! Thank you very much! :-)

BTW BB Code stands for Bulletin Board, correct? I know it's used in bulletin boards and forums a lot, and it would give me a handy name for my class that does all this BBCode stuff.

djork

@Sunday Ironfoot said:

[quote user="djork"]Mmmkay, real .NET C# version here:

static string ReplaceBBCode(string input)
{
    return Regex.Replace(input, "\\[b\\](.*?)\\[\\/b\\]", "<b>$1</b>", RegexOptions.IgnoreCase | RegexOptions.Multiline);
}

Tested and working.[/quote]

I could kiss you!! Thank you very much! :-)

BTW BB Code stands for Bulletin Board, correct? I know it's used in bulletin boards and forums a lot, and it would give me a handy name for my class that does all this BBCode stuff.

You're welcome.

RiX0R

Note that this won't work correctly with nested tags. That might not be a problem with [b] tags because people don't usually nest them (though it [ b ]will look [ b ]like this[ /b ] and that's not what's intended[ /b ]), but it becomes a serious problem once you start introducing tags like [ quote ].

The only solution in that case is to write yourself a stack-based parser.

</pooper>
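
A rough sketch of what such a stack-based pass could look like in C#, assuming only [b], [i] and [u] are supported and that an unmatched or mis-nested tag should simply be left as plain text. StackBBCode and Convert are made-up names, and HTML-encoding of the user's text is left out:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class StackBBCode
{
    static readonly Regex TagPattern =
        new Regex(@"\[(/?)(b|i|u)\]", RegexOptions.IgnoreCase);

    public static string Convert(string text)
    {
        var pieces = new List<string>(); // output fragments, joined at the end
        // Each entry: tag name and the index of its placeholder in pieces.
        var open = new Stack<KeyValuePair<string, int>>();
        int last = 0;

        foreach (Match m in TagPattern.Matches(text))
        {
            pieces.Add(text.Substring(last, m.Index - last));
            last = m.Index + m.Length;
            string tag = m.Groups[2].Value.ToLowerInvariant();

            if (m.Groups[1].Length == 0)
            {
                // Opening tag: keep it as literal text until its close shows up.
                open.Push(new KeyValuePair<string, int>(tag, pieces.Count));
                pieces.Add(m.Value);
            }
            else if (open.Count > 0 && open.Peek().Key == tag)
            {
                // Close matches the innermost open tag: convert both ends.
                pieces[open.Pop().Value] = "<" + tag + ">";
                pieces.Add("</" + tag + ">");
            }
            else
            {
                // Stray or mis-nested close: leave it as plain text.
                pieces.Add(m.Value);
            }
        }

        pieces.Add(text.Substring(last));
        return string.Concat(pieces);
    }

    public static void Main()
    {
        Console.WriteLine(Convert("[b]will look [b]like this[/b] and that[/b]"));
        // prints: <b>will look <b>like this</b> and that</b>
    }
}
```

Because an opening tag is only turned into HTML once its own close has been seen, an unclosed [b] can't leak formatting into the rest of the page.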

HitScan

[quote user="RiX0R"]

Note that this won't work correctly with nested tags. That might not be a problem with [b] tags because people don't usually nest them (though it [ b ]will look [ b ]like this[ /b ] and that's not what's intended[ /b ]), but it becomes a serious problem once you start introducing tags like [ quote ].

The only solution in that case is to write yourself a stack-based parser.

</pooper>

[/quote]

You could also pull a crappy "close enough" version by comparing the number of matches of opening and closing tags, replacing only as many tags as the lower count. I did that once and it turned out alright, though it's probably terribly inefficient.
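
Something along these lines, as a rough C# sketch of the counting idea for [b] only. CountingBBCode and Convert are hypothetical names, and this is a guess at the approach rather than the code I actually used:

```csharp
using System;
using System.Text.RegularExpressions;

public static class CountingBBCode
{
    static readonly Regex Open = new Regex(@"\[b\]", RegexOptions.IgnoreCase);
    static readonly Regex Close = new Regex(@"\[/b\]", RegexOptions.IgnoreCase);

    public static string Convert(string text)
    {
        // Only convert as many tags as the smaller of the two counts,
        // so every emitted <b> has a </b> somewhere in the output.
        int pairs = Math.Min(Open.Matches(text).Count, Close.Matches(text).Count);

        text = Open.Replace(text, "<b>", pairs);    // first 'pairs' opening tags
        text = Close.Replace(text, "</b>", pairs);  // first 'pairs' closing tags
        return text;
    }

    public static void Main()
    {
        Console.WriteLine(Convert("[b]a[b]b[/b]"));
        // prints: <b>a[b]b</b>
    }
}
```

It only balances the counts and doesn't check the order, so input like "[/b][b]" still comes out wrong; that's the "close enough" part.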

Avenger

Here is a simplified version of the one I use, its usefulness becomes more apparent when dealing with tags such as [ quote ].

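<?php

function processBBCodeInline($in) {
    // When called as the preg callback, $in is the match array:
    // $in[1] is the tag name and $in[2] is the text between the tags.
    if (is_array($in)) {
        $in = '<' . $in[1] . '>' . $in[2]
            . '</' . $in[1] . '>';
    }

    // Run again on the result so tags nested in the content get converted too.
    return preg_replace_callback(
        '@\[(b|i|u|s)\]((?:[^\[]|\[(?!/?\1\])|(?R))+)\[/\1\]@Si',
        'processBBCodeInline',
        $in
    );
}

echo processBBCodeInline('[b]Test![/b]');

?>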