I'm making a simple Code Library, and though someone on this forum in VB6 land made one that looks sharp, I thought this would be a fun project, and one I could tailor to fit my needs. One of the things I wanted to do was syntax highlighting on the displayed files.
Here is my method as of now, which does work well, but is a bit slow when reading large files.
The lowdown - I have two files (which I'll attach in the zip, and which the program will have to be pointed at should you decide to run it and help me). The form load event tells the program where they are. They are read into two ArrayLists, and then, upon choosing "New" from the file menu, a chosen C# or VB.NET file is loaded into the RichTextBox and parsed by the Hilight method using RichTextBox.Find(). Like I said, it does work, but if I load a larger file, it takes a few seconds to parse, and this would be annoying given the ultimate layout of the program (i.e., the typical "Code Library" you've seen in the past: TreeView on the left with snippets, click puts them into the RichTextBox).
Can anyone point me in the right direction, or just give me general advice on how to speed this program up a bit? Originally, I was splitting the lines as they were read in and manipulating the colors with the RichTextBox, but I was having some trouble putting them back together, and I was missing some of the highlighting (i.e., when I had things close together, it wouldn't split them right: int i=new int; for example didn't split because i=new looked like one word).
This method seems to work without changing any of the coding, but it is a bit slow. Again, I'd much rather have general advice or pseudocode, as this is mostly an exercise in learning C#...
The i=new problem you can solve by defining operators (as does Notepad++). The operators are treated as word delimiters.
What you could do is read the text into a string, split on spaces so that you have an array of strings, and then split each element of that array by operators. Either that, or just split the whole text by spaces and operators. This will probably require a custom splitting function. After that you can join it again into a single string, also requiring a custom function to deal with the operators.
Another way, and maybe quicker, would be to read the string into a char array and parse it character by character, checking each char against space and a list of operators. On each occurrence of one of them, you read back to the previous occurrence and check the word in between against a list of keywords.
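A rough sketch of that character-by-character scan, in case it helps. The class name KeywordScanner, the method FindKeywords and the delimiter set are my own placeholders, not anything from the attached project, and the sketch only reports where keywords are; it does not touch the RichTextBox.

```csharp
using System;
using System.Collections.Generic;

public static class KeywordScanner
{
    // Hypothetical delimiter set: whitespace plus the operators that should split words.
    private static readonly HashSet<char> Delimiters =
        new HashSet<char>(" =-+/*.:<>;[]{}\t()\"#&\r\n");

    // Returns the start index and length of every keyword found in text.
    public static List<(int Start, int Length)> FindKeywords(
        string text, ICollection<string> keywords)
    {
        var hits = new List<(int Start, int Length)>();
        int wordStart = 0;

        // Run one position past the end so the last word is checked too.
        for (int i = 0; i <= text.Length; i++)
        {
            bool atBoundary = i == text.Length || Delimiters.Contains(text[i]);
            if (!atBoundary) continue;

            // Everything between the previous delimiter and this one is one word.
            int length = i - wordStart;
            if (length > 0 && keywords.Contains(text.Substring(wordStart, length)))
                hits.Add((wordStart, length));

            wordStart = i + 1;
        }
        return hits;
    }
}
```

Because = is a delimiter here, int i=new int; comes back as three hits (int, new, int) rather than treating i=new as one word. The positions could then be handed to RichTextBox.Select(start, length) followed by setting SelectionColor.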
A state machine is essentially an abstract way of designing your algorithm. Once you have your states defined, you then implement that in a class. Your class will just read in each character at a time and process the data accordingly.
Off the top of my head, you can really have 3 states:
In Normal Code
In Quotes
In Comments
When you are in Normal Code, you look for keywords (probably a binary search on a sorted ArrayList) and color them accordingly.
When you are in Quotes, you ignore all keyword and comment characters.
When you are in Comments, you ignore all keyword and quote characters.
You transition in and out of states based on what you read in, so if you read a " in, when you are in Normal Mode, you change from Normal code to In Quotes. You go back to normal mode when you hit the next ".
Same thing with Comments, if you are in normal mode and you read a // or a /*, you change from Normal Mode to In Comments. While in comments you read until a line break or a */ and then you return back to normal mode.
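Here is a minimal sketch of those states, under a few assumptions of mine: the comment state is split in two because // and /* have different exits, the keyword list is a tiny illustrative subset, and the names HighlightStateMachine, Scan and SpanKind are made up for the example. Escaped quotes inside strings are deliberately not handled here.

```csharp
using System;
using System.Collections.Generic;

public enum SpanKind { Keyword, Quoted, Comment }

public static class HighlightStateMachine
{
    private enum State { Normal, InQuotes, InLineComment, InBlockComment }

    // Must stay sorted (ordinal order) for Array.BinarySearch to work.
    private static readonly string[] Keywords =
        { "for", "int", "new", "private", "string", "void" };

    public static List<(int Start, int Length, SpanKind Kind)> Scan(string text)
    {
        var spans = new List<(int Start, int Length, SpanKind Kind)>();
        State state = State.Normal;
        int spanStart = 0;   // where the current string or comment began
        int wordStart = -1;  // where the current word began, or -1 if none

        // Checks the pending word against the keyword list and resets it.
        void FlushWord(int end)
        {
            if (wordStart < 0) return;
            string word = text.Substring(wordStart, end - wordStart);
            if (Array.BinarySearch(Keywords, word, StringComparer.Ordinal) >= 0)
                spans.Add((wordStart, word.Length, SpanKind.Keyword));
            wordStart = -1;
        }

        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            char next = i + 1 < text.Length ? text[i + 1] : '\0';
            switch (state)
            {
                case State.Normal:
                    if (char.IsLetterOrDigit(c) || c == '_')
                    {
                        if (wordStart < 0) wordStart = i;
                        break;
                    }
                    FlushWord(i);
                    if (c == '"') { state = State.InQuotes; spanStart = i; }
                    else if (c == '/' && next == '/') { state = State.InLineComment; spanStart = i; }
                    else if (c == '/' && next == '*') { state = State.InBlockComment; spanStart = i; i++; }
                    break;

                case State.InQuotes: // keywords and comment markers are ignored
                    if (c == '"')
                    {
                        spans.Add((spanStart, i - spanStart + 1, SpanKind.Quoted));
                        state = State.Normal;
                    }
                    break;

                case State.InLineComment: // ends at the line break
                    if (c == '\n' || c == '\r')
                    {
                        spans.Add((spanStart, i - spanStart, SpanKind.Comment));
                        state = State.Normal;
                    }
                    break;

                case State.InBlockComment: // ends at the closing */
                    if (c == '*' && next == '/')
                    {
                        i++;
                        spans.Add((spanStart, i - spanStart + 1, SpanKind.Comment));
                        state = State.Normal;
                    }
                    break;
            }
        }

        // Close whatever is still open at the end of the text.
        if (state == State.Normal) FlushWord(text.Length);
        else spans.Add((spanStart, text.Length - spanStart,
                        state == State.InQuotes ? SpanKind.Quoted : SpanKind.Comment));
        return spans;
    }
}
```

The scanner only produces spans; applying colors is left to the caller, so the parsing cost and the drawing cost can be measured separately.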
Also another way to speed up your code would be to format the code that is currently displayed and then update the screen. Then have a threaded process formatting the rest of the code behind the scenes.
That's really it, but if you get a CS degree, they usually make you take at least one class on it. They teach you how to optimize your state machines and the mathematical model that lies behind them.
Ah, That's what I was guessing. So basically three flags, that control what's being done to the strings as they are being read in, and act accordingly. Well, I'll look into if that's any faster.
Well, I did this, and while it too works, it's still pretty slow..
Code:
private int formatState=0; //0 = Normal, 1 = Syntax, 2 = Comment (never set below), 3 = String
private void FormatString(string stri) //This will set above state, and actually insert into txtbox.
{
    //The idea is to parse the strings only if necessary.
    //First thing to do is to think about the priority of highlighting
    //Priority 1 is quotes, because even the block commenter is overruled by quotes
    //Priority 2 is commenter
    //Priority 3 is syntax
    codeRTB.SelectionColor = Color.Black;
    stri = parseSyntax(stri); //wraps each keyword in (char)225 markers
    for (int i = 0; i < stri.Length; i++)
    {
        char c = stri[i]; //index the string directly rather than calling ToCharArray() every time
        if (formatState == 0)
        {
            if (c == (char)34) //opening quote
            {
                formatState = 3;
                codeRTB.SelectionColor = Color.Red;
                codeRTB.SelectedText = "\"";
            }
            if (i + 1 < stri.Length && stri.Substring(i, 2) == "//") //If we aren't in string mode,
            //and we see this, then comment out the line and stop searching
            {
                codeRTB.SelectionColor = Color.Green;
                stri = stri.Substring(i, stri.Length - i);
                stri = stri.Replace(((char)225).ToString(), ""); //Kill any format leftover.
                codeRTB.SelectedText = stri;
                break;
            }
            if (c == (char)225) //start of a keyword
            {
                formatState = 1;
                codeRTB.SelectionColor = Color.Blue;
            }
        }
        else if (formatState == 3)
        {
            if (c == (char)34) //closing quote
            {
                formatState = 0;
                codeRTB.SelectedText = "\"";
                codeRTB.SelectionColor = Color.Black;
            }
        }
        else if (formatState == 1)
        {
            if (c == (char)225) //end of a keyword
            {
                formatState = 0;
                codeRTB.SelectionColor = Color.Black;
            }
        }
        //Quotes were already written above and markers are never displayed.
        if (c != (char)34 && c != (char)225)
            codeRTB.SelectedText = stri.Substring(i, 1);
    }
    codeRTB.SelectedText = Environment.NewLine;
}
private char[] separators = (" =-+/*.:<>/;[]{}\t()\"#&").ToCharArray();
private ArrayList sepList = new ArrayList(); //presumably filled from separators elsewhere; that code is not shown
private string parseSyntax(string str)
{
    //For test purposes, let's assume this string: private void parseSyntax=2
    int posinstr = 0;   //start of the current word
    int parseinstr = 0; //current read position
    string word = "";
    string primary = ""; //output: the input with keywords wrapped in (char)225
    if (str.Length == 0) return "";
    str += " "; //trailing separator so the last word gets checked
    do
    {
        //see if we've found a separator
        char curchar = str[parseinstr];
        if ((sepList.IndexOf(curchar) != -1) || (parseinstr == str.Length))
        {
            word = str.Substring(posinstr, parseinstr - posinstr);
            if (word.Length > 0 && isKeyword(word))
            {
                //MessageBox.Show(word);
                //The word is the tail of primary, so put one marker in front of it and one after.
                primary = primary.Insert(primary.Length - word.Length, ((char)225).ToString());
                primary += ((char)225).ToString();
            }
            posinstr = parseinstr + 1;
        }
        primary += curchar;
        parseinstr += 1;
    }
    while (parseinstr < str.Length);
    return primary;
}
private bool isKeyword(string str)
{
    return keywordCS.IndexOf(str) != -1;
}
Suspend and resume layout only affect resizing and redrawing of controls, from what I got out of my experimenting with it this afternoon.
It turns out, the original method using the RichTextBox.Find is still faster than the char-by-char method, but not because of calculation time.. The actual char by char writing to the RichTextBox, whether visible, disabled, suspended, or not, is incredibly slow.
I can load a huge file into the RTB in almost no time using standard StreamReader read methods, but char by char it's way too slow. I did some testing with my parsing method: if I use special characters to mark where colors would start and stop, rather than actually coloring and writing them to the RTB, and then write that string to the RTB all in black, the text loads at a speed that is not noticeably slower.
Conversely, if I read in the strings one line at a time, and add them to the RTB one character at a time, it's pretty much the same speed as it is when I am writing the formatted text one char at a time.
So, is there a way to build a Rich Text string with formatting and write it to the RTB a line at a time? I think if I could do that, the speed would go right up where I want it.
And, after doing one the slow way, I saved it to an RTF file and viewed it with Notepad. I can see all the formatting, and so on, so I thought I would try to build the formatted strings and insert them using the RichTextBox.Rtf property instead of using SelectedText. Unfortunately, I get a file format error every time I try to insert a string. I'm gonna keep playing with it and see what I can do..
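For anyone trying the same thing, here is a minimal sketch of building a whole RTF document as one string. The thread does not say what caused the file format error, so this is only one plausible shape that works: a {\rtf1 header, a color table, and \cfN switches. The class name RtfBuilder, its members, the font and the color order are my own assumptions. RTF treats \, { and } as special, so the sketch escapes them, and it assumes plain ASCII source text.

```csharp
using System;
using System.Text;

public sealed class RtfBuilder
{
    // Color table entry 0 is the implicit "auto" color before the first ';',
    // so the explicit entries start at index 1.
    private const string Header =
        @"{\rtf1\ansi\deff0{\fonttbl{\f0 Courier New;}}" +
        @"{\colortbl;\red0\green0\blue0;\red0\green0\blue255;" +
        @"\red0\green128\blue0;\red255\green0\blue0;}\f0 ";

    public const int Black = 1, Blue = 2, Green = 3, Red = 4;

    private readonly StringBuilder body = new StringBuilder();

    // Escapes the characters RTF treats as control syntax and turns
    // line feeds into \par, since raw line breaks are ignored by RTF readers.
    public static string Escape(string text)
    {
        var sb = new StringBuilder(text.Length + 16);
        foreach (char c in text)
        {
            switch (c)
            {
                case '\\': sb.Append(@"\\"); break;
                case '{': sb.Append(@"\{"); break;
                case '}': sb.Append(@"\}"); break;
                case '\n': sb.Append(@"\par "); break;
                case '\r': break; // the following \n already produced \par
                default: sb.Append(c); break;
            }
        }
        return sb.ToString();
    }

    // Appends text in the given color-table index.
    public RtfBuilder Append(string text, int colorIndex)
    {
        body.Append(@"\cf").Append(colorIndex).Append(' ').Append(Escape(text));
        return this;
    }

    public override string ToString()
    {
        return Header + body + "}";
    }
}
```

The finished string would be assigned once, for example codeRTB.Rtf = builder.ToString();, so the control is updated a single time instead of once per character.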
OK, so that worked well. now that I've figured out how to format RTF text (and not have pieces vanish?!@#) It's good and fast. Only one problem.
RTF codes use our friend the backslash : \ to signify that there are functions coming. So, if I want to actually have that character displayed, I have to send \\. No big deal, just like C# actually. The problem comes in with the string quotes being colored in certain cases.
for example:
string myString = "\"Bill\""; would in C# set myString to the value "Bill", quotes included. So if my state is switched off at the first escaped quote, Bill would be black rather than red.
i.e,
string myString = "\"Bill\"";
A more serious problem occurs if there is only one of these in a string, like this:
string mystring = "\" <- That's the quote";
Where that string would translate to: " <- That's the quote. Since there is an odd number of " in there, string mode would still be switched on at the end of the line, and it would look like this:
string myString = "\" <- Thats the quote";
nextline.programming = still.red;
etc,etc
until I hit another "
So, I thought I'd be smooth and say, OK, don't turn off the quote mode if this quote is preceded by a \. Well, that fixed 90% of the problem, but now I have another problem.
This is a valid string in C#:
string myPath = "C:\\Documents and Settings\\Administrator\\";
So, when the parser hits that last quote, it's now being ignored because the \ is in front of it.
Instead of looking back, I say when you hit a \ inside of a quote, note it. It always indicates that an escape is coming. So you automatically know the next char is escaped and you can ignore it for rules, so something like this:
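The code that went with this post is not in the thread, so what follows is only a sketch of the idea as described: remember that a backslash was seen, and let the next character pass without applying any rules to it. The names EscapeAwareScanner and FindStrings are hypothetical, and the sketch ignores comments, char literals and @ strings.

```csharp
using System;
using System.Collections.Generic;

public static class EscapeAwareScanner
{
    // Returns the start index and length of every "..." literal in text,
    // treating a backslash inside quotes as "the next character is escaped".
    public static List<(int Start, int Length)> FindStrings(string text)
    {
        var spans = new List<(int Start, int Length)>();
        bool inQuotes = false;
        bool escaped = false; // true when the previous char was an unescaped backslash
        int start = 0;

        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (!inQuotes)
            {
                if (c == '"') { inQuotes = true; start = i; }
                continue;
            }

            if (escaped) { escaped = false; continue; } // this char is escaped: no rules apply
            if (c == '\\') { escaped = true; continue; } // note it; the next char is escaped
            if (c == '"')
            {
                spans.Add((start, i - start + 1));
                inQuotes = false;
            }
        }
        return spans;
    }
}
```

With the flag, both "\"Bill\"" and a path ending in \\" close on the right quote: in the path case the second backslash is consumed as the escaped character, so the final quote is seen with the flag already cleared.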
Man, this crap's always harder than it looks. Damned C# Flexibility. Now I need to handle the @ symbol somehow too. Because, this is also a legal string..
string bill = @"\Bill\";
I guess if the @ character is seen, I can just go by the straight quotes, ignoring the escape characters (because in C# the escape \" isn't valid when you have the @ in front of it..)
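A small sketch of how the two kinds of literal might be told apart, assuming the scanner already knows where the opening quote is. StringLiteralScanner and FindClosingQuote are made-up names. One detail worth knowing: inside an @ string a literal quote is written as "", which the sketch skips explicitly (plain quote toggling happens to survive it too, since the pair switches the state off and straight back on).

```csharp
using System;

public static class StringLiteralScanner
{
    // Returns the index of the closing quote for the literal whose opening
    // quote is at openQuote, or -1 if the literal never closes.
    public static int FindClosingQuote(string text, int openQuote)
    {
        // An @ directly before the opening quote marks a verbatim string.
        bool verbatim = openQuote > 0 && text[openQuote - 1] == '@';

        for (int i = openQuote + 1; i < text.Length; i++)
        {
            char c = text[i];
            if (verbatim)
            {
                if (c != '"') continue; // backslashes mean nothing here
                // A doubled quote is an escaped quote, not the end of the string.
                if (i + 1 < text.Length && text[i + 1] == '"') { i++; continue; }
                return i;
            }

            if (c == '\\') { i++; continue; } // regular string: skip the escaped char
            if (c == '"') return i;
        }
        return -1;
    }
}
```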
Sweet, so it looks like I'm done with this part of the deal, unless you can think of another way to break it.
Bill
Last edited by conipto; Nov 19th, 2005 at 10:18 PM.