Hello.
The Trick has just told me that he managed to get Debug Symbols for VBAx.DLL (aka EB part in EB/Ruby terminology). Our further conversation led me to this topic and your discovery.
This is a crucial find for VB reverse-engineering, I must say.
As you know, I have spent some years trying to reverse-engineer VB6 as a whole product. Diving into the guts of msvbvm was relatively easy work, while digging into VBAx.DLL was really challenging: you have to guess the purpose of every function and every variable, and there are no hints except your own interpretation of what is going on in a particular piece of code.
So I was eager to match my own guessed function/variable names against the "official" ones, to check how accurate my assumptions were about what the functions do and which tasks they are meant to solve.
Less than an hour has passed since I opened your DBG file, but I've already matched dozens of entities with my own reverse-engineering results. I think the next few days or weeks will be very exciting.
So, as I mentioned before, I have discovered, reverse-engineered and described some 2000 or more functions/classes/variables from VBA6.DLL.
For obvious reasons they were given temporary (guessed) names, which differ from the authentic ones (now available from VBA5.DBG). However, for many of these functions I have very detailed articles describing each argument and the behaviour of the function, with C language pseudo-code illustrating what the original source code might look like.
It would definitely be useful if all the information I collected during these years could be merged with the information that can be extracted from this DBG.
Perhaps I should set up a public portal and open up all my reverse-engineering documentation, so that you can take a function/method from this DBG and read in my articles what it does and what arguments it takes?
One major problem is that the names in my work do not match the official, authentic Microsoft names (available in the DBG). They need to be matched manually, at least until I/we make a tool that can match names by analyzing/traversing the CFG/EFG (code flow graph/execution flow graph) of the two binaries (VBA5.DLL and VBA6.DLL).
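To make the idea of such a matching tool a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the CFG shapes below are invented toy data (not extracted from either DLL), and the fingerprint (block count, edge count, sorted in/out degrees) is just one simple label-independent summary. A real tool would certainly need stronger signals.

```python
from collections import Counter

def cfg_fingerprint(cfg):
    """Label-independent summary of a CFG given as {block: [successor blocks]}."""
    indeg = Counter()
    for succs in cfg.values():
        for s in succs:
            indeg[s] += 1
    blocks = set(cfg) | set(indeg)
    # (in-degree, out-degree) of every block, sorted so block names do not matter
    degrees = sorted((indeg[b], len(cfg.get(b, []))) for b in blocks)
    edge_count = sum(len(s) for s in cfg.values())
    return (len(blocks), edge_count, tuple(degrees))

def match_by_cfg(guessed, authentic):
    """Pair names whose fingerprint is unique on both sides and identical."""
    def unique_index(funcs):
        by_fp = {}
        for name, cfg in funcs.items():
            by_fp.setdefault(cfg_fingerprint(cfg), []).append(name)
        # ambiguous fingerprints are dropped and left for manual matching
        return {fp: names[0] for fp, names in by_fp.items() if len(names) == 1}
    left, right = unique_index(guessed), unique_index(authentic)
    return {left[fp]: right[fp] for fp in left if fp in right}

# Toy data only: the shapes are made up, the names come from my list.
guessed = {
    "mynmTscrEmitKeyword": {0: [1], 1: []},                          # straight line
    "mynmTscrEmitTwoKeywords": {0: [1, 2], 1: [3], 2: [3], 3: []},   # diamond
}
authentic = {
    "SingleWord": {"a": ["b"], "b": []},
    "DoubleWord": {"entry": ["l", "r"], "l": ["exit"], "r": ["exit"], "exit": []},
    "ResetGlobals": {"x": ["x"]},                                    # no counterpart
}
matches = match_by_cfg(guessed, authentic)
print(matches)
```

Functions whose fingerprint is shared by several candidates are deliberately left unmatched, so the manual work shrinks but does not disappear.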
Here are some observations I've already made (even though I got VBA5.DBG less than an hour ago):
1. Official names of functions and variables seem to be more brief than names assigned by me. However, they are less obvious and often confusing.
2. They didn't follow one consistent naming scheme. Some functions have prefixes indicating their relation to a particular component of the EB engine, and some don't. Some classes are named using the CCamelCaseClass scheme, while others use UPPERCASE names with lots of abbreviations.
3. For instance, there is a class which I called CSmallHeap. It turned out that it was originally called BLKMGR32 (bulk memory allocations manager).
4. There is a lack of consistency in how they refer to a single thing or concept in different places.
Let me give a small introduction to how the VB IDE represents BASIC code under the hood. The VB IDE is an example of a very clever technique where parsing, analysis and compilation of source code start at the moment it is entered into the IDE window. The VB IDE does not store source code as one big plain text, nor does it store it as an array of lines or rows. When you finish editing a line of code by pressing Enter or by moving the caret to another line with the mouse or the arrow keys, this line is parsed, analyzed and converted to a more compact binary representation. Internally, code isn't represented as text, lines of text or parts of lines of text. Additional internal structures and entries are created at the very moment you type a new line declaring a new Sub or variable. In other words, at any given moment the VB IDE already knows all the subs, functions and other entities defined in the code, in contrast to traditional compilers, which start to parse/analyze code only when a compilation/run pass is invoked.
I use the following terminology.
1. Line as text. A line of VB code exists as pure text only while it is being edited. As soon as you finish the line by pressing Enter or moving the caret, it undergoes a few transformations. The first one is parsing, which builds a PCR tree from the sequence of characters.
2. PCR tree. PCR stands for Parsed Code Representation. A PCR tree consists of nodes; a node may represent a whole statement, and its subnodes may represent details of the statement, literals, name references and so on. For every syntactically correct line of VB code a PCR tree can be built, and actually is built, during parsing. A syntax error encountered during parsing aborts construction of the PCR tree and leaves it unfinished. In any case, the PCR tree is a short-lived form of representation of a VB source line.
3. No matter whether the line was parsed successfully or there was a syntax error, yet another form of representation is used for long-term storage of VB code lines. I called this BSCR. BSCR stands for Binary Source Code Representation. Before I discovered BSCR, I thought PCR was actually used for long-term source code storage. Although PCR is "binary" too, there are major differences between PCR and BSCR. PCR is short-lived and BSCR is for long-term storage. A PCR tree is made of nodes, each node being a structure containing 32-bit pointers to child PCR nodes (if any). A PCR tree can therefore be rearranged or inserted into another PCR tree (so that it becomes a subtree) just by setting one pointer. In contrast to PCR, BSCR is a linear sequence of 16-bit words (I call them BSCR entities). There are no pointers in BSCR. The relationship between entities in BSCR is encoded by the order in which they lie in memory rather than by pointers to child elements. Therefore BSCR data can easily be moved to a new location in memory. While PCR trees can only represent syntactically correct lines of VB code, BSCR can also represent invalid code. And while comments and labels are not part of PCR trees, they are always part of the BSCR data for code lines.
4. When source code must be rendered on screen, saved to a file on disk or temporarily represented as text for a search (Ctrl+F), the reverse transformation takes place. I call this TSCR. TSCR stands for Text(ual) Source Code Reconstruction. Sometimes I call the initial form of VB code representation TSCR too (in that case TSCR is more likely to mean Textual Source Code Representation). TSCR does more than reconstruct text (a sequence of characters that can be rendered on screen) matching the original line of source code: it also generates a syntax-highlighting map and provides the caller with a so-called capturing feature. Given a logical reference to some sub-expression within the line, you can determine which range of characters it spans (used for mouse-hover tips, yellow highlighting of the current statement during step-by-step execution and so on).
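To illustrate the difference between the pointer-based tree and the linear 16-bit form, and the capturing feature of the reverse pass, here is a toy sketch in Python. It is not the real format: the tag values (0x1000, 0x2000, 0x3000), the postfix ordering, the class PcrNode and the functions flatten and reconstruct are all hypothetical, invented only to show the principle. Parentheses and operator precedence are ignored.

```python
import struct

OP_LIT, OP_NAME, OP_BIN = 0x1000, 0x2000, 0x3000   # hypothetical tag bits
OPERATORS = ["+", "-", "*"]

class PcrNode:
    """Tree node holding references to its children (the pointer-based form)."""
    def __init__(self, kind, value, children=()):
        self.kind, self.value, self.children = kind, value, list(children)

def flatten(node, names, out):
    """Emit children first, then the node itself: order replaces pointers."""
    for child in node.children:
        flatten(child, names, out)
    if node.kind == "lit":
        out.append(OP_LIT | node.value)
    elif node.kind == "name":
        if node.value not in names:
            names.append(node.value)
        out.append(OP_NAME | names.index(node.value))
    else:
        out.append(OP_BIN | OPERATORS.index(node.value))
    return out

def reconstruct(words, names):
    """Rebuild text from the linear form; also map word index -> (start, end)."""
    stack = []   # entries: (text, {word_index: (start, end)})
    for i, w in enumerate(words):
        tag, arg = w & 0xF000, w & 0x0FFF
        if tag == OP_BIN:
            (rt, rs), (lt, ls) = stack.pop(), stack.pop()   # right operand is on top
            text = f"{lt} {OPERATORS[arg]} {rt}"
            shift = len(lt) + len(OPERATORS[arg]) + 2       # right operand moves right
            spans = dict(ls)
            spans.update({k: (a + shift, b + shift) for k, (a, b) in rs.items()})
        else:
            text = str(arg) if tag == OP_LIT else names[arg]
            spans = {}
        spans[i] = (0, len(text))
        stack.append((text, spans))
    return stack.pop()

# Tree for the expression  a + b * 2
tree = PcrNode("op", "+", [PcrNode("name", "a"),
                           PcrNode("op", "*", [PcrNode("name", "b"), PcrNode("lit", 2)])])
names = []
words = flatten(tree, names, [])

# No pointers inside, so the data survives a plain byte copy to a new location.
raw = struct.pack(f"<{len(words)}H", *words)
moved = list(struct.unpack(f"<{len(words)}H", raw))

text, spans = reconstruct(moved, names)
start, end = spans[3]            # word 3 is the '*' entity
print(text, "|", text[start:end])
```

The span table returned by reconstruct plays the role of the capturing feature: given the index of an entity in the linear form, it tells which characters of the rebuilt line that sub-expression occupies.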
With all the names that I assigned (by guessing) during my work, I use a prefix (Parsed/Pcr/Bscr/Tscr) to indicate which mechanism the function relates to. The authentic names do not follow this rule, and sometimes the same thing is called differently in different places, which causes confusion (at least, in my opinion).
The thing that I call PCR turned out to be called PT (parsing tree, obviously?).
The next stage, which I call BSCR and BSCR entities, is sometimes referred to as Opcodes and sometimes as Pcode. This annoys me, because the term 'PCode' is already used for the byte code of the VB virtual machine, and BSCR is definitely not the same as P-code. Opcodes isn't a very suitable name either, because it creates confusion with the opcodes of P-code or of x86 native code (yet again, BSCR entities are neither P-code opcodes nor x86 machine code opcodes).
All functions involved in reconstructing text from BSCR turned out to have a "Ulist" prefix in their names. What is the U in Ulist? Who knows...
Here are some examples of matching my names against the authentic names:
- Class CSmallHeap turned out to be BLKMGR32
- Class CVbaProject turned out to be GEN_PROJECT
- Class CBscrLines turned out to be OPCODE_MGR
- mynmParserParseLine turned out to be static method parser::parseOneLine
- mynmParserCommitPcrNode turned out to be ptCreateOperNode
- mynmTscrReconstructLine turned out to be ListOneLine
- mynmTscrProcessLineBscr turned out to be ListOpcodes
- tblBscrToTextHandlers array turned out to be g_fnListRules
- mynmTscrEmitParticles turned out to be ListRule
- CBucketTree class turned out to be RECLIST
- CBucketTree::TraverseTreeOfElements turned out to be RECLIST::FTraverse
- mynmTscrEmitKeyword turned out to be SingleWord
- mynmTscrEmitTwoKeywords turned out to be DoubleWord
- mynmTscrStartNewStatement turned out to be ResetGlobals
- mynmTscrHandleDefXxxStatement turned out to be UListDeftype
- mynmTscrHandleForEachStatement turned out to be UListForEach
- mynmCbLengthOfBscrVl turned out to be CbVariableSizeOpcode
- mynmXstackCleanUp turned out to be parser::parseRsrcRelease
- CMSNR class (module-specific name-related, represents module scope actually) turned out to be TYPE_DATA class (very misleading name...).
- mynmBscrWriteCodeLine is actually parser::generatePcode.
- mynmParserParseOptionStatement is actually called parser::parseOptionStmt
and so on. Hundreds or thousands of such entities.
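Until there is an automatic matcher, even a plain lookup table helps when reading my old notes next to the DBG. A minimal sketch in Python, using a few pairs copied from the list above; the helper to_authentic and the sample note are made up for illustration:

```python
import re

# Pairs copied from the list above: guessed name -> authentic name.
GUESSED_TO_AUTHENTIC = {
    "CSmallHeap": "BLKMGR32",
    "CVbaProject": "GEN_PROJECT",
    "CBscrLines": "OPCODE_MGR",
    "mynmParserParseLine": "parser::parseOneLine",
    "mynmTscrReconstructLine": "ListOneLine",
    "mynmTscrProcessLineBscr": "ListOpcodes",
    "CBucketTree": "RECLIST",
    "CBucketTree::TraverseTreeOfElements": "RECLIST::FTraverse",
    "CMSNR": "TYPE_DATA",
}
AUTHENTIC_TO_GUESSED = {v: k for k, v in GUESSED_TO_AUTHENTIC.items()}

# Longest names first, so a qualified method name wins over its class name.
_names = sorted(GUESSED_TO_AUTHENTIC, key=len, reverse=True)
_pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, _names)) + r")\b")

def to_authentic(text):
    """Replace every known guessed name in text with its authentic name."""
    return _pattern.sub(lambda m: GUESSED_TO_AUTHENTIC[m.group(0)], text)

note = "See mynmTscrReconstructLine, CBucketTree::TraverseTreeOfElements and CSmallHeap."
print(to_authentic(note))
```

The reverse dictionary AUTHENTIC_TO_GUESSED works the other way round: start from a name found in the DBG and find which of my articles describes it.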
I understand, however, that to you neither my names nor Microsoft's original names mean much.
Anyway, I am amazed by this DBG file. Even though the original names look unfamiliar to me, and misleading in some cases, this is a great source of information about those functions and classes that haven't been touched by my reversing process at all.