Feh. Reviewed string stuff with scc. It looks great. He needs to talk with warren more to make sure that everybody buys in. Helped hyatt fix his last to PDT+ bugs. Started working on a document for XPConnectifying the DOM.
Put up a document on lanugage-agnostic script support. It's pretty feeble, right now; mostly just recycled emails and stuff coughed up from a bug that I'd filed on myself. I need to start digging into the code to grok what's going on.
Tinkered with btek all afternoon to get its hard drive upgraded.
Late last night I had a moment of clarity with respect to the problem with the template builder and assertion ordering (the whole RETE thing was a bit of a red herring).
Specifically, we'll maintain a state machine that represents the template rule match state for an RDF resource in the template builder's database.
The state machine will be constructed from the
<rule> tags in the <template>.
The "alphabet" will be the condition tests that a rule needs to make;
in other words, the state machine will test the condition of a rule to
change from one state to another. The set of states and transition
function will be constructed such that, when a rule matches, the state
machine will be in an "accepting" state.
The construction will be as follows.
To determine which rules are matched, construct a string from all of the conditions that are currently true. If an accepting state can be reached in Mr from M's start state, then rule r is currently matched.
Two hour meeting yesterday to go over tree control for the next six months. No concrete action items. I think jar was going to take the "green, green, red, yellow, red, red" plan to the PDT meeting and see if it was reasonable. chofmann was going to work up something that we could post to describe "green", "yellow", and "red" states. I should bug 'em to make sure shit gets done.
Sunday night, I managed to get an implementation for a finite automata, as well as code that converts an NFA to a DFA. Monday morning, I built code that will do the original construction that I described Friday from a set of rules. The bad news is, it pretty much is O(2n). Ack.
So, I tinkered with some alternate constructions yesterday without much luck. I started considering trying to just use the NFA directly, maintaining the current state in a bitmap, and then started to think maybe it's best to go back to the drawing board here.
Mozilla meeting. Feh.
String design review with scc, warren, and rickg. Went pretty well: I
took some notes just in case anyone forgets what they're supposed to
be doing. The plan-of-record right now is that scc will land his
interface changes plus mods to nsString this week, with
rickg to land nsStringX next week or the week
after. (This was a point of contention, actually.) We didn't get into
any of the nitty-gritty details of why rickg and scc were fighting,
bug hopefully if this crap arises again in the future, we'll have the
stringjockeys alias to deal.
scc and I talked a bit about this rule matching problem. No great idea
with respect to the DFA construction, unfortunately. One interesting
realization was that there is an "implicit" condition on the rule;
specifically, (?a ^child ?b) (replace ^child
with whatever the active containment properties are). So a rule
really looks something like:
{(?a ^child ?b)
(?b ^is-container true)
(?b ^can-subscribe true)
==<
(build "<blah blah blah />")}
Th at is pretty important, because even if I did work out a DFA construction that used a reasonable amount of space, I'd still have been screwed by the fact that I didn't remember the partial state from the "implicit" rule. So maybe RETE wasn't such a dead-end after all...
I just can't leave this DFA idea alone. Pulled out my old Logic Synthesis and Verification Algorithms text from school. There were some clever schemes in there for encoding states of DFAs. I thought that maybe I could use one of those to remember the partial state in an NFA. All the NFAs that I construct seem to have a very reasonable number of states and a reasonably sized state transition table. I guess I'm thinking that it would be cool if I could work out some bit-twiddling to handle the non-determinism.
I'd like to make it eventually be possible to do something like:
<rule> (?a ^related-sites ?b) (?b ^instance-of rdf-sequence) (?b ^member ?c) ==< (<treechildren> <treeitem uri="?c" /> </treechildren>) </rule>
In other words, let you do arbitrary graph traversal to construct a content model. Maybe the XUL syntax would be something like:
<rule>
<conditions>
<condition source="?a" property="http://blah#rdf-foo" target="?b" />
<condition source="?b" property="http://blah#rdf-bar" target="rdf-sequence" />
<condition source="?b" property="http://blah#rdf-baz" target="?c" />
</conditions>
<action onaction="doSomeJS();" >
<treechildren>
<treeitem uri="?c" ... />
</treechidren>
</action>
</rule>
Feh. Maybe I'll make love to the template stuff.
Talked to hyatt for a couple hours yesterday afternoon about doing the RDF for skins. It seems like he has a lot done already, and is just looking for a bit of help with respect to laying out the RDF graph. He'd like me to help, but since I'm strapped for time over the next few days, I'm thinking he'll probably be able to finish up most of it himself.
Made it to the Sloan library last night and picked up Forgy's paper on RETE. Read it, noodled on it for a bit. Seems like the hard part is going to be figuring out how to use the network to drive the initial construction of the content model. Once that's done, using the network to handle incremental updates should be pretty straight-forward.
Updated the Performance Laundry List with fifty or sixty new "perf" bugs that had appeared since February 4th in preparation for meeting with Intel tomorrow.
Wrote a quick document called "Mozilla Needs Linux Help". I'm not sure where (or even if) I should post it. I'll bug somebody later.
Mitchell and I prepared for battle with Intel tomorrow.
Talked to David Ascher of ActiveState fame about Python stuff, well, mostly jband talked to David about Python and XPConnect stuff.
Kicked out some RDF/XML files for hyatt's chrome registry stuff.
Fixed a last-minute Beta bug (maybe) that has to do with incorrectly re-serializing RDF/XML out when the data is not straight ASCII.
Trip up to Intel to talk about Getting Involved. Lots of notes.
Finally read Miranker's paper on TREAT: as far as I could
tell, the main idea is to avoid storing the computed joins ("beta
memories"), and just use locality to look up the results in the
discrimination nodes ("alpha memories") when you needed them. It looks
like something like this will be the right thing to do here.
There are a couple of mismatches between the kind of system that RETE and TREAT are applied to, and what we've got. First of all, both RETE and TREAT seem to deal with frame-like structures where you hold a reference to the frame in the alpha memory, and the frame percolates downward through the discrimination network when its slots match. We've got a much looser "semantic network" kind of representation, where maintaining all of the assertions about a particular node in the network may become unwieldy.
I think this may become a problem if we have several discrimination nodes "stacked up" in a linear fashion: although we mail fail the first test, we'd pass the second and third. This means that we either need to do a combinatorial "network" like I was thinking about for the DFA representation, above, or that we'll need to return to the graph to run tests each time an element "falls through" to a new discrimination node.
Secondly, we're not going to get all of the working memory elements "added" to the network at initialization time. Instead, the network will need to be built up incrementally, if possible. For example:
(Root)
|
+------+-------+
| |
| v
| (?c color blue)
| |
v v
(?a child ?b) (?c type foopy)
| |
+->(?b == ?c)<-+
|
v
(Rule 1)
Here, we start with the root resource in a template, proceed down matching the "child" predicate, and then potentially need to proceed back up through "type" and "color" to backfill the join criteria. That isn't all that hard to do, but it'd be nice to do it in such a way that avoided continual recomputations of the alpha memories (specifically, remembering things that don't match).
Okay, so let's play around with a "real" example, the "new folder picker".
1: {(?a child ?b) (?b cancreatesubfolders true) (?b iscontainer true) (?b isempty false)}
2: {(?a child ?b) (?b cancreatesubfolders false)}
3: {(?a child ?b) (?b cancreatesubfolders false (?b iscontainer true) (?b isempty true)}
4: {(?a child ?b) (?b cancreatesubfolders true)}
(Root)
|
+--------------------+----------+-------------------+
| | |
v v v
(?a child ?b) (?c cancreatesubfolders true) (?d cancreatesubfolders false)
| | | | | | | |
| | | +-----|----------------|----------------|-------+ |
| | +------+ | | +---+ | |
| +-------+ | | | | v v
| +-|---+ v v (?b == ?d)
| | | | (?c iscontainer t) |
v v | | | | v
(?b == ?c) | | +-----+ +------+ (2)
| | | | |
| | | v v
v | | (?c isempty f) (?c isempty t)
(4) | +------------|----------------+ |
| | | |
+->(?b == ?c)<-+ v v
| (?b == ?c)
v |
(1) v
(3)
Looking at the above makes me think that we're missing the piece that
connects it into the content model. There are two problems. First, how
do we "seed" the network to get it started. Second, if a new assertion
arrives (foo cancreatesubfolders false), how do
we go back "up" through (?a child ?b)?
So let's say we extend the syntax to allow addressing elements by
id; e.g., rule 1 would be written like:
<rule> <element attribute="id" value="?a" /> <rdf source="?a" property="child" value="?b" /> <rdf source="?b" property="cancreatesubfolders" value="true" /> <rdf source="?b" property="iscontainer" value="true" /> <rdf source="?b" property="isempty" value="false" /> </rule>
This might make the network look like:
(Root)
|
+--------------------+----------+-------------------+
| | |
v | |
(element id ?a) | |
| | |
v v v
(?a child ?b) (?c cancreatesubfolders true) (?d cancreatesubfolders false)
| | | | | | | |
| | | +-----|----------------|----------------|-------+ |
| | +------+ | | +---+ | |
| +-------+ | | | | v v
| +-|---+ v v (?b == ?d)
| | | | (?c iscontainer t) |
v v | | | | v
(?b == ?c) | | +-----+ +------+ (2)
| | | | |
| | | v v
v | | (?c isempty f) (?c isempty t)
(4) | +------------|----------------+ |
| | | |
+->(?b == ?c)<-+ v v
| (?b == ?c)
v |
(1) v
(3)
This feels a little bit like we're opening Pandora's box with respect
to element matching. We can do quick lookups for elements with an
id attribute, but what if somebody specified something
else? (Also, strictly speaking, this would not suffice to "seed" the
template, because you'd want the ref attribute.) Maybe a
better way to write it would be:
<element uri="?a" />
This leaves the door open for adding attribute and tag conditions, as well.
Feh. I could meander about with respect to syntax; maybe that's not such a good idea right now. Let's get back to matching.
I think there something wrong with the way I've drawn the above
network. I think that I can't have a discrimination node with more
than two parents; e.g., ?c iscontainer t). How
could the matches computed by one parent be shared by the other? So
let's try again...
(Root)
|
+----------------+----------------+
| |
v v
(element id ?a) (?c iscontainer t)
| | |
v v v
(?a child ?b) (?c cancreatesubfolders true) (?c cancreatesubfolders false)
| | | | | | | |
| | | +-----|----------|-------------------------|----+ |
| | +------+ | | | | |
| +-------+ | | | | v v
| +-|---+ | | (?b == ?c)
| | | | | | |
v v | | | | v
(?b == ?c) | | v | (2)
| | | (?c isempty f) |
| | | | v
v | | | (?c isempty t)
(4) | +------------|----------------+ |
| | | |
+->(?b == ?c)<-+ v v
| (?b == ?c)
v |
(1) v
(3)
Started banging out some code to do the matching. I have this thing
called the MatchState that will basically build the
conflict set. Every node in the network is an InnerNode,
except for the InstantionNodes that are that the
leaves of the network. DiscriminationNodes are used to
run individual "tests" (e.g., (?c isempty f)), and
JoinNodes are used to propogate the results of joins.
I'm thinking that what should happen is that, when I'm given a new
"working memory element" (which I'm calling Element for
now), I'll hash into the network to find the
DiscriminationNodes that would be affected by the
changing Element. (There may be some redundancy in the
network if it's built in a degenerate fashion, or the rules are simply
complicated enough so that there can't be sharing.)
For each node that I initially discover as being affected by the test,
I'll first propogate up through the parent chain to the
root. I'll verify that the new Element actually passes
the set-inclusion tests for each parent
DiscriminationNode, stopping if it turns out that the
Element fails.
If we can trace a path back to the root, then that means that the
Element has satisfied all of the conditions to this
point. Now, we propogate it downward through the network.
While going down, we may encounter a JoinNode. We'll
either enter through the left- or the right-branch of that node, and
will need to propogate set inclusion back up the other branch to
verify that we've actually satisfied the conditions of the join. If
this fails, we halt. If it succeeds, we continue to propogate the
addition down the network.
As we process JoinNodes, we'll begin to build up variable
binding information that we'll need to use for the instantiation code.
Right. So after some more thought last night, I think I am starting to realize that we'll need to build up binding information for the full syntax. For example, you may have a rule whose conditions look like this:
<rule> <element id="?a" /> <rdf source="?a" property="child" target="?b" /> <rdf source="?b" property="color" target="?c" /> </rule>
That is, you want to bind the color property into
?c and pull it through to the RHS. So that network would
look something like:
(Root)
|
+---------+---------+
| |
v |
(element id ?a) |
| |
v v
(?a child ?b) (?c color ?d)
| |
+--->(?b == ?c)<----+
|
v
(1)
Hrmph. I'm not sure I'm doing this right. I'm wondering if I'd really end up with networks that looked like this:
(Root)
|
+--------------+---------------+
| | |
v v v
(element id ?a) (?b child ?c) (?d color ?e)
| | |
+->(?a == ?b)<-+ |
| |
+-------->(?c == ?d)<---+
|
v
(1)
This network is a bit more scary with respect to propogating tests
back "up"; for example, if I'm given the assertion (foo color
blue), I need to skip a level to compute how to bind
?c! Ack!
So let's say that, as we construct rules, each variable in a rule gets
a unique identifier (like I've done in the last network). That means
that part of negotiating the network involves assigning bindings to
these variables. So let's take the (foo color blue)
example and try to work out what would happen:
(?d color ?e)
node. We assign bindings {(?d foo) (?e blue)}
(?c == ?d). Here, we can add to the bindings (?c
foo).
(?a == ?b).
(?a == ?b) needs to
determine if there is anything consistent in the database with the
bindings {(?c foo) (?d foo) (?e blue)}. It needs to do a
"join" operation over the datasets of its ancestors. Okay, complexity
alert: let's talk about this below.
?a and ?b, which can then be returned to
(?c == ?d).
(?c == ?d) can then propogate the bindings downward to
the instantiation node (1), and voila, in goes the rule
to the conflict set. [As an aside, this makes me realize that
we'll need to always remember the entire conflict set: newly matched
rules don't automatically fire.]
Allright, so let's talk about computing that join. There are several problems here.
First, both sets are extremely large: (element id
?a) is every element currently in the content model; (?b
child ?c) is probably a bit larger, because it includes
elements not in the content model. A naive join
implementation will degenerate to O(n2) (or
O(mn), however you want to look at it). A better implementation
could look at the current bindings and compute subsets to make
m very small (like, one!), and then do some kind of
reverse-indexing on n to make it so that the whole operation
ended up being two hash table accesses or so.
Second, although we can compute ?b from ?c,
this will require us to use the RDF GetSources() API,
which is inconsistently implemented in datasources.
I guess what's going unsaid here is that I was hoping that the discrimination nodes could be implemented in such a way that they wouldn't actually have to store anything. Problem The Second, above, makes it very difficult for me to do that.
Forging onward through adversity, let's say that our nodes advertise
what bindings they participate in. This is a first step towards
solving our "Big m" problem. (?b child ?c) indicates the
transitive closure of variables that it binds (both itself, and its
ancestors). We see that he binds ?c, where (element
id ?a) doesn't bind anything that we have data for yet. So up
we go to (?b child ?c).
Now maybe we have some way of asking a datasource whether or
not it can efficiently do GetSources(). (Or, if it can do
GetSources() at all!) If not, then we'll just
suck it up and maintain an "alpha memory" in the node. So, in this
case, maybe it's an in-memory datasource that can quickly answer our
query, and we just do GetSources(child, foo) to figure
out the set of parents. Voila, a binding for ?b and an
m of one!
Then over to (element id ?a). Since the join node can
bind ?a to ?b, we should be able to do a
quick lookup in the document's element map to see if there is actually
some content in the content model that we can whack. (We'd also need
to verify that the content is something under the purview of the
current builder, of course.)
Another Hard Part is dealing with the fact that we might've found
several bindings for ?b; that is, ?c
might've been the child of several different nodes in the RDF
graph. Since that's the case, it seems like we'll need to somehow beef
up our notion of the "current bindings". I'm guessing that the naive
way to do this is the correct way: specifically, keep track of sets of
products explicitly, rather than trying to create a product of
sets. For example,
{(?a foo) (?b bar)}
{(?a foo) (?b baz)}
Rather than:
{(?a {foo}) (?b {bar, baz})}
My intuition here is that you could construct a case where one of the products-of-sets was not valid.
Okay, much later. I have a first cut at an implementation. Still need
to work out how to do the big mega join thing, but I think I have a
bit of a clue. Still some klunkiness with regards to the
JoinNode and what assumptions we can make about variable
binding when we arrive at the node. Put a ton o' comments into the
code to talk about it. Hopefully I'll work out the kludges.
I made over-use of the ->* operator. I think I should
just revert that to static callbacks that take the node as a
parameter. Maybe. I dunno.
So I've been thinking more about this variable binding stuff. I have this idea that it's probably safe to assume that, in a set of bindings, the same variables have been bound (albeit to different values). The cool thing about that is that you can look at the variables that have been bound in the first binding, and use those to deduce what's left to bind.
It would also mean that you could construct bindings doing some kind of copy-on-write mechanism; e.g.,
+-- (?c bop) <-- (?d bop)
|
(?a foo) <-- (?b bar) <--+
|
+-- (?c bip) <-- (?d bip)
I'm kinda suspicious that this may be required for fan-out as we go down the network. For example, take a discrimination node that has two children: one set of bindings may need to be worked out for the first kid (which winds its way down to rule one), and another set of bindings worked out for the second kid.
Okay, thinking about this a bit more, I'm beginning to see that I may be prematurely optimizing with this match state stuff. The "upward" and "downward" operations seem distinctly different here: going up, we are strictly constraining an existing set of bindings; going downward, we are dealing with fan-out and possibly alternative expansions on the same binding.
Okay, got things "working" in the more complex case that I illustrated
above (after making everything recursive instead of iterative, and
doing a brain-dead implementation for the Instantiation
and InstantiationSet objects). A long way still to go,
but the join stuff seems to be hanging together -- at least in the
simple case I mentioned above. Whee!
Gotta figure out how to do the conflict set. There are a couple of things it needs to be able to do.
Hrm. I'm trying to figure out how to find all of the active rules for a given "master URI" so that I can sort them and ensure that the one with the highest priority is fired. Let's say I have a rule that looks like this:
<rule nc:CanSubscribe="true" iscontainer="true" isempty="false"> ... </rule>
<rule> <element id="?a" /> <rdf source="?a" property="child" target="?b" /> <rdf source="?b" property="nc:CanSubscribe" target="true" /> <rdf source="?b" property="iscontainer" target="true" /> <rdf source="?b" property="isempty" target="false" /> </rule>
How do I know which variable binding is important to "identifying" the
rule? Or is it even correct to be thinking about a variable binding
identifying the rule at all? It seems like maybe I could look for all
bindings with identical matches for ?a and
?b (e.g., because of the property="child"
stuff), but that'll break down as soon as you introduce a rule that
"skips" children.
Maybe we have to do it based on the RHS of the rule. We could look at
the "topmost" uri="..." magic, figure out its binding,
and look for all rules with that binding? That should work in the
simple world where all the rules will likely share the same prefix
(the first two conditions of the expanded rule, above). We could
always look for ?b.
So maybe a more correct thing to do would be look at a new
instantiation, figure out how it's bound to the uri="..."
magic, then figure out amongst all the rules, how each of
their magic is bound. For example, our rule above bound it to
?b. Another rule may have uri="..." bound to
?q. So we grovel through the conflict set searching for
bindings of the newly instantiation URI to ?b and
?q.
Hacking hacking hacking. Something I need to do: get rid of these "default" frigging containment properties. They need to die die die in favor of explicit ones. The only default thing I want to have left when the smoke clears is RDF containers.
Allright, I think I have the LHS stuff pretty much working (at least, to a first order approximation). Now I need to hook up the conflict set and the RHS and see what happens.
Woohoo! Got the bookmarks window comin' up baby. I get an assertion in
the filesystem datasource as soon as I try to open a folder, but hey!
Not bad. Anyway, I've been littering the code with
XXXwaterson that I need to remember to go strip out.
On the drive home I started thinking about what I need to do to determine which rules are matched. I think that I need to maintain a map that is keyed by the content node beneath which we're building and the URI of the "pivot" node. As we propogate changes through the network, we need to remember the <content, URI> pairs for newly matched rules. Then, we'll examine the set to see which rule has the highest priority. In the incremental update case, it may be useful to remember if anything was matched before, so we know whether or not to nuke old content.
Boy, this makes me realize that we'll also need to recursively propogate incremental updates. If a newly matched rule causes different to be created, we need to nuke the element's kids and re-create. The new kids may be constructed differently; e.g., because of a condition that depends on the parent tag. Not rocket science, but something to remember.
Spent most of the evening banging through string stuff with scott.
Struggling with the conflict set some more.
I got asynchronous additions working (for the most part) today, and
was able to re-factor a lot of the code that lived in the old
OnAssert, OnUnassert, and
OnChange. I figured out the difference between
Iterator and ConstIterator, and made those
translations. I'm pretty much using Iterators everywhere
now. What a C++ geek!
Argued with jar and warren about "Netscape Confidential" bugs, the Mozilla-vs.-Netscape cage match du jour.
Triaged all my M14 and M15 bugs. Yay!
So the problem that I'm staring down right now has to do with rule
retraction. I've got an OnUnassert coming in
with a source, property, and target: how does this affect current
conflict set? My initial (broken) stab at this was to see if a node
CanPropogate the assertion, and if so, remove each
instantiation in the conflict set that contains its "seed" bindings.
This ends up being too aggressive. For example, just because I can bind to a parent element, doesn't mean I should remove every instantiation that has to do with the parent element. I guess I was equating a binding with a working memory element, which is wrong.
So if I wanted to pretend that I had working memory elements, what I'd
have to do is remember which assertions actually matched in an
instantiation, and index the conflict set that way; e.g.,
<foo, child, bar>. Then when this
assertion comes through on an OnUnassert, I'd just nix
any instantiations that "contained" it. This gets a bit trickier with
respect to all the "computed" working memory elements that some of the
test nodes "create". For example, emptiness and containerhood.
Allright, so here's an idea. Say we have the following network:
(root)
|
v
(content ?a)
|
v
(?a ^id ?b)
|
v
(?b ^child ?c) [A]
|
+--------+--------+
v v
(?c ^color blue) (?c color red) [B]
| |
v v
[R1] (?c empty true) [C]
|
v
[R2]
And the following instantiations:
[R1]
1: {(?a node1) (?b parent1) (?c child1)}
2: {(?a node1) (?b parent1) (?c child2)}
3: {(?a node2) (?b child3) (?c child4)}
[R2]
4: {(?a node1) (?b parent1) (?c child3)}
Now, if I remove (parent1 ^child child2), I'd like to see
instantiation 2 get nixed. If I remove (child3
^color red), I'd like to see 4 get nixed. So dig
this. Let's sayI ask the test node if it matches the
assertion without doing any backend lookup, and got the
variable bindngs back. Then I could remove any instantiations from the
conflict set that properly contains those variable bindings. (By
"properly contain", I mean, that has all of the variable bindings
returned from the test node, not just some of them.)
For example, say I get
OnUnassert(parent1 ^child child2). Node
A is the only test that will pass given these values, and
will bind ?b to parent1 and ?c
to child2. The only instantiation that contains both
these bindings is 2. So we can safely remove it from the
conflict set.
Similarly, OnUnassert(child3 ^color red). Node
B is the only node that will pass this test. It will bind
?c to child3. It seems that it is now safe
to remove instantiation 4 from the conflict set.
So this works out fairly well for the container-membership and
property-value tests, but I'm not so sure about the container
instance tests. Let's say we get
OnUnassert(child3 ^child child4), which will
make child3 be "empty". It's easy to see how we'd remove
instantiation 3 from the conflict set; however, how would
we remove instantiation 4? (And worse, what if there was
another rule with a condition (?c empty false)
which would all of a sudden be matched?)
If we bend the rules a little bit, and say that the container-instance test can be clever, and is allowed to check the graph, then maybe we can circumvent the first problem.
But what about the second problem? Ack. I'll work on that later.
Okay so on to removal from the conflict set. Right now I have two maps
in the conflict set. The first maps a binding to the set of
instantiations in which that binding participates. The second maps a
binding pair
<content-binding, member-binding> to the set
of instantiations in which that pair participates. The second table is
used to identify a "cluster" of rules that compete amongst one another
to match. (It's redundant, because you can certainly filter the
results from the first map to determine which rules are matched.)
Typically, the maps are used as follows:
<content-binding, member-binding> pairs.
Awright. It seems like my plan almost worked, but not quite. See, by
the time I get the OnUnassert, I can't just run
FilterInstantiations because the test will fail in
exactly the case where I want it to
succeed. FilterInstantiations goes to the RDF graph and
says, "hey, is this instantiation consistent?" Which it won't be,
because we've just revoked it.
The problems with relying on CanPropogate by itself are
1) you'll run the risk of yanking too many instantiations on
a partial match, and 2) even if you do get a full match, you don't
actually condition anything on the test itself (e.g., the
container-instance tests).
Feh. Ok, here's the deal. Let's say I add a "negated" argument to
CanPropogate, which means, could you propogate this if it
we're true. Or something to that effect.
Oh wow. I just realized that I totally fucked this up. I need to remember working memory elements that actually match, not just the bindings. Doh! An example, please.
Say we have the following graph:
(A)
|
+-----+-------------------+
| | |
child child child
| | |
v v v
(B) (C)--color-->blue (D)--color-->red
| | |
color child color
| | |
v v v
red (E)--color-->blue blue
And these rules:
1: (content ^id ?1) (?1 ^child ?2) (?2 ^color blue) (?2 ^empty true) 2: (content ^id ?1) (?1 ^child ?2) (?2 ^color blue) 3: (content ^id ?1) (?1 ^child ?2) (?2 ^color red)
This example breaks the bindings-only algorithm that I was trying to
use. Specifically, removing (D ^color blue) would cause
all instantiations containing (?2, D) to be
nuked, "hiding" the fact that (D ^color red) could be
used to match rule 3.
What this means (I think) is that the Instantiation now
needs to remember the "working memory elements" that have been used to
match, as well as the variable/value bindings. Since we don't really
have "working memory elements", the trick will be to somehow use the
CanPropogate mechanism to figure out how to turn a lone
RDF assertion into a working memory element that can be used to yank
instantiations out of the conflict set.
It also means that as we propogate tests downward through the graph,
we'll also be extending the instantation objects with "working memory
elements" that trigger matches. I'm not sure I know what will happen
in the case where we have to Constrain back up the
graph. Man, am I fucked.
Allright, I think that the way to do this is not to propogate
two lists around (as I was first inclined), but rather to treat the
working memory elements as "support" for a Binding in an
Instantation. My intuition is that this will make a lot
more sense when handling joins. I should probably work out why I think
that.
There are two cases we need to deal with. The first is the "constrained" case where a propogation is coming down from the left or right ancestor. One of our variables will be bound, and we need to ensure that the other ancestor has consistent bindings. The second is the "unconstrained" case, where a propogation is coming up from a descendant, and neither our left nor right variable is bound.
In the constrained case, I think we can just assign the working-memory support from the constrained variable to the unconstrained variable. Hmm. Strictly speaking, it's probably not even all that relevant. We probably don't care what support is issued from one variable to another.
Regardless, I like the support idea a lot. It seems cleaner. Let me take a quick peek at how we use bindings right now and see if it's consistent.
Allright, I got the "memory element" stuff in last night. After making the above list, I decided that having elements literally support a binding was a wrong-headed thing to do, and went back to having elements support the instantiation directly.
It all works great, except for one thing. I need to figure out how to
handle the RebuildContainer call. I'd made the mistake of
assuming that I could just look at the content element for its
id, and use that to yank elements from the conflict
set. Unfortunately, by the time I get asked to rebuild the container,
the id (or ref) attribute has long since
been changed to the new value.
I could revert the code to just rip stuff out of the content model,
but then the conflict set would be updated. Hmm. Currently I'm
treating ContentIdTestNode::Elements as being equal iff
they have the same content node and same ID. Maybe that's a
bit over-zealous: a content node will never have multiple IDs: maybe I
can just index the thing on the node alone...
Finally got Quantify to work with a large (~2,000) mail folder. Interesting results:
KeySet sucks and needs to use a hashtable. I should
instrument to see how big it gets, but I was seeing fanout of like
2,000 calls to Add() exploding to 8,000,000 calls to
Compare(). Damn!
IsContainer() take 10% of the
overall time - even with the new matching code in place!
new and malloc(). An
arena - especially for hashtable allocations - will be a big win.
I also have noticed that switching away from a large folder seems to be pretty expensive as well. My first guess is that it's because we're removing content from the tree in a random order, rather than from back to front. So putting the old "rip all the nodes out" code back should work fine. I'll also need to leave the conflict set maintenance code in there.
Did a quick n' dirty hashtable implementation for the
KeySet. It puts us right in the game, baby! I'm not
noticing a visible difference between the old and new builders in an
opt build. Both are around 3-4 seconds for a 2,500 message folder.