Project - Table File Standard (Read 68502 times)
Nightcrawler
YaBB Administrator
*****
Offline


The Dark Angel of Romhacking

Posts: 3304
USA
Gender: male
Re: 'Standard' Table File Format
Reply #30 - Sep 22nd, 2010 at 12:02pm
 
DaMarsMan wrote on Sep 22nd, 2010 at 10:52am:
I can see what you mean about making an exception for something that is almost always needed.


Is it really even an exception? "\n" really represents a character that can be part of any string. It is part of the map. The exception is it can be ignored such as in insertion because it's not necessary. It really depends on if you view it as a control or as a character. I can certainly see both sides, but view it more as a character myself.

Quote:
I would say that the best approach would probably be to have dump configuration file standards. I can see how you would have a problem with something like this though...

FE=\n<line>\n//

In this case the dump controls have to be mixed in with the table to get the proper output. Maybe if there were a dump configuration file you could just have an overwrite table character function for the control characters. With our current method, viewing these controls in a hex editor can get kind of nasty when every instance of FE is shown as the string above... An external, separate configuration file could solve some of these issues.


I don't see this as a problem. A hex editor would just ignore \n for in-editor display purposes. They do that already with line breaks and string ends, right? They don't try to actually show the line breaks, but they show up when dumped.
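
To make the distinction concrete, here is a minimal Python sketch. The helper names are made up for illustration and are not taken from any existing tool. It assumes the two characters backslash + n in an entry's text are the line-break control code: a dumper expands them, while something like a hex editor could simply drop them for display.

```python
# Hypothetical helpers; names are illustrative, not from any existing tool.
CONTROL_NEWLINE = "\\n"  # backslash + n, exactly as written in a table file


def text_for_dump(entry_text: str) -> str:
    """Expand the "\\n" control code into a real line break for script output."""
    return entry_text.replace(CONTROL_NEWLINE, "\n")


def text_for_display(entry_text: str) -> str:
    """Drop the "\\n" control code for single-line display, e.g. in a hex editor."""
    return entry_text.replace(CONTROL_NEWLINE, "")


entry = "\\n<line>\\n//"              # right-hand side of FE=\n<line>\n//
print(repr(text_for_dump(entry)))     # '\n<line>\n//'
print(repr(text_for_display(entry)))  # '<line>//'
```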

You can already make tables that use this character in Romjuice, Cartographer, and Klarth's Table Library. The situation has been around for a long time and has never caused an issue. I don't think it would start now. It's not changed much from what we had already.

As far as dumping and insertion standards go, Klarth and I have touched on this briefly. It's certainly a topic for another day. However, one conclusion we started to reach is that most custom dumping scenarios could be accounted for if we had a dumper with batch file and operation abilities, more robust pointer handling and/or scanning, tree structure handling, and robust table switching. We went over several scenarios where we had custom dumpers and why. It turns out we could eliminate many of those situations with just a few improvements in those areas.

I will see if I can address some of those areas in my utility. We plan to pick the conversation up when he's back in the States at some point in the future. Perhaps we can have a public discussion then.

ROMhacking.net - The central hub of the ROM hacking community.
 
Nightcrawler
Re: 'Standard' Table File Format
Reply #31 - Sep 30th, 2010 at 11:57am
 
Some more changes:

Added:
2.2.1 No Entry Found Behavior for Dumping
2.2.2 No Entry Found Behavior for Insertion

Klarth thought it was worth defining/standardizing this behavior and how hex values should be output (<$XX>).

In addition, we decided on a few limitations for simplicity of implementation. Since we only have one control code, a literal "\n" cannot be used in a table entry, as it is always interpreted as a control code. A simple search and replace can be used to work around this.

Along those same lines, a raw hex sequence like "<$XX>" is always interpreted as hex, even if a normal table entry may overlap or conflict.
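
As a rough illustration of those two rules on the insertion side, here is a Python sketch. The function name and the tiny table are invented for the example; this is not how Atlas or Klarth's library is actually written. It assumes single-byte raw hex tokens of the form "<$XX>", treats "\n" as a dumper-only control code that the inserter skips, and ignores characters with no table entry.

```python
import re

RAW_HEX = re.compile(r"<\$([0-9A-Fa-f]{2})>")


def insert_text(text: str, text_to_hex: dict[str, bytes]) -> bytes:
    """Encode text; "<$XX>" is always raw hex, even if a table entry matches too."""
    out = bytearray()
    longest = max(map(len, text_to_hex), default=0)
    pos = 0
    while pos < len(text):
        raw = RAW_HEX.match(text, pos)
        if raw:                                   # raw hex takes precedence
            out.append(int(raw.group(1), 16))
            pos = raw.end()
            continue
        if text.startswith("\\n", pos):           # control code: skipped by inserters
            pos += 2
            continue
        for size in range(min(longest, len(text) - pos), 0, -1):
            chunk = text[pos:pos + size]
            if chunk in text_to_hex:              # longest text match wins
                out += text_to_hex[chunk]
                pos += size
                break
        else:
            pos += 1                              # no entry found: ignore the character
    return bytes(out)


table = {"A": b"\x41", "<$01>": b"\xAA"}          # second entry conflicts with raw hex
print(insert_text("A<$01>\\nA", table).hex())     # 410141
```

The conflicting "<$01>" entry is never reached, which is the limitation described above.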

It's probably not the 'right' thing to do, but we've reached a point where we're not interested in further complexity or in depth modification of existing tools/libraries such as Atlas and Klarth's table library.

What we have already moves us forward, and a compatible dumper and inserter should be a good stopgap for years to come until something new and better is created with appropriate tools (such as new XML formats for tables and scripts, etc.).

My aim here was threefold: 1.) To pull together everything we had and standardize it. 2.) To improve and give us more. 3.) To remain somewhat compatible so we can actually have utilities that use this sooner rather than later.

I think I've about reached those goals. We have much improved features. We will have tangible tools and libraries. We will have my dumper. Klarth has agreed to update Atlas as well as his public C++ table library to use the standard. I can probably nag RedComet into updating Cartographer to use the revised version of Klarth's table library too.

With that said, that's probably enough to hold us over a long while knowing how this community operates. :)

I would say the document is in final review for content. Once content is finalized, I will see about putting it into a PDF or see if Klarth will. Then we just wait for the revision of tools. It will be a while yet, but we're getting there.
 
Tauwasser
Peasant
*
Offline


Evil Impersonator

Posts: 14
Re: 'Standard' Table File Format
Reply #32 - Oct 11th, 2010 at 10:36pm
 
Quote:
1.2.
each line indicates what one hexadecimal binary number/s
equates to in text form


This is somewhat diffuse. I suggest "each line indicates what string of text a sequence of hexadecimal bytes equates to."

Quote:
2.1.3.
0001=Seven

           If a byte sequence 0x00 0x01 is encountered


First off, this is in the wrong area. I'd put 2.1.3.-2.1.5. under 2.2., because they directly relate to normal entries, whereas their relation to encoding is pretty much non-existent, except when you mix file encoding and file syntax together, which you shouldn't.
Secondly, the byte order is at no point made explicit. Of course, we're always talking Big Endian for table entries, but that's only clear from examples, not from a direct statement.

Quote:
2.1.4
Text COllisions:


Typo.

Quote:
2.1.3.
When hex values overlap, the largest hex value should always be used.


"Longest hex sequence." The largest value would be ambiguous, cf.
    00=5
    EE=6
    00ED=7


EE is larger than ED. Of course this is not what is meant at all by your statement.
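
A small Python sketch of the difference, using the three entries above. The function name is made up for illustration. It picks the longest key that prefixes the data, and the last line shows what a naive "largest value" reading would select instead.

```python
table = {bytes.fromhex("00"): "5", bytes.fromhex("EE"): "6", bytes.fromhex("00ED"): "7"}


def longest_match(data: bytes, pos: int, table: dict[bytes, str]):
    """Return (key, text) for the longest table key that prefixes data[pos:], or None."""
    for size in range(max(map(len, table)), 0, -1):
        key = data[pos:pos + size]
        if key in table:
            return key, table[key]
    return None


print(longest_match(bytes.fromhex("00ED"), 0, table))  # (b'\x00\xed', '7')
print(max(table))  # b'\xee': the "largest" key, which is not what is meant
```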

Quote:
2.1.4.
When text values overlap, the largest text value should always be used.


"When text values overlap the entry that represents the longest prefix for the current string of text shall be used."

Quote:
2.2.
     You can also do multi-byte entries like this:
     You can multi-character entries:
     You can combine the two:


    Multi-byte entries look like this:
    Multi-character entries look like this:
    A combination of the two looks like this:


Quote:
2.2.
2.2.1 No Entry Found Behavior for Dumping


"No-Entry-Found Behavior for Dumping" See Hyphen - Compound Modifiers.

Quote:
2.2.
Expected behavior in the event no table match is found for a given hex
value is to output the raw value in the following manner:


"no table entry is found to match a given byte [sequence]"

I think you mean this to be parsed byte-wise, yet a "hex value" that cannot be found could also be "998877". Is the expected behavior to output this as "<$998877>"? This also sidesteps the subject of endianness.
I would mention "byte sequence" above and then explain that each byte is printed separately; however, defining this per byte would also resolve it.

Quote:
2.2.
Note: In the event the "<$XX>" string may overlap or conflict with valid
table entries, hex value insertion should take precedent.


Either it overlaps or it doesn't. There are no entries that may overlap. Also, weak language again.

Quote:
2.2.
Expected behavior in the event no table match is found for a given text
value is to ignore the character and make no hex insertion.


"no table entry is found to match a given text sequence"

Quote:
2.3.
These codes are used by dumpers only and will be ignored by inserters.


Move this up to the start, right after "There are a set of formatting control characters you can use in any of
your table entries to control the formatting and output of your script.". So it will actually be read in context and not get lost in the example.

Quote:
2.3.
there can be no literal representation
of the control code character sequence


"there can be no literal representation of control code character sequences"

Quote:
2.3.
There are a set of formatting control characters you can use in any of
your table entries to control the formatting and output of your script.


"There is a set of formatting control characters any table entry may use to control the formatting and output of the script."

Quote:
2.3.
For flexibility purposes, you can use script formatting values like "\n" to
do something like this for line break and end string control codes:

This will produce something like this at the end of a string:


"For flexibility purposes, control codes like "\n" can be used to achieve effects like the following for line breaks and end string entries:"

"This will produce the following at the end of a string:"

Quote:
2.3.
"Commenting characters"


"Comment delimiters" (w/o quotes)

Quote:
2.4.
In actual text output, your line breaks would still be
controlled via "\n".


-your
Unnecessary and distracts from statement IMO.

Quote:
2.4.
The only requirement is end token hex values must be preceded by the '/' [...]

You may have as many end token entries as you need.



"End token entries must be preceded by a "/" [character]."
"There may be an unlimited number of end tokens."

Quote:
2.4.
A typically string end token might look like this in your table:
You can use any combination of formatting controls and text
representation. This allows for nearly any variation of a string end you want.


"typical"
"Any combination of formatting control codes and text representation may be used. This allows for nearly all variation of string ends."

Quote:
2.4.
In some cases, such as fixed length inserting or other situations, there
may be instances where no actual hex value should be dumped or inserted
when the string end is reached. The following format is acceptable in these
situations.


How can no actual hex value be dumped? If I don't want a particular hex value to be included, there surely has to be a way. Not setting a string for any entry is forbidden per 2.1.5., though.

Quote:
2.5.
[I]f you want to print 2 following bytes after a certain
control code is read [...]

           $0500=<Color>,1


Example is flawed. Also notice typo "If". Also reword without "you".

Quote:
2.6.
Multiple table files is a flexible [...] and dictionary.


"Support of [m]ultiple ... and dictionaries."

Quote:
2.6.
The "trigger" hex value must be preceded by the '@' character.


Seems like it is supposed to be preceded by the '!' character, actually.

Quote:
2.6.
NumberOfTableMatches is the number of table matches to match before falling back to the previous
table. Setting this value to '0' indicates indefinite matching should be done in the new table until an entry is not found in the new table.


"NumberOfTableMatches is the non-negative
number of table matches to match before falling back to the previous table. Setting this value to '0' indicates indefinite matching should be done in the new table until no matching entry is found in the table that was switched in."

I also highly recommend stating that TableID may not contain "," and possibly should have a minimum length of some number of characters, so clashes will occur less often.

Quote:
2.6.
Let's Assume we start with table HIRA.


"assume"

Quote:
HIRA --> KATA --> ア --> イ --> ウ --> KANJI --> 意 --> 0x03 fallback to KATA --> 0x03 fallback to HIRA --> HIRO.


I am aware that this is my own example, however, HIRO is confusable with HIRA on a quick read-through. I suggest you change this to "<Playername>" or some other example that cannot be easily confused.

Quote:
2.6.


I would like to stress the round-trip case some more.

As I had it in mind while designing this, a table change wouldn't automatically invalidate the old table change. So changing from a table 1 that is limited to a number of matches (like 3) to another table 2 that is limited, too (like 5), while not being really feasible, would be possible.

However the match would itself count as a match in table 1. I realize this is a borderline crazy case, but I think that this is the only way all cases can be accounted for.

I think epistemic problems only arise when changing from table 0 to limited table 1 and then in there change to a limited table 0 again. Practically, it's not easy to tell one way or the other, so I'd think this would stay the same as the other cases for ease of implementation. If some crazy game out there does something like that, it's too bad.

Quote:
3.0 How to Handle Common Situations


At least include a little prelude. "This section exemplifies how to handle common problems with..." or some-such.

Quote:
3.1.


In 1.2. you referred to the whole line as an entry, yet, now you only mark the right-hand side to be an entry. I suggest a cleaner explanation or maybe "DictEntry1" etc. In any case, reference to the table file itself and the example dictionary should be clarified.

Quote:
3.1.
In this case, every time hex code 0x30 is encountered, the table will
switch for one math


"match"

Quote:
3.2.
"Normal Entries"


It's not in quotes anywhere else.

Quote:
3.2. & 3.3.
See section 2.7      Table Switching for more information.


It's 2.6.

Quote:
3.2.
special characters


Believe it or not, some people actually consider these to be not special at all and would feel affronted.

Quote:
3.4.
This is a Japanese language specific issue. In short, Hiragana/Katakana
represent the same written syllables in two forms (one for transcribing
foreign words).


"(one for transcribing foreign words)" is not factually accurate and should be cut. Also, it should be "Japanese-language specific", see Hyphen - Compound Modifiers.

Quote:
3.4.
Hiragana/Katakana mark


Those are not marks, suggest "switching".

Quote:
4.1.
It's best practice when
reading the table file, as well as dealing with escape codes (\n) to
handle both types.


Those are actually three types.

Quote:
4.1.
Use encoding aware text/string processing
functions.


"Encoding-aware text/string-processing functions".

Quote:
4.2.
Please be aware some features of the table file format should be treated
different depending upon whether the task is dumping or inserting.


"differently"

Quote:
4.2.
    1.
    2.
    3.
    4.


You didn't use numbered lists for any other section. And here, it's not even required to rank these items.

Quote:
4.2. 3 & 4
Linked Entries are only needed for dumping purposes. They act as a
normal entry for inserting. Inserting would flow through the stream as
normal inserting seeing the hex values in the script.


So it wouldn't catch that "<Color>" means 0x05 0x00? That doesn't sound right...

Quote:
4.3. 2
EndTokens


First time this is written like that.

Quote:
4.3. 3 & 4
Array entries


This is the first time this term is used. You talked briefly about array encoding, but I think this is a remnant of the arrays in the former draft.



The indentation of all chapters after 1 does not follow the same rules as chapter 1. Chapter 2 uses irregular indentation for 2.3. and 2.4.

In general I have to say your language is not very normative. There are shoulds everywhere. I would recommend changing most of those to shalls and musts.

Also you overuse "hex value". Value is something that has to be measured in units. Yet, you actually talk about those units (bytes etc) while saying value. IMO hex value does not qualify for what you're trying to say. It's a sequence of hexadecimal bytes etc. Sequence as in they are stored sequentially in the file (indeed, I know many specs that make this very explicit by explaining BE and LE all over again).

As I have shown above, I would personally eliminate all usage of "you".

cYa,

Tauwasser

Edited:
No italics everywhere anymore.
« Last Edit: Nov 1st, 2010 at 9:43am by Tauwasser »  
 
 
Nightcrawler
Re: 'Standard' Table File Format
Reply #33 - Oct 25th, 2010 at 3:22pm
 
Thank you for that painstaking edit. It certainly needed it and the document has definitely increased in quality as a result. :)

I agreed with and implemented nearly all changes. I will try to be more cognizant of my passive language going forward. I left a few 'shoulds' and 'yous' in sections 3 and 4 for suggestions, as they are not absolute.


A few other questions:

Quote:
*2.2.1
Expected behavior in the event no table match is found for a given hex
sequence, when reduced to a single byte, is to output the raw value in the following manner:


I am unsure how to express what I want to say here. Say the longest hex sequence in the table is 3 bytes and the next 3 hex bytes are $9b7733. If $9b7733 is not found, you search for $9b77. If $9b77 is not found, you search for $9b. If $9b is not found, you finally output <$9b>. On the next iteration you start searching for $7733XX. That's how dumping should work. I'm not sure how to express that here.
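
In code form, the search described here might look roughly like the following Python sketch. The function name and the sample table are invented for the example, and the uppercase "<$XX>" output is an assumption based on the format mentioned earlier. Note that a miss only ever consumes one byte before the search restarts.

```python
def dump(data: bytes, table: dict[bytes, str]) -> str:
    """Dump bytes to text, longest match first; a miss emits exactly one raw byte."""
    longest = max(map(len, table), default=1)
    out = []
    pos = 0
    while pos < len(data):
        for size in range(min(longest, len(data) - pos), 0, -1):
            key = data[pos:pos + size]
            if key in table:
                out.append(table[key])
                pos += size
                break
        else:
            # Nothing matched at this offset, not even a single byte.
            out.append(f"<${data[pos]:02X}>")
            pos += 1          # retry at the very next byte
    return "".join(out)


table = {bytes.fromhex("7733"): "AB", bytes.fromhex("112233"): "C"}
print(dump(bytes.fromhex("9B7733"), table))  # <$9B>AB
```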

Quote:
2.4
An end token can be defined having no actual hex sequence associated
with it. Insertion utilities can use these as indication of end of
string for pointer operations, but not insert an actual hex
representation, such as in the case of fixed length strings. The
following format is acceptable in these situations.


I'm unsure how to better express this functionality. See Atlas 1.1 documentation Page 11 and Cartographer's readme artificial control code feature. There are cases where it's desirable to have an end token dumped where no end token hex sequence is available. During insertion, the end token indicates end of string for pointer calculations, yet no actual hex sequence needs to be inserted.

Quote:
2.6

Tau, you mentioned no ',' in the table id string. Why not? All characters should be valid here. The only requirement is it starts with '!'.

Quote:
2.6
I would like to stress the round-trip case some more.

As I had it in mind while designing this, a table change wouldn't automatically invalidate the old table change. So changing from a table 1 that is limited to a number of matches (like 3) to another table 2 that is limited, too (like 5), while not being really feasible, would be possible.

However the match would itself count as a match in table 1. I realize this is a borderline crazy case, but I think that this is the only way all cases can be accounted for.

I think epistemic problems only arise when changing from table 0 to limited table 1 and then in there change to a limited table 0 again. Practically, it's not easy to tell one way or the other, so I'd think this would stay the same as the other cases for ease of implementation. If some crazy game out there does something like that, it's too bad.


I'm not following this example. Please explain.
 
Tauwasser
Re: 'Standard' Table File Format
Reply #34 - Oct 27th, 2010 at 1:06pm
 
Nightcrawler wrote on Oct 25th, 2010 at 3:22pm:
I am unsure how to express what I want to say here. Say the longest hex sequence in the table is 3 bytes and the next 3 hex bytes are $9b7733. If $9b7733 is not found, you search for $9b77. If $9b77 is not found, you search for $9b. If $9b is not found, you finally output <$9b>. On the next iteration you start searching for $7733XX. That's how dumping should work. I'm not sure how to express that here.

I know what you want to say here. I would mention that only one-byte misses are ever possible, because finding no match for any entry at offset n does not imply there is no entry that fits at offset n+1.

Quote:
In the event no table match is found for a given hex sequence, the first byte of the sequence must be output as a raw value in the following manner:
[...]
Note: This directly follows as there might be a matching entry for the hex sequence starting at its second byte.

The search paradigm of taking the longest match to a hex sequence is dealt with in 2.2.3.

Nightcrawler wrote on Oct 25th, 2010 at 3:22pm:
I'm unsure how to better express this functionality. See Atlas 1.1 documentation Page 11 and Cartographer's readme artificial control code feature. There are cases where it's desirable to have an end token dumped where no end token hex sequence is available. During insertion, the end token indicates end of string for pointer calculations, yet no actual hex sequence needs to be inserted.


I find the current wording not too bad. It explains why this is desirable and demonstrates this. If I had to word it, I would only cut that run-on sentence to two separate sentences:
Quote:
An end token can be defined having no actual hex sequence associated with it.
Insertion utilities can employ these tokens as end-of-string indicators, e.g. for various pointer calculations and fixed-length strings. No actual hex sequence will be inserted in these cases.
The following format is acceptable in these situations.


Nightcrawler wrote on Oct 25th, 2010 at 3:22pm:
Tau, you mentioned no ',' in the table id string. Why not? All characters should be valid here. The only requirement is it starts with '!'.

I thought of this more in the context of generalizing table changes and linked entries. You don't allow U+002C COMMA in linked entries' names ― most likely for easy parsing of the linked entry syntax:
Quote:
$hexadecimal sequence=label,decimal number
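
To show why the comma restriction keeps parsing simple, here is a hedged Python sketch of a parser for that syntax. The class and function names, and the sample line "$05=<Color>,2", are made up for illustration. With no comma allowed in the label, a single split on "=" and one on "," is enough, and a stray comma can be rejected outright.

```python
from typing import NamedTuple


class LinkedEntry(NamedTuple):
    hex_sequence: bytes
    label: str
    following_bytes: int


def parse_linked_entry(line: str) -> LinkedEntry:
    """Parse "$<hex>=<label>,<decimal>"; the label itself may not contain a comma."""
    if not line.startswith("$"):
        raise ValueError("linked entries start with '$'")
    hex_part, sep, rest = line[1:].partition("=")
    label, comma, count = rest.partition(",")
    if not sep or not comma or "," in count or not count.isdecimal():
        raise ValueError(f"malformed linked entry: {line!r}")
    return LinkedEntry(bytes.fromhex(hex_part), label, int(count))


print(parse_linked_entry("$05=<Color>,2"))
```

The same restriction would apply to a TableID if it appears before a comma on a table switch line.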


Nightcrawler wrote on Oct 25th, 2010 at 3:22pm:
I'm not following this example. Please explain.


The example is as follows: Suppose we have three tables. Table 0, 1 and 2. Suppose they have table Ids as indicated below and normal entries for all hex values 0x00 through 0xBF.

Table 0 contains
    C0=TABLE1,3
Table 1 contains
    C1=TABLE2,5
Table 2 is not so important for this example, but does not change tables anymore for the sake of simplicity.

Now we have the following parsing dilemma:

...

The following two solutions come to mind:

  • Solution I: A table change counts towards the number of table matches total:
    ...
  • Solution II: A table change does not count towards the number of table matches total, instead it doesn't count at all:
    ...

I think the most rational choice out of this dilemma is solution I.

However, as I implied in my original post, there might be times when a game switches between tables nested inside each other and expects these changes to basically reset some sort of "table stack". I have yet to see any game actually do that, so I'd say implementing solution I is pretty safe. The above example seems constructed specifically to break this implementation, as it doesn't seem sensible to me to implement table switching that way.
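
For what it's worth, solution I could be sketched in Python roughly as follows. Everything here is an assumption for illustration: the function name, the "TABLE0" id, the dict representation of tables, single-byte entries only, and the rule that a miss in a switched-in table falls back and retries the same byte. A table switch entry is stored as a (TableID, NumberOfTableMatches) tuple.

```python
def dump_with_switching(data: bytes, tables: dict, start: str) -> str:
    """Dump single-byte entries with a table stack; switches count as matches."""
    stack = [[start, None]]           # frames: [table id, matches left]; None = unlimited
    out = []
    pos = 0
    while pos < len(data):
        while stack[-1][1] == 0:      # drop frames whose match budget is used up
            stack.pop()
        frame = stack[-1]
        entry = tables[frame[0]].get(data[pos])
        if entry is None:
            if len(stack) > 1:        # assumption: a miss falls back and retries the byte
                stack.pop()
                continue
            out.append(f"<${data[pos]:02X}>")
            pos += 1
            continue
        if frame[1] is not None:
            frame[1] -= 1             # solution I: every match counts, switches included
        if isinstance(entry, tuple):  # (table id, NumberOfTableMatches)
            table_id, count = entry
            stack.append([table_id, count or None])   # 0 means "until a miss"
        else:
            out.append(entry)
        pos += 1
    return "".join(out)


tables = {
    "TABLE0": {0x00: "a", 0xC0: ("TABLE1", 3)},
    "TABLE1": {0x00: "b", 0xC1: ("TABLE2", 5)},
    "TABLE2": {0x00: "c"},
}
data = bytes.fromhex("C000C1" + "00" * 7)
print(dump_with_switching(data, tables, "TABLE0"))  # bcccccba
```

In this sketch the C1 switch uses up one of TABLE1's three matches, so only one more byte is read from TABLE1 after TABLE2 is exhausted. If the switch did not count, the output would end in "bb" instead of "ba".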

So I hope to have cleared that up :)

cYa,

Tauwasser
 
Nightcrawler
Re: 'Standard' Table File Format
Reply #35 - Oct 28th, 2010 at 3:29pm
 
Ok. I tried to fix up 2.2.1 and 2.2.2 a bit with the wording on only one byte/character misses being possible. I didn't want to get too wordy for fear of further over-complicating a simple matter.

I made a note on no comma for TableID. I was only thinking about the TableID line and wasn't thinking about the TableID being also used on the table switch entry line, which of course can't have a comma without complicating parsing. Doh! *headsmack*.

I agree with the table switching itself being counted toward the number of matches. In fact, that's how I thought it was already implied to work. I added a small note to make it clear.


Embedded Pointers:

One new item came to the table recently as I was working on TextAngel, compatibility with Atlas, and support of this table format. The one big thing Atlas handles that my dumper doesn't is embedded pointers. Maybe this doesn't belong, but it's very similar to linked entries and it could possibly be very useful. I thought it was worth thought and consideration if nothing else.

It's kind of like a 'linked entry' where a control code is encountered and one or more placeholders for embedded pointers would follow. I can see this defined in a similar manner. Say

FC=<yes/no>,2

When the <yes/no> control is encountered, we know there will be two embedded pointers afterward and output "<yes/no>#EMBSET(0)#EMBSET(1)" to the script file. That would get us the placeholders (#EMBSET commands). However, I can't think of a sensible way to tell the dumper when to write to the placeholders (to get the #EMBWRITE commands). Atlas examples suggest you would just use the next end tokens. However, when you hit an end token, how do you know whether an embedded pointer should be written or a normal pointer in the larger table? It seems like you just write to the embedded table until it's exhausted (no more defined #EMBSETs to write to) and then fall back to the main pointer table.

I don't know how flexible that would be. I think I recall seeing games with a similar yes/no embedded setup, but different behavior.

Anyway, this might be something to consider trying to define for the table format? It might be a large leap forward as no available dumpers handle this in any capacity that I know of. It would be a dumping issue only. We obviously don't want to get into pointer specific information in the table file though. I was trying to keep it abstracted, however it seems the number of pointers and number of bytes will come into play. I'm not sure how it could be done otherwise.

Thoughts?
« Last Edit: Oct 28th, 2010 at 5:44pm by Nightcrawler »  

 
Tauwasser
Re: 'Standard' Table File Format
Reply #36 - Nov 1st, 2010 at 9:31am
 
Nightcrawler wrote on Oct 28th, 2010 at 3:29pm:
Anyway, this might be something to consider trying to define for the table format? It might be a large leap forward as no available dumpers handle this in any capacity that I know of. It would be a dumping issue only. We obviously don't want to get into pointer specific information in the table file though. I was trying to keep it abstracted, however it seems the number of pointers and number of bytes will come into play. I'm not sure how it could be done otherwise.

I think this is a big minus. It's very specific to add pointer information and it might also be console specific. I have personally never encountered a game that uses this format on the GBC, because most separate control flow from the text shown.
Maybe the programming paradigm was different back in SNES days?

Also, personally I think defining this as a linked entry and then going over it with RegEx or a Beanshell script is much more flexible and accessible.

This really boils down to your perception of a table file.

Quote:
[A table file]'s sole purpose is to act as a hex to text and text to hex encoding file. Basically, this means it's a table to turn binary data into text and vice versa.

I think including pointers in there does not match this definition anymore. It would not only translate text to hex and vice versa, it would also need to translate logic embedded in the text.

To reiterate, personally, I think we'd be better off letting those things be a linked entry and then formatting files with regular expressions/Beanshell scripts. That way you
  • have no hassle with including pointer formats and lengths in the table file, and
  • can leave the further processing up to the user via much more dynamic processes including scriptable conversions.

For instance, Beanshell is basically Java. So you can use its RegEx capabilities to capture groups, modify the data contained in there however you like, put the string in a stringbuilder and output that.
Using specific ATLAS commands will only glue the user to having to use ATLAS for insertion or somehow get the data that was replaced by EMBSET back if he cares for literal data.
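
As a rough idea of what such post-processing could look like, here is a sketch in Python rather than Beanshell. It rests on several assumptions that are not from the spec: that the dumper wrote "<yes/no>" followed by four raw "<$XX>" bytes, that those bytes are two little-endian 16-bit pointers, and that the rewritten tag format is whatever the user happens to want.

```python
import re

# Assumption: a linked entry dumped "<yes/no>" followed by four raw bytes,
# which hold two little-endian 16-bit pointers.
YES_NO = re.compile(r"<yes/no>((?:<\$[0-9A-Fa-f]{2}>){4})")
RAW_BYTE = re.compile(r"<\$([0-9A-Fa-f]{2})>")


def annotate_pointers(script: str) -> str:
    """Rewrite the raw bytes after <yes/no> as two readable pointer values."""
    def rewrite(match: re.Match) -> str:
        raw = bytes(int(h, 16) for h in RAW_BYTE.findall(match.group(1)))
        first = int.from_bytes(raw[0:2], "little")
        second = int.from_bytes(raw[2:4], "little")
        return f"<yes/no yes=${first:04X} no=${second:04X}>"
    return YES_NO.sub(rewrite, script)


print(annotate_pointers("Save?<yes/no><$34><$12><$78><$56>"))
# Save?<yes/no yes=$1234 no=$5678>
```

The literal bytes stay recoverable from the rewritten tag, and nothing about pointer size or byte order has to live in the table file.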

EDIT: Is there any particular reason why entries with blanks on the right-hand side are still illegal?

I think they're perfectly acceptable, for dumping and for inserting.

cYa,

Tauwasser

Edited:
Comment added.
« Last Edit: Nov 3rd, 2010 at 3:27pm by Tauwasser »  
 
 
Nightcrawler
Re: 'Standard' Table File Format
Reply #37 - Nov 5th, 2010 at 11:02am
 
Tauwasser wrote on Nov 1st, 2010 at 9:31am:
I think this is a big minus. It's very specific to add pointer information and it might also be console specific. I have personally never encountered a game that uses this format on the GBC, because most separate control flow from the text shown.
Maybe the programming paradigm was different back in SNES days?

Also, personally I think defining this as a linked entry and then going over it with RegEx or a Beanshell script is much more flexible and accessible.

This really boils down to your perception of a table file.


Yes. I agree. I have given this more thought. It doesn't belong shoe-horned in the table. Even if the definition is stretched, there's no way to abstract the details enough and adding anything Atlas (or any utility) specific is not desirable.  It is indeed better handled with post processing.

Quote:
EDIT: Is there any particular reason why entries with blanks on the right-hand side are still illegal?

I think they're perfectly acceptable, for dumping and for inserting.


They are illegal because they invalidate the map. You cannot map something to nothing or nothing to something.

$FE=
$FC=

What does this example say? Hex sequences FE and FC map to 'nothing' for dumping. And we're saying 'nothing' maps to FE and FC for insertion. This is an illogical fallacy. It's definitely illegal for insertion direction. It has two of the same text sequences mapping to different hex sequences. If it's not part of the map, why is it in the table file? Klarth and I thought that both invalid map sequences like this, and unrecognized lines in general, should be illegal and generate an error.

At best, it could be valid for dumping direction only, depending on your perception of 'nothing'. I suppose it's the only way to skip processing of a particular hex sequence altogether and output nothing at all. Actually, that may have been my original intention: to make it invalid in one direction only (not totally invalid like it says).


With that explanation, do you still think they are perfectly acceptable? If so, why?
« Last Edit: Nov 5th, 2010 at 2:58pm by Nightcrawler »  

 
Tauwasser
Re: 'Standard' Table File Format
Reply #38 - Nov 7th, 2010 at 12:13pm
 
Nightcrawler wrote on Nov 5th, 2010 at 11:02am:
They are illegal because they invalidate the map. You cannot map something to nothing or nothing to something.

First off, they do not map "something to nothing". They map a hex sequence to the empty string and thus do not invalidate the map. Secondly, I have seen games that do this.
Now I will pull a Pokémon example out of my hat again, but 0x35 does exactly this in Gold/Silver. They are simply skipped.
What purpose does this serve?
  • You can implement variable-width player names without much hassle. Many NES games, for instance, tend to print a fixed 5 characters for the name. This results in redundant whitespace.
    However, if the name is stored with ignorable characters to fill the five character limit, all printing will be correct the first time around, without much legwork.
  • You can programmatically fix British English to be American English for "-our/-or" cases like colour vs. color.
    This way, you don't have to QC the assembly again, because the script will have the same size in bytes (and indeed, the replacement could even be made in a compiled image).
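
To make the "ignorable character" idea concrete, here is a minimal Python sketch of a dumper in which one hex sequence maps to the empty string. It is only an illustration: the function name `dump`, the `<$XX>` fallback for unknown bytes, and the byte values used for the letters are assumptions, not the real Gold/Silver encoding and not anything the spec prescribes.

```python
def dump(data, hex_to_text):
    """Dump bytes to text, trying the longest hex key first at each position."""
    # Keys are hex strings, so two hex digits correspond to one byte.
    max_bytes = max((len(key) for key in hex_to_text), default=0) // 2
    out = []
    pos = 0
    while pos < len(data):
        for size in range(min(max_bytes, len(data) - pos), 0, -1):
            key = data[pos:pos + size].hex().upper()
            if key in hex_to_text:
                # An entry such as "35=" contributes the empty string here.
                out.append(hex_to_text[key])
                pos += size
                break
        else:
            # Hypothetical fallback: show the raw byte when no entry matches.
            out.append(f"<${data[pos]:02X}>")
            pos += 1
    return "".join(out)


# Hypothetical table: 0x35 is the ignorable filler, the letters are made up.
table = {"35": "", "80": "A", "81": "S", "82": "H"}

# A three-letter name padded to five bytes with the filler value.
name = bytes([0x80, 0x81, 0x82, 0x35, 0x35])
print(dump(name, table))  # ASH
```

The padded name dumps without any trailing filler, which is the behavior described in the first bullet.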

Nightcrawler wrote on Nov 5th, 2010 at 11:02am:
This is an illogical fallacy.

That one made me snicker :)

Nightcrawler wrote on Nov 5th, 2010 at 11:02am:
It's definitely illegal for insertion direction.

As far as I am concerned, the insertion problem is about finding a match for the longest prefix of size ≧ 1 (in characters) of a given text sequence at a given position [in the text] in the table file and inserting its [the match's] hexadecimal sequence into the rom at a specific position.
Per definitionem, the empty string has a size of 0, so it is not included to find a map for it in the insertion process.
This is even evident from the current wording of the spec:

Quote:
Expected behavior in the event no table entry is found to match a given text sequence is to ignore the character and make no hex insertion.

You want to ignore a character, so the empty string could not have been included in the search for a map in text ⇒ hex direction to begin with, as it contains no characters.
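
A minimal Python sketch of insertion as described here, assuming a plain text-to-hex dictionary: at each position only prefixes of length 1 or more are tried, longest first, and a character with no entry is ignored as the quoted spec wording says. The function name `insert` and the sample table are hypothetical.

```python
def insert(text, text_to_hex):
    """Encode text by longest-prefix matching; prefixes always have length >= 1."""
    longest = max((len(key) for key in text_to_hex), default=0)
    out = bytearray()
    pos = 0
    while pos < len(text):
        # Candidate sizes run from the longest key length down to 1, never 0,
        # so an entry for the empty string is never looked up.
        for size in range(min(longest, len(text) - pos), 0, -1):
            hex_seq = text_to_hex.get(text[pos:pos + size])
            if hex_seq is not None:
                out += bytes.fromhex(hex_seq)
                pos += size
                break
        else:
            # No entry found: ignore the character and make no hex insertion.
            pos += 1
    return bytes(out)


# Hypothetical table that also contains an empty-string entry ("35=").
table = {"": "35", "A": "80", "AB": "C0"}
print(insert("ABA?", table).hex().upper())  # C080
```

The empty-string entry sits in the dictionary but never influences the output, because the search never asks for a zero-length prefix.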

Nightcrawler wrote on Nov 5th, 2010 at 11:02am:
If it's not part of the map, why is it in the table file?


How does this invalidate a map? It's still a map, just not an injective one.

Nightcrawler wrote on Nov 5th, 2010 at 11:02am:
With that explanation, do you still think they are perfectly acceptable? If so, why?


I hope I have argued my point of view and why I think it's perfectly acceptable.


I just noticed that the doc still has "No Entry Found" instead of "No-Entry-Found" in most places. It's even inconsistent with the index. Indeed, it says "[...] Insertion" in the index while it says "[...] Inserting" in the body.
And yes, I am pedantic :) However, I'm just trying to help improve the spec and get common cases "under the lid" here.

cYa,

Tauwasser

Edited:
Yeah, I meant injective, not surjective. Sorry about that. I think that part was clear though.
« Last Edit: Nov 14th, 2010 at 9:44am by Tauwasser »  
 
 
Nightcrawler
YaBB Administrator
*****
Offline


The Dark Angel of Romhacking

Posts: 3304
USA
Gender: male
Re: 'Standard' Table File Format
Reply #39 - Nov 9th, 2010 at 1:52pm
 
Tauwasser wrote on Nov 7th, 2010 at 12:13pm:
First off, they do not map "something to nothing". They map a hex sequence to the empty string and thus do not invalidate the map. Secondly, I have seen games that do this.


I figured that's what your perception of 'nothing' would be when I wrote my last response. This is only valid in a limited sense. An empty string is a text sequence like any other. Let's look at this:

35=
7B=

That's fine for dumping. However, you can have only ONE value of that type without issue for insertion. Otherwise, if the table is loaded in the insertion direction, you have an (illegal) duplicate where the same text sequence maps to different hex sequences. You can't have duplicates in a valid hash map.

See the issue there? You can't use that same table for insertion.

Opposite counterpart:
=test

Besides having the same issues as the other case, this is illegal because an 'empty' hex sequence is malformed and not allowed. Also, we already defined that when no insertion match can be made for a text sequence, it is ignored, which gives the same desired end result.


Quote:
As far as I am concerned, the insertion problem is about finding a match for the longest prefix of size ≧ 1 (in characters) of a given text sequence at a given position [in the text] in the table file and inserting its [the match's] hexadecimal sequence into the rom at a specific position.
Per definitionem, the empty string has a size of 0, so it is not included to find a map for it in the insertion process.


Right. You just reiterated my point. You just said it is not included in the map. It's illegal for the map in one direction, and anything illegal for the map generates an error.

You seem to suggest making a special exception during processing to ignore that type of line for insertion, while I believe no special exception should be made, and it should generate an error when loaded in that direction, because it is not part of the map.

Quote:
I hope I have argued my point of view and why I think it's perfectly acceptable.


I understand your point of view, however I believe it is valid in one direction only (or in both directions if there is only one such entry of this type). In the cases where it is not valid, I would expect an error. You can still do what you want, you just can't load the same table in the dumping and inserting direction when they have duplicates like that. In fact, the same concept is applied to any duplication in hex or text sequences, which I forgot to include in the document.

In summary, I would propose this amendment to make it clearer:

Quote:
2.2.5 Illegal Sequences:
     
           Duplicate Entry:
           
                 00=test
                 00=test
                 
                 Full duplicate entries are not allowed and shall generate an error.
                 
                 00=test
                 01=test
                 
                 Duplicate text sequences are not allowed when the table is loaded in the inserting direction.
                 The same text sequences cannot map to multiple hex sequences for inserting.
                 
                 00=test
                 00=test2
                 
                 Duplicate hex sequences are not allowed when the table is loaded in the dumping direction.
                 The same hex sequence cannot map to multiple text sequences for dumping.
                 
           Blank or Empty entries:
           
                 00=
                 01=
                 
                 Multiple blank text sequences are not allowed when the table is loaded in the inserting direction.
                 The same 'empty string' text sequence cannot map to multiple hex sequences for inserting.
                 
                 =test
                 
                 A 'blank' hex sequence is not allowed.
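
One way a loader might enforce the proposed 2.2.5 rules, as a minimal Python sketch. The names `load_table` and `TableError` are hypothetical, lines are assumed to be plain `hex=text` entries, and other checks from the spec (even hex length, control codes, and so on) are deliberately left out.

```python
class TableError(ValueError):
    """Raised for entries the proposed section 2.2.5 would reject."""


def load_table(lines, direction):
    """Build a hex->text map ("dump") or a text->hex map ("insert")."""
    if direction not in ("dump", "insert"):
        raise ValueError("direction must be 'dump' or 'insert'")
    seen_pairs = set()
    mapping = {}
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue  # blank lines are not entries
        hex_seq, sep, text = line.partition("=")
        if not sep:
            raise TableError(f"line {number}: unrecognized line")
        if not hex_seq:
            raise TableError(f"line {number}: blank hex sequence")
        hex_seq = hex_seq.upper()
        if (hex_seq, text) in seen_pairs:
            raise TableError(f"line {number}: full duplicate entry")
        seen_pairs.add((hex_seq, text))
        # The key depends on the direction the table is loaded in.
        key, value = (hex_seq, text) if direction == "dump" else (text, hex_seq)
        if key in mapping:
            raise TableError(f"line {number}: duplicate key {key!r} for {direction}")
        mapping[key] = value
    return mapping


print(load_table(["00=", "01="], "dump"))  # {'00': '', '01': ''}
try:
    load_table(["00=", "01="], "insert")
except TableError as err:
    print("rejected:", err)
```

With this shape, a single blank text entry such as `00=` loads in both directions, while two of them load only for dumping.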


What do you think about that?

Quote:
I just noticed that the doc still has "No Entry Found" instead of "No-Entry-Found" in most places. It's even inconsistent with the index. Indeed, it says "[...] Insertion" in the index while it says "[...] Inserting" in the body.
And yes, I am pedantic :) However, I'm just trying to help improve the spec and get common cases "under the lid" here.


Corrected in my working copy. I don't mind the feedback. This document isn't that popular, so few are interested in reading it over in detail.
 
Tauwasser
Peasant
*
Offline


Evil Impersonator

Posts: 14
Re: 'Standard' Table File Format
Reply #40 - Nov 10th, 2010 at 10:05pm
 
Nightcrawler wrote on Nov 9th, 2010 at 1:52pm:
Let's look at this:

35=
7B=

That's fine for dumping. However, you can have only ONE of those type of values without issue for insertion.


See below.

Nightcrawler wrote on Nov 9th, 2010 at 1:52pm:
You can't have duplicates in a valid hash map.


I argue very much against this point. It would just be a map to a string array, which is perfectly valid.

Of course, using a naïve implementation would also not break the hash map. Most languages I have used replace the value when another entry with the same key is "added".
So then the insertion process would be determined by chance:

Code:
3B=Test
7B=Test 



Might insert 0x3B, or 0x7B, depending on the programming language, the order in which the table is read, and the will of the programmer. However, from a practical point of view, since both of these hex sequences map to exactly the same "Test" in reverse lookup, there is no problem using only one of these values for all the occurrences of "Test". If there were a problem, the hex sequences could not mean the same "Test" under all circumstances and would therefore not be considered the same (so a premise is invalidated).

More problematic would be the following:

Code:
3B=Test
7B7B7B7B7B=Test 



This would greatly increase the script size upon insertion, however, it would not break the hash map with the naïve implementation above.

A clever implementation might implement a map from strings to a hex sequence array and use the shortest sequence available or alternatively do some simple math while reading the table (or when "adding" doesn't replace the value, but throws an exception) like so:

Code:
        'Regular expression for matching
        Dim regObj As Regex = New Regex("^([a-fA-F0-9]{2,})=(.*)$")
        Dim matchObj As Match = Nothing
        'ReadLine, init to != Nothing
        Dim readLine As String = ""
        Dim hexGroup As String = Nothing
        Dim textGroup As String = Nothing

        'Prep hashtables, case-sensitive
        hexTextHashTable = New Dictionary(Of String, String)(StringComparer.InvariantCulture)
        textHexHashTable = New Dictionary(Of String, String)(StringComparer.InvariantCulture)

        Do Until readLine Is Nothing

            readLine = reader.ReadLine()

            'Don't care about empty lines
            If (Not String.IsNullOrEmpty(readLine)) Then

                'Match using regEx
                matchObj = regObj.Match(readLine)
                'Get at least two groups, or skip!
                If (matchObj.Success) Then

                    'Match Group 0 is entire regex Match
                    'Match Group 1 is hexadecimal
                    'Match Group 2 is text

                    'Make sure we got an even number of hex digits
                    hexGroup = matchObj.Groups(1).Value.ToUpperInvariant
                    textGroup = matchObj.Groups(2).Value
                    If hexGroup.Length Mod 2 <> 0 Then Continue Do

                    'Add to tables
                    hexTextHashTable.Add(hexGroup, textGroup)

                    ' If (not (is in table)) OR (isin table, but value in table is longer hex sequence)
                    If (Not textHexHashTable.ContainsKey(textGroup) OrElse textHexHashTable(textGroup).ToString().Length > hexGroup.Length) Then

                        'This will not throw an exception when key textGroup is already in dictionary.
                        textHexHashTable(textGroup) = hexGroup

                    End If

                End If

            End If

        Loop
 



This is a working Unicode-supporting table class I actually wrote some days ago. Note that while I'm using Dictionary(Of String, String), I might as well use Hashtable. Hashtable is not type-safe in VB.NET; Dictionary is. The indexer syntax is exactly the same for both, so assigning through it won't throw an exception for an existing key in either case.
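
For readers who do not use VB.NET, the same "keep the shortest hex sequence" rule can be sketched in Python. This is a loose port under stated assumptions, not the class itself: the name `load_both` is hypothetical, the regular expression requires whole byte pairs (so odd-length hex is skipped, as in the code above), and a tie in length keeps the entry that was read first.

```python
import re

# One or more hex byte pairs, "=", then the text (which may be empty).
ENTRY = re.compile(r"^((?:[0-9A-Fa-f]{2})+)=(.*)$")


def load_both(lines):
    """Return (hex_to_text, text_to_hex), preferring the shortest hex per text."""
    hex_to_text = {}
    text_to_hex = {}
    for raw in lines:
        match = ENTRY.match(raw.rstrip("\r\n"))
        if match is None:
            continue  # empty or unrecognized lines are skipped in this sketch
        hex_seq, text = match.group(1).upper(), match.group(2)
        if hex_seq in hex_to_text:
            # Duplicate hex keys stay an error, like Dictionary.Add() above.
            raise ValueError(f"duplicate hex sequence: {hex_seq}")
        hex_to_text[hex_seq] = text
        current = text_to_hex.get(text)
        # Replace only when the text is new or the new hex is strictly shorter.
        if current is None or len(current) > len(hex_seq):
            text_to_hex[text] = hex_seq
    return hex_to_text, text_to_hex


hex_to_text, text_to_hex = load_both(["7B7B7B7B7B=Test", "3B=Test"])
print(text_to_hex["Test"])  # 3B
```

Both hex sequences still dump to "Test", but insertion uses the one-byte sequence regardless of the order in which the entries appear.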

Nightcrawler wrote on Nov 9th, 2010 at 1:52pm:
Tauwasser wrote on Nov 7th, 2010 at 12:13pm:
As far as I am concerned, the insertion problem is about finding a match for the longest prefix of size ≧ 1 (in characters) of a given text sequence at a given position [in the text] in the table file and inserting its [the match's] hexadecimal sequence into the rom at a specific position.
Per definitionem, the empty string has a size of 0, so it is not included to find a map for it in the insertion process.


Right. You just reiterated my point. You just said it is not included in the map.


Nope. I just said it is not included in the search! It can be included in the map. The point here is, you never look for the empty string in the first place, because it is empty.
Therefore, it doesn't matter whether it is included in a map or not, because you will never look for it, much like you never look for a text match for the hex sequence of length zero.
Therefore the special case of having a non-injective set of table entries for the empty string text sequence coincides with the case of having any non-injective set of table entries (on the text side).

Nightcrawler wrote on Nov 9th, 2010 at 1:52pm:
Otherwise, if the table is loaded in the insertion direction, you have an (illegal) duplicate where the same text sequence maps to different hex sequences.
[...]
It's illegal for the map in one direction[...].


Just to reiterate, it is not illegal. There is nothing per se illegal about a map that is not injective in this case.

Nightcrawler wrote on Nov 9th, 2010 at 1:52pm:
You seem to suggest making special exception during processing to ignore that type of line for insertion, while I believe no special exception should be made, and it should generate an error when loaded in that direction[...].


I suggest that the empty string does not impose the need of special exceptions for insertion because it was never included in the insertion problem to begin with.
I would, however, suggest that special exceptions can be made for non-injective entries in general, like I have shown above.
One basically does not need to care about these cases at all in most programming languages and the creation of a hashmap will work. This comes with the trade-off that the length of the mapped hex sequence for any string might not be the shortest in the table.
However, basically one simple line in a very basic implementation can already take care of this scenario without the need to impose somewhat complex algorithms and heuristics (as are needed for an optimal insertion with multiple tables for instance).

Nightcrawler wrote on Nov 9th, 2010 at 1:52pm:
I believe it is valid in one direction only (or in both directions if there is only one such entry of this type).

You come back to non-injective maps being "illegal" over and over again. Thereby, you exclude a very common use case! Off of the top of my hat, RHDN sometime last week: Here. Granted, the user will probably not need to ever import the Japanese script back into the game, however, you want to make him dump the script with "物1" and "物2" just to search-and-replace this then in all his script files? C'mon!
Also, the ease with which I could come up with a real-life example should frighten you, because it could mean a v1.1 of the spec rather soon after its release.

Nightcrawler wrote on Nov 9th, 2010 at 1:52pm:
[Y]ou just can't load the same table in the dumping and inserting direction when they have duplicates like that.

Well, I have shown above that even a very simple implementation can handle this problem, and it isn't the simplest one, which I have also mentioned: for a by-chance implementation, you would not even need to check whether a given key is already in the hashmap. You just need to add it to the hashmap and the programming language's standard implementation will silently overwrite any duplicate. It doesn't get any easier than that from a practical point of view.

Nightcrawler wrote on Nov 9th, 2010 at 1:52pm:
In fact, the same concept is applied to any duplication in hex [...] sequences [...].

The hex sequence cannot have duplicates per definitionem, I'm not arguing those cases.

In all actuality, I thought that was what you meant with your example:
Quote:
Duplicate Entry:
           
                 00=test
                 00=test
                 
                 Duplicate entries of any kind are not allowed and shall generate error.


Now over the course of this dialogue it seems to surface that this paragraph is not only talking about duplicate hex sequences ― how I interpreted it before ― but actually also about duplicate text sequences as well, which I had not assumed.

Nightcrawler wrote on Nov 9th, 2010 at 1:52pm:
What do you think about that?

I still find the exclusion of non-injective tables, as well as of text sequences containing only the empty string, to be an unnecessary limitation of this spec and therefore to be null and invalid.


I can see the following use cases per entry of a table file:
  • Hex and text sequences are unique:
    There is no problem identifying corresponding pairs in any hashtable implementation. The empty string is included.
  • Hex sequence is unique, text sequence is duplicate:
    There is no ambiguity in dumping direction. The empty string is included. There is ambiguity in insertion direction. I have shown above how this can quickly and sensibly be resolved with a simple heuristic of taking the shortest hex sequence in the table for insertion. Either the text sequences are identical under all circumstances and therefore it does not matter which hex sequence is inserted, or they are not identical and the hex sequences should therefore not map to identical text sequences to begin with.
  • Hex sequence is ambiguous, text sequence is unique or ambiguous:
    Ambiguous hex sequences are forbidden per definitionem as no good case can be made for preferring one hex sequence over another hex sequence based on their text sequences. This case is therefore to be forbidden.
  • The hex sequence is the empty hex sequence, text is unique or ambiguous:
    The empty hex sequence is nonsensical, because it is always a prefix sequence of itself. Therefore there would be an unlimited number of empty hex sequences to be dumped between two bytes of the file in question. This case is therefore to be forbidden.


Before you try to make a case against the empty string out of the argument that it is always a prefix of itself (which is true): It is still not included in the insertion problem, because it has length 0 and only prefixes of length ≧ 1 are included in the insertion problem.

I hope I have shed some light onto my reasoning and why I think the non-injective-table case can be dealt with in an easy, comprehensive and logical fashion.

cYa,

Tauwasser

Edited:
Edited mistake in example code. Some language refinements.

Edited:
Some more explanation regarding 物 duplicate over at RHDN.

Edited:
Yeah, I meant injective, not surjective. Sorry about that. I think that part was clear though.

« Last Edit: Nov 14th, 2010 at 9:44am by Tauwasser »  
 
 
Nightcrawler
YaBB Administrator
*****
Offline


The Dark Angel of Romhacking

Posts: 3304
USA
Gender: male
Re: 'Standard' Table File Format
Reply #41 - Nov 11th, 2010 at 9:50am
 
Tauwasser wrote on Nov 10th, 2010 at 10:05pm:
I argue very much against this point. It would just be a map to a string array, which is perfectly valid.

Of course, using a naïve implementation would also not break the hash map. Most languages I have used replace the value when another entry with the same key is "added".


Really? What languages? All the .NET languages, C++, and Java don't.

See Dictionary.Add()
"ArgumentException - An element with the same key already exists in the Dictionary<TKey, TValue>."

C++ does not either.
"Because map containers do not allow for duplicate key values, the insertion operation checks for each element inserted whether another element exists already in the container with the same key value, if so, the element is not inserted and its mapped value is not changed in any way."

Java does not either.
"An object that maps keys to values. A map cannot contain duplicate keys; each key can map to at most one value."

There are other ways you can accomplish the map and allow duplicates, but why make it more difficult, be less efficient, and likely require more code? A basic hash map is all that is needed. Nearly all implementations I know of don't like duplicate keys.

Quote:
So then the insertion process would be determined by chance:


I don't like that at all. No behavior should be undefined.


Quote:
A clever implementation might implement a map from strings to a hex sequence array and use the shortest sequence available or alternatively do some simple math while reading the table (or when "adding" doesn't replace the value, but throws an exception) like so:

This is a working Unicode-supporting table class I actually wrote some days ago. Notice while I'm using Dictionary(Of String, String), I might as well use Hashtable. Hashtable is not typesafe in VB.NET, Dictionary is. The syntax is exactly the same for both, meaning it won't throw exceptions either.


Your code illustrates the extra processing required to avoid trying to add duplicate keys to the dictionary. My aim is to stay away from needing a 'clever' implementation and head toward sheer simplicity. Most programmers in our community are struggling, self-taught ones. Why shouldn't you be able to just split the table line and stick it in the dictionary as-is for all normal entries? Obviously we have a few hoops, such as checking for blank lines and validating the even hex length, but why keep adding more?

Instead, I think the table should follow a hash map from a conceptual point of view. Allow duplicate values, but not duplicate keys. Keep all cases clearly defined. Why do we have to start making exceptions and adding undefined/chance behavior? I'm just not seeing the need. You can still accomplish what you want to accomplish. You just need a different table for dumping and inserting if you intend to use tables that would result in key duplication when processed in the other direction.


Quote:
You come back to non-surjective maps being "illegal" over and over again. Thereby, you exclude a very common use case! Off of the top of my hat, RHDN sometime last week: Here. Granted, the user will probably not need to ever import the Japanese script back into the game, however, you want to make him dump the script with "物1" and "物2" just to search-and-replace this then in all his script files? C'mon!
Also, the ease with which I could come up with a real-life example should frighten you, because it could mean a v1.1 of the spec rather soon after its release.


You misunderstand how this situation is handled. I haven't excluded any use case, including this one. There would be a difference between the table used for dumping and insertion.

Dumping Table:
01=物
02=物
03=物

There's no problem. There can be duplicate values. You can have as many as you want map to that Kanji.

Inserting Table:

02=物

For inserting, you need to have ONE so it is clearly defined what 物 should map to, and would not cause any duplicate key scenario.

Does that make sense? You can do exactly what you want, you just can't dump and insert with the exact same table. That's probably already the case for many utilities. That is how these situations can be handled and still conform to the paradigm of a hash map. We do not need to break it and add additional exceptions in code.
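
A minimal Python sketch of this two-table workflow, assuming one byte per character for brevity. The byte values mirror the example tables above; the function names are hypothetical.

```python
# Dumping table: several hex values may map to the same kanji.
dump_table = {0x01: "物", 0x02: "物", 0x03: "物"}

# Inserting table: exactly one hex value per text sequence.
insert_table = {"物": 0x02}


def dump(data):
    """Hex -> text using the dumping table."""
    return "".join(dump_table[byte] for byte in data)


def insert(text):
    """Text -> hex using the inserting table."""
    return bytes(insert_table[char] for char in text)


original = bytes([0x01, 0x02, 0x03])
script = dump(original)
rebuilt = insert(script)
print(script)                  # 物物物
print(rebuilt.hex().upper())   # 020202
```

Note that the rebuilt bytes differ from the original ones. That is harmless only if the three codes really are interchangeable in the game, which is the premise discussed earlier in the thread.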

Quote:
I hope I have shed some light onto my reasoning and why I think the non-surjective-table case can be dealt with in an easy, comprehensive and logical fashion.


I agree 100% with you that these use cases need to be accounted for. I've been around enough years to have certainly seen and used them myself.  They are handled in the manner I showed above with my RHDN case handling example.

It seems our disagreement is on how they should be handled. I really want to stick with following the paradigm of a hash map, simplify implementation, and have clearly defined behavior. The only negative is requiring table modification for dumping vs. insertion in those cases. However, that modification is just having the user clarify what they actually want done in the duplicate key situation.

You'd like to see the hash table paradigm broken in the table file and add the exceptions to the code so the internal hash table never sees them. The disadvantage is the chance behavior and increased code complexity. I understand you could define the behavior, but that would further require code. It would also lock the duplicate situation behavior. My method allows the user to define via table modification what should be done.

Do you agree with the advantage vs. disadvantage analysis of both?
 
Tauwasser
Peasant
*
Offline


Evil Impersonator

Posts: 14
Re: 'Standard' Table File Format
Reply #42 - Nov 12th, 2010 at 10:01am
 
Nightcrawler wrote on Nov 11th, 2010 at 9:50am:
Really? What languages? All the .NET languages, C++, and Java don't.

I already showed above that VB.NET, C++.NET and C#.NET do support this!

MSDN HashTable

Quote:
You can also use the Item property to add new elements by setting the value of a key that does not exist in the Hashtable; for example, myCollection["myNonexistentKey"] = myValue. However, if the specified key already exists in the Hashtable, setting the Item property overwrites the old value. In contrast, the Add method does not modify existing elements.

So while the add method does not do this, it works as described in my example above using the item property! No more code is needed!

For C++ see here: Operator [].

Quote:
If x matches the key of an element in the container, the function returns a reference to its mapped value.

If x does not match the key of any element in the container, the function inserts a new element with that key and returns a reference to its mapped value.

Java: HashMap#put()

Quote:
Associates the specified value with the specified key in this map. If the map previously contained a mapping for this key, the old value is replaced.

All of the main programming languages, which you mentioned, support this scenario already!
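
The two behaviors being argued about can be put side by side in a short Python sketch. Python's own `dict` assignment replaces the value of an existing key; the `strict` flag below is a hypothetical switch that imitates an `Add()`-style method which refuses duplicates.

```python
def load_text_to_hex(entries, strict):
    """Build the text -> hex map used for insertion from (hex, text) pairs."""
    text_to_hex = {}
    for hex_seq, text in entries:
        if strict and text in text_to_hex:
            # Add()-style semantics: a key that already exists is an error.
            raise KeyError(f"duplicate text sequence: {text!r}")
        # Indexer / put()-style semantics: assignment replaces the old value.
        text_to_hex[text] = hex_seq
    return text_to_hex


entries = [("3B", "Test"), ("7B", "Test")]

print(load_text_to_hex(entries, strict=False))  # {'Test': '7B'}

try:
    load_text_to_hex(entries, strict=True)
except KeyError as err:
    print("rejected:", err)
```

In the overwriting mode the last entry read wins, so the result depends on table order but is fully determined by it; in the strict mode the same table is rejected for insertion.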

Nightcrawler wrote on Nov 11th, 2010 at 9:50am:
There are other ways you can accomplish the map and allow duplicates, but why make it more difficult, be less efficient, and likely require more code? A basic hash map is all that is needed. Nearly all implementations I know of don't like duplicate keys.

I have just shown that nothing you just said holds. It does not require more code, it works with a simple HashMap type in all these languages, it does not duplicate keys, it's not ambiguous and it's not less efficient!

Nightcrawler wrote on Nov 11th, 2010 at 9:50am:
I don't like that at all. No behavior should be undefined.

It's not an undefined process. The length of the inserted hex sequence is determined, as mentioned above, by how the programmer codes his routine (first and foremost) and by the order in which table entries are read in a simple implementation. There is nothing undefined about this. If the table mapping is correct, insertion will yield proper data, just not optimal data size-wise.

Nightcrawler wrote on Nov 11th, 2010 at 9:50am:
Your code illustrates the extra processing required to avoid trying to add duplicate keys to the dictionary.

No, my code shows how easy it is to select the shortest hex sequence for any given text sequence in the table!

The no-brain simple solution is the following (which I also mentioned in my last post):

Code:
        'Regular expression for matching
        Dim regObj As Regex = New Regex("^([a-fA-F0-9]{2,})=(.*)$")
        Dim matchObj As Match = Nothing
        'ReadLine, init to != Nothing
        Dim readLine As String = ""
        Dim hexGroup As String = Nothing
        Dim textGroup As String = Nothing

        'Prep hashtables, case-sensitive
        hexTextHashTable = New Dictionary(Of String, String)(StringComparer.InvariantCulture)
        textHexHashTable = New Dictionary(Of String, String)(StringComparer.InvariantCulture)

        Do Until readLine Is Nothing

            readLine = reader.ReadLine()

            'Don't care about empty lines
            If (Not String.IsNullOrEmpty(readLine)) Then

                'Match using regEx
                matchObj = regObj.Match(readLine)
                'Get at least two groups, or skip!
                If (matchObj.Success) Then

                    'Match Group 0 is entire regex Match
                    'Match Group 1 is hexadecimal
                    'Match Group 2 is text

                    'Make sure we got an even number of hex digits
                    hexGroup = matchObj.Groups(1).Value.ToUpperInvariant
                    textGroup = matchObj.Groups(2).Value
                    If hexGroup.Length Mod 2 <> 0 Then Continue Do

                    'Add to tables
                    'Add throws on a duplicate hex key; the indexer silently overwrites a duplicate text key.
                    hexTextHashTable.Add(hexGroup, textGroup)
                    textHexHashTable(textGroup) = hexGroup

                End If

            End If

        Loop
 


Notice how duplicate keys on the hexadecimal side will throw an exception, while they won't throw an exception on the text-side. It doesn't get easier than that! And the insertion behavior is clearly not undefined.
This implementation will always insert for any text sequence the appropriate hex sequence that was specified as the last occurring entry of that text sequence in the table.
Doesn't seem so undefined to me. It's just not guaranteed to be optimal!

Nightcrawler wrote on Nov 11th, 2010 at 9:50am:
Instead, I think the table should follow a hash map from a conceptual point of view. Allow duplicate values, but not duplicate keys.

Which I have been arguing for all along, because you want to disallow it. And I have gone the extra mile and have shown how programs using the standard language implementation of a hashmap (typesafe or not) can easily and swiftly process these.

Nightcrawler wrote on Nov 11th, 2010 at 9:50am:
You can still accomplish what you want to accomplish.

Exactly, and with just one table! Also your usage of undefined to mean behavior that depends on order of operations is clearly problematic!
A simple implementation, code in VB.Net in this post, will be dependent on the order of operations while reading the table. This is not undefined behavior and doesn't trigger undefined behavior for insertion!

However, just one if-clause will make this code be optimal! See code in my previous post!

Nightcrawler wrote on Nov 11th, 2010 at 9:50am:
For inserting, you need to have ONE so it is clearly defined what 物 should map to, and would not cause any duplicate key scenario.

Which is exactly what the if-clause does. Choose the optimal entry that yields the shortest hex sequence. Why can't this be automated? If somebody wants to insert with a special hex value, he can still do that. However, Joe Shmo can use his dumping table and be happy.

Nightcrawler wrote on Nov 11th, 2010 at 9:50am:
Does that make sense? You can do exactly what you want, you just can't dump and insert with the exact same table.

Why shouldn't the user be allowed to have the comfort of dumping and inserting with the same table?

Nightcrawler wrote on Nov 11th, 2010 at 9:50am:
That is how these situations can be handled and still conform to the paradigm of a hash map. We do not need to break it and add additional exceptions in code.

And my reference implementation doesn't, fancy that! And it's still a hash table! And no unnecessary code was inserted!

Nightcrawler wrote on Nov 11th, 2010 at 9:50am:
It seems our disagreement is on how they should be handled. I really want to stick with following the paradigm of a hash map, simplify implementation, and have clearly defined behavior.

Behavior is clearly defined, the languages' hashmap implementations can be used with simple code and it's a lot more comfort for the user. What have I missed?

Nightcrawler wrote on Nov 11th, 2010 at 9:50am:
The only negative is requiring table modification for dumping vs. insertion in those cases. However, that modification is just having the user clarify what they actually want done in the duplicate key situation.

For any real scenario, as I explained earlier, the keys wouldn't map to the exact same text sequence if they weren't the exact same text sequence to begin with! So I really see no point in duplicating work for the end-user here and not choosing entries that guarantee optimal insertion lengths or, alternatively, going with a simple implementation and having the user put the shortest hex sequence towards the bottom of the map if he so desires.
If the user just cares that his script is inserted and the hex sequences really map to exactly the same text sequences, everything will be in order in any case, simple implementation or not!

Nightcrawler wrote on Nov 11th, 2010 at 9:50am:
You'd like to see the hash table paradigm broken in the table file and add the exceptions to the code so the internal hash table never sees them. The disadvantage is the chance behavior and increased code complexity.

I have shown it does not require any additional code, because many standard language hashmap implementations already work this way: assigning a value to a key that already exists just updates the entry.
The disallowed cases here should be the ones I mentioned in my last post.
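
A minimal Python sketch of that behavior, for illustration only. It is not the reference implementation from the earlier post; it assumes plain `HEX=text` lines like the ones used in this thread, and `load_table` is a hypothetical name. Duplicate hex sequences are rejected, while a duplicate text sequence simply updates the insertion map, so the last occurrence in the file wins.

```python
def load_table(lines):
    """Build (dump_map, insert_map) from 'HEX=text' table lines."""
    dump_map = {}    # hex bytes -> text, used when dumping
    insert_map = {}  # text -> hex bytes, used when inserting
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        hex_part, text = line.rstrip("\n").split("=", 1)
        key = bytes.fromhex(hex_part)
        if key in dump_map:
            # One hex sequence with two meanings is a real conflict.
            raise ValueError(f"duplicate hex sequence {hex_part}")
        dump_map[key] = text
        # Plain assignment: a repeated text sequence just updates the entry,
        # so the last occurrence in the file wins.
        insert_map[text] = key
    return dump_map, insert_map


# Hypothetical table: "Test" is reachable through two hex sequences.
table = ["00=Test", "3B40=Test", "01=A"]
dump_map, insert_map = load_table(table)
```

With this table both hex sequences still dump as "Test", and insertion uses 3B40 because it is the later entry, which is valid but not the shortest choice.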

Nightcrawler wrote on Nov 11th, 2010 at 9:50am:
I understand you could define the behavior, but that would further require code.

Again, it doesn't require any more code and I have shown that. Also, we keep going back and forth between table file definition and ease of implementation.
I feel my proposition is true to a table file definition that was always there and has been used in many programs as well as ease of implementation!

Nightcrawler wrote on Nov 11th, 2010 at 9:50am:
Do you agree with the advantage vs. disadvantage analysis of both?

Certainly not. I see no disadvantage for the user. I see the disadvantage of Joe Schmo having to create a duplicate table when the thing he most likely wants can be accounted for in one table!

Jeez, this is such a common scenario and a good case can be made for inserting the shortest hex sequence in cases of exact text sequence matches. I see this as needlessly over-complicating matters on your side.
I see this as a really simple extension to the idea of a hashmap and obviously so do major programming language APIs!
It seems you just don't want it to work, just because.

cYa,

Tauwasser
 
Nightcrawler
Re: 'Standard' Table File Format
Reply #43 - Nov 12th, 2010 at 7:22pm
 
Quote:
I already showed above that VB.NET, C++.NET and C#.NET do support this!


You've shown the behavior of those implementations when you try to stick a duplicate key in there. Since hash tables do not allow duplicate keys, those implementations replace. That still does not change the fact that hash maps do not allow duplicate keys to be stored.

Quote:
I have just shown that nothing you just said holds. It does not require more code, it works with a simple HashMap type in all these languages, it does not duplicate keys, it's not ambiguous and it's not less efficient!


Correct, it does not duplicate keys because as I've said, a HashMap does not allow for duplicate keys. When you try to add one, it updates it.

Quote:
It's not an undefined process. The length of the hex sequence is determined, as mentioned above, by how the programmer codes his routine (first and foremost) and by the order in which table entries are read in a simple implementation. There is nothing undefined about this. If the table mapping is correct, insertion will yield proper data, just not optimal data size-wise.


You just said in your last post, "So then the insertion process would be determined by chance:". Anything determined by chance is undefined.

Quote:
Notice how duplicate keys on the hexadecimal side will throw an exception, while they won't throw an exception on the text-side. It doesn't get easier than that! And the insertion behavior is clearly not undefined.
This implementation will always insert for any text sequence the appropriate hex sequence that was specified as the last occurring entry of that text sequence in the table.
Doesn't seem so undefined to me. It's just not guaranteed to be optimal!


Yes, I understand. I'm not sure I like table file order dictating what a value should be mapped to. Order should serve no purpose in a hash map. Secondly, if that last occurring entry is undesirable, you'd still have to alter your table to make it work the way you want. In that case, you still need an altered/second table, which would be no different than my solution.

Lastly, it would be undefined behavior unless it was explicitly put in the table standard that the last occurring entry takes precedence when there is a duplicate key conflict in the table. Is that what you're proposing?

Quote:
Why shouldn't the user be allowed to have the comfort of dumping and inserting with the same table?


They should be if it's possible. In this case, it's only possible if you rely on a somewhat obscure and non-intuitive (in my opinion) rule that the last occurring entry takes precedence. An end user might well wonder what would happen when the same text sequence is mapped to different hex sequences. They'd have to go looking for the specification of this case. With the way I had it, it was completely clear. Only the one you want is allowed. No guesswork. If the user wants to insert, the user clearly tells it what to map to. No reliance on a special rule like the last occurrence.

Quote:
Again, it doesn't require any more code and I have shown that. Also, we keep going back and forth between table file definition and ease of implementation.

I feel my proposition is true to a table file definition that was always there and has been used in many programs as well as ease of implementation!


Yes, I concede that the item operator requires no additional code. I did not think about using the implementation in that manner. I'm glad you brought that to my attention. I am very used to using Dictionary.Add, which throws an exception on a duplicate key.

I do go back and forth between definition and ease of implementation. It's important to me. First, I would never develop a standard I didn't feel comfortable implementing. If I don't want to program it, I wouldn't bother working on and releasing it. Second, many in our community are struggling self-taught programmers. They're lucky to be able to program anything at all. So, if this has any shot of being adopted at all, it needs to be simple enough that some of those guys might be able to grasp it (think IPS vs. other patching formats). I fear that as soon as we added table switching, though, it probably became out of reach for many of those people anyway. So, I am struggling to balance what's best, definition, ease of implementation, and my own personal preference (I'm shepherding this, so I had better like it). It's a tough balancing act for sure.



P.S. I'm off to Hawaii tomorrow. I will give this further consideration and thought on the plane. I won't be able to respond further until the 25th or so.

ROMhacking.net - The central hub of the ROM hacking community.
 
Tauwasser
Re: 'Standard' Table File Format
Reply #44 - Nov 13th, 2010 at 11:02am
 
Nightcrawler wrote on Nov 12th, 2010 at 7:22pm:
You've shown the behavior of those implementations when you try to stick a duplicate key in there. Since hash tables do not allow duplicate keys, those implementations replace. That still does not change the fact that hash maps do not allow duplicate keys to be stored.


And I never argued that they would allow duplicate keys. You came up with that. Surely they cannot map identical keys to different values, however, "adding" (or appending or whatever an item-operator would be) will work just as well for duplicate keys as it does for unique keys.

Nightcrawler wrote on Nov 12th, 2010 at 7:22pm:
You just said in your last post, "So then the insertion process would be determined by chance:". Anything determined by chance is undefined.


Bad wording. What was really meant was that it is dependent on the order the table file is read. I don't support using this naive solution though, I merely argue that it works well for people that just want to get their script inserted, who don't care about optimal script size.

Nightcrawler wrote on Nov 12th, 2010 at 7:22pm:
Yes, I understand. I'm not sure I like table file order dictating what a value should be mapped to. Order should serve no purpose in a hash map.


Therefore I propose to take the optimal solution: The shortest hex sequence for any given text sequence should be mapped from text to hex. This can be done with a one-line if-statement.
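
A minimal sketch of that rule in Python, under the assumption that table entries arrive as (hex bytes, text) pairs; `build_insert_map` is a hypothetical helper, not part of any existing tool. An entry only replaces an existing one if it is strictly shorter, so the result does not depend on where the shortest entry sits in the file.

```python
def build_insert_map(entries):
    """entries: iterable of (hex_bytes, text) pairs in any order."""
    insert_map = {}
    for hex_bytes, text in entries:
        current = insert_map.get(text)
        # The one-line rule: only replace an entry if the new one is shorter.
        if current is None or len(hex_bytes) < len(current):
            insert_map[text] = hex_bytes
    return insert_map


# Hypothetical entries: "Test" maps to a two-byte and a one-byte sequence.
entries = [(b"\x3b\x40", "Test"), (b"\x00", "Test"), (b"\x01", "A")]
insert_map = build_insert_map(entries)
```

Note that if two hex sequences for the same text have equal length, this sketch keeps whichever was read first, so that one case is still order-dependent.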

Nightcrawler wrote on Nov 12th, 2010 at 7:22pm:
Secondly, if that last occurring entry is undesirable, you'd still have to alter your table to make it work the way you want. In that case, you still need an altered/second table, which would be no different than my solution.


Why would there be undesirable entries in the table to begin with? Surely, if I don't want 0x3B dumped as "Test", I would not include it?
If "undesirable" is supposed to mean that one hex sequence (optimal or not) must not be chosen for insertion of "Test", then this really boils down to my argument from before:
The two "Test" strings are not identical and therefore the table should not map two different hex sequences to "Test" to begin with.
This is therefore really the responsibility of the guy making the table file.
A hex sequence that doesn't print the exact same data as another hex sequence should not be included in the table file as mapping to the same text sequence. This is exactly the same case as the following:

Code:
00=0
01=0 



Suppose 0x01 in the game actually prints "1", not "0". If I now were to insert "0", I could (and indeed should) choose the optimal hex sequence length, and here both entries have that length, so either could be chosen. However, if 0x01 does not mean exactly what 0x00 means, the table is invalid in itself and no further argument needs to be presented IMO.
If a table file faithfully represents what the ROM maps values to and from, everything that is inserted will be valid.

Nightcrawler wrote on Nov 12th, 2010 at 7:22pm:
Lastly, it would be undefined behavior unless it was explicitly put in the table standard that the last occurring entry takes precedence when there is a duplicate key conflict in the table. Is that what you're proposing?


I propose putting in there that for any duplicate text sequence one should try to insert the optimal-length hex sequence. However, really, not putting a guideline in there will not hurt from a practical point of view either, as implementations will still work and produce valid data, presuming the table file is correct.

Nightcrawler wrote on Nov 12th, 2010 at 7:22pm:
In this case, it's only possible if you rely on a somewhat obscure and non-intuitive (in my opinion) rule that the last occurring entry takes precedence. An end user might well wonder what would happen when the same text sequence is mapped to different hex sequences. They'd have to go looking for the specification of this case. With the way I had it, it was completely clear. Only the one you want is allowed. No guesswork.


I think minimizing the script size is not guesswork but a good (if not optimal) heuristic for many games on older platforms, since this is a linear problem (so minimized length of each constituent of the data will result in minimized length of the whole data). Of course, if the user happens to want to insert a longer sequence, he can still edit the table file for insertion. However, this is likely not the case for 95% of all users.
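
A small worked example of the linearity argument, sketched in Python. It assumes the script has already been split into table entries, so only the choice of hex sequence per entry is open; the hex values and the `encoded_size` helper are made up for illustration.

```python
def encoded_size(tokens, insert_map):
    """Total byte count when each token is encoded independently."""
    return sum(len(insert_map[token]) for token in tokens)


tokens = ["Test", " ", "Test"]

# Same text sequences, different hex choices for "Test".
last_wins = {"Test": b"\x3b\x40", " ": b"\x20"}  # two-byte entry came last
shortest = {"Test": b"\x00", " ": b"\x20"}       # one-byte entry chosen

print(encoded_size(tokens, last_wins))  # 2 + 1 + 2 = 5
print(encoded_size(tokens, shortest))   # 1 + 1 + 1 = 3
```

Because the total is just a sum over entries, picking the shortest sequence for each entry cannot produce a larger total than any other choice for the same split. How the script is split into entries in the first place is a separate question that this sketch does not address.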

Nightcrawler wrote on Nov 12th, 2010 at 7:22pm:
I do go back and forth between definition and ease of implementation. It's important to me. First, I would never develop a standard I didn't feel comfortable implementing. If I don't want to program it, I wouldn't bother working on and releasing it. Second, many in our community are struggling self-taught programmers. They're lucky to be able to program anything at all. So, if this has any shot of being adopted at all, it needs to be simple enough that some of those guys might be able to grasp it [...].


First off, I think all programmers are self-taught. I have yet to experience a programming lesson where I actually see people learning stuff. Most of the time at uni, you either are already a programmer and don't get anything out of the hundredth explanation of data types, or you're a newbie and don't get enough experience out of the programming project for it to do something for you...

So I think being self-taught is not the problem here. However, I have often been exasperated at how unwilling, or maybe inept, some people are when it comes to reading the API. I have seen people reinvent the wheel so many times when literally one swift look at the API would have solved their programming misery and saved them days of work by using an API implementation.

I think I know what you're getting at. There are several options with the propositions as they are on the table:
  • Don't specify how this problem is to be solved. A naïve implementation will be correct and work.
  • Specify that the table must be read from top to bottom and that new values for keys update old values in the insertion hash map. This gives the user control, since he can rearrange his table accordingly.
  • Specify that the new values for keys update old values for the insertion hash map iff the new values are shorter than the old values. This will be optimal behavior for insertion as the resulting script will be inserted with the smallest size possible. If a user does not want this to happen, which should be the exception presuming their table files are correct to begin with, they can still create different tables for dumping and inserting and have total control.

I would go with the last option, simply because it seems to cover the most reasonable use case of all of them.

Nightcrawler wrote on Nov 12th, 2010 at 7:22pm:
I fear as soon as we added Table switching though, it probably became out of reach for many of those people anyway.


Well, not necessarily. With .NET LINQ queries over arrays and data tables, it should really be a breeze to find the shortest hexadecimal sequence for a given string out of all tables. I can't say for C++, since I haven't programmed in it long enough to have been confronted with querying data tables.
There are several libraries out there for Java which add LINQ-like support to it, so collections can be searched.
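
The same kind of query can be sketched in Python with a comprehension instead of LINQ. This is only an illustration: the table names, the `shortest_hex` helper, and the layout (one hex-to-text dictionary per table) are assumptions, and the sketch ignores whatever bytes a real game would need to switch tables.

```python
def shortest_hex(text, tables):
    """Return (table_name, hex_bytes) for the shortest encoding of text,
    or None if no table contains it."""
    candidates = [
        (name, hex_bytes)
        for name, table in tables.items()
        for hex_bytes, entry_text in table.items()
        if entry_text == text
    ]
    if not candidates:
        return None
    # Pick the candidate with the fewest bytes.
    return min(candidates, key=lambda pair: len(pair[1]))


# Hypothetical tables, each mapping hex bytes to text.
tables = {
    "main": {b"\x3b\x40": "Test", b"\x01": "A"},
    "kanji": {b"\x07": "Test"},
}
result = shortest_hex("Test", tables)
```

Here `result` names the table holding the one-byte entry; a real inserter would also have to weigh the cost of switching to that table.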

Nightcrawler wrote on Nov 12th, 2010 at 7:22pm:
P.S. I'm off to Hawaii tomorrow.


Happy holidays! Chill out and enjoy the sweet life while it lasts.

cYa,

Tauwasser