Nightcrawler's Message Board
http://transcorp.romhacking.net/forum/YaBB.pl
Rom Hacking/Translation Board >> Project - Table File Standard
http://transcorp.romhacking.net/forum/YaBB.pl?num=1273691610

Message started by Nightcrawler on May 12th, 2010 at 3:13pm

Title: Project - Table File Standard
Post by Nightcrawler on May 12th, 2010 at 3:13pm
During development of some of my own tools, and examination of existing tools such as Cartographer, Atlas, and Hexposure, a silly thought occurred to me: why don't I start crusading for some sort of standardization of the table file format and see what comes of it? At worst, I'd have a written standard that I follow for all of my tools. With a little luck, a few people will jump on board and we can take a small evolutionary step forward in program compatibility and table feature support. As luck would have it, I did find some interested parties, the most notable being Klarth, the author of the insertion utility Atlas. It's been tossed around for quite some time now, but it will be worth the wait! Read on!

There's currently no real standard format. There are quite a few differences from utility to utility in how line breaks, end tokens, linked entries, hiragana/katakana, control codes, bookmarks, etc. are handled in table files. I thought it would be a good idea to create a standard for table files going forward, so that they are interoperable amongst utilities without change. Obviously, as much backwards compatibility as possible would also be a goal. However, we also need to take aim at taking an evolutionary step forward and enhancing our feature set. It's always a tough balance, but I think the result is approaching something nice. :)

The document acts as a reference to the file format, an explanation for newcomers, and tips for programmers. I don't think we've ever had anything like this on the subject before. :)

Any thoughts on this?

5 July 2016 DRAFT

Title: Re: 'Standard' Table File Format
Post by KaioShin on Jun 3rd, 2010 at 11:04am
It sounds like a great idea, though I think there is one big problem: if you really want to fix all the problems of the old format mayhem, it'll be impossible to maintain compatibility with the old format. At least I can't see how it would work. One of the biggest hurdles to begin with is SJIS vs Unicode. A modern table format just shouldn't use SJIS, but that's what the old tables mostly used. Though it was never specified anywhere that the file needs to be encoded in SJIS (IIRC), almost all old tools expect it in that encoding and will crash on Unicode files.

Any thoughts on how to resolve that?

Title: Re: 'Standard' Table File Format
Post by Nightcrawler on Jun 3rd, 2010 at 11:26am
Table file encoding never being specified is just half the cause of the problem. The other half is the complete disregard for encoding support in any of the utilities. Now, I certainly understand why. Until recently, supporting multiple encodings, especially Unicode, in your utility was difficult. It's much simpler today with advancements such as .NET.

It makes sense to use UTF-8 for the table file encoding standard. With the many languages used for translation these days, its backwards ASCII compatibility, and its wide adoption, it seems like the obvious choice. I'm not sure it would make sense to use anything else.

However, I will say I plan to support the most common encodings in my dumper. Probably just UTF-8, S-JIS, EUC-JP, and ASCII. That should help the cause a little bit. Though there's the school of thought that the clean break is better and backwards compatibility just encourages people to hang on to the old. It's always a fine line between ushering in a new standard and getting people to use your stuff so it takes off. It's probably best to start with support and take it out later or something.

Anyway, that's more for the dumper. UTF-8 makes sense to declare as the encoding of choice for the table file format. However, some might argue that the table file format doesn't need an encoding specified and it's up to the dumping/inserting utility to dictate. I think UTF-8 it is, though, since we need to start ensuring compatibility. You shouldn't have to alter your table every time you use a different utility to bend to its individual will.

Title: Re: 'Standard' Table File Format
Post by KaioShin on Jun 3rd, 2010 at 2:01pm

Nightcrawler wrote on Jun 3rd, 2010 at 11:26am:
I think UTF-8 it is, though, since we need to start ensuring compatibility. You shouldn't have to alter your table every time you use a different utility to bend to its individual will.


Couldn't agree more.

If a new format was created from scratch, have you considered making it XML-based? Mandatory fields for each entry would be the table number and the value; optional additional attributes can be used for things like bookmarks. And the value field can be of any datatype, so it wouldn't be a problem to put pointers, letters, or strings in them. The parsing would be done completely automatically in any language with even basic XML functions, so it would be very easy to implement too. If one has to parse a text file manually, it's always a hassle to work out where one entry begins and ends. Depending on how you program it, even an empty line can crash the parser, and there are a lot of pitfalls for newbie programmers. My programs usually detected the end of an entry by linebreak, but even that can cause problems, for example with Unix-style linebreaks being different, and it doesn't allow for table values that contain newlines themselves (not common, but who knows, might come in handy). XML files would basically parse themselves and allow for pretty much anything one might need.
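
Just to illustrate, a sketch of what such a table might look like (the element and attribute names here are invented, not a worked-out schema):

Code:
<?xml version="1.0" encoding="UTF-8"?>
<table>
  <!-- hex and value would be mandatory; anything else optional -->
  <entry hex="00" value="A"/>
  <entry hex="0001" value="Seven"/>
  <entry hex="FF" value="&lt;end&gt;" bookmark="intro"/>
</table>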

Whatever way it'll go, to promote such a format it would be a good idea to have a reference implementation in the form of a DLL file with the most important functions. That way, even people who don't want to bother with the details of the format will be able to incorporate support into their tools easily.

Title: Re: 'Standard' Table File Format
Post by Nightcrawler on Jun 3rd, 2010 at 3:36pm
I've updated the first post with some information on all of the available table file functions I've seen in the utilities I've recently looked at and what my thoughts on them are.

I didn't really think of XML. What would you be trying to achieve with it? See, I'm approaching this with the mentality that the only things that should be in the table file are those things completely necessary for the translation from hex to text characters and vice versa. Some of the other stuff, such as bookmarks, doesn't belong in the table file in my opinion. What business does that stuff have in a hex to text translation file?

That stuff came from dumpers and inserters. And you know what? That's where it belongs, with dumpers and inserters, not in your table file. If you go in that direction, you're treating the table file as a giant general-purpose game task configuration file. You may as well write your notes in there too and add some assembly code. :P

I think you have to assign some boundaries and a purpose to what the table file is and what it should include.

Now back to XML, what ideas did you have for what you'd end up doing with it? In the event you had to make a manual change or scan the table for your own human informational or lookup purposes, using something like XML would be more cumbersome. I find myself browsing my table files often for various reasons.

Title: Re: 'Standard' Table File Format
Post by KaioShin on Jun 3rd, 2010 at 5:04pm
I only mentioned bookmarks and stuff since you brought it up. I can see how it can be considered out of the scope.

After parsing the table file, the data in the table should be in some kind of data structure that's easily accessible, right? Instead of only standardizing the physical file, why not standardize the data structure representation too? That would make creating a reference implementation that's interchangeable with other tools that use the standard even easier. And an XML file is basically just that: a physical file that also contains the data structure information. Most languages have libraries that take an XML file as a parameter and instantly give you back a tree structure, for example.

For a not-so-pro programmer who is trying to create a custom dumper for his or her game, what do you think would be easier? Parsing through the text file or parsing through a well-defined data structure? I think XML would actually be easier for the programmer, but I might be wrong. I personally hate dealing with text files; there are so many annoying small pitfalls. From what you wrote above I'm not 100% sure where you draw the line between two table entries. Just with a newline? What about the differences between Windows and Unix newline conventions? I just hate dealing with that kind of stuff. With an XML file I just search the table entry for the key "0x0A" and get back whatever data was in the value field of the document. No dealing with the underlying mechanics of the file at all.

One advantage of XML is optional attributes. You could have stuff like bookmarks if you want, and they'd be completely optional. They'd be parsed alongside the rest, the XML libraries will report whether they are present or not, and you can proceed in your program accordingly. Or you can just ignore them if you don't want to support them and they aren't harming you. In a raw text format you'd have to deal with them during parsing whether you want to support them or not, and they'd indeed quickly become unwanted baggage.

Title: Re: 'Standard' Table File Format
Post by Nightcrawler on Jun 4th, 2010 at 10:42am

KaioShin wrote on Jun 3rd, 2010 at 5:04pm:
I only mentioned bookmarks and stuff since you brought it up. I can see how it can be considered out of the scope.

After parsing the table file, the data in the table should be in some kind of data structure that's easily accessible, right? Instead of only standardizing the physical file, why not standardize the data structure representation too? That would make creating a reference implementation that's interchangeable with other tools that use the standard even easier. And an XML file is basically just that: a physical file that also contains the data structure information. Most languages have libraries that take an XML file as a parameter and instantly give you back a tree structure, for example.


I can see value in a reference implementation such as Klarth's table library or something similar. But I'm not sure I'd try to dictate what the programmer should do after parsing, only suggest and provide an easy example library to use or something. It would be difficult to declare any type of programming standard. The data structure it goes to is up to the language you use and what you're trying to do with it.


Quote:
For a not-so-pro programmer who is trying to create a custom dumper for his or her game, what do you think would be easier? Parsing through the text file or parsing through a well-defined data structure? I think XML would actually be easier for the programmer, but I might be wrong. I personally hate dealing with text files; there are so many annoying small pitfalls. From what you wrote above I'm not 100% sure where you draw the line between two table entries. Just with a newline? What about the differences between Windows and Unix newline conventions? I just hate dealing with that kind of stuff. With an XML file I just search the table entry for the key "0x0A" and get back whatever data was in the value field of the document. No dealing with the underlying mechanics of the file at all.


Depends on the language. Is there any XML support in the C++ core library or STL? I'm not sure there is. You'd probably have to go to a third party or do it yourself.

Yes, a newline is currently the differentiation. In any .NET language, you can use TextReader.ReadLine(). It works fine with both Unix and Windows newlines. I think iostream getline() works appropriately in C++ as well. You probably shouldn't be scanning for newline bytes yourself.
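
For illustration, a minimal sketch of that kind of line-based parse in C++ (names invented; entry syntax reduced to plain hex=text, so a real parser would also handle control entries and errors):

Code:
#include <fstream>
#include <map>
#include <string>

// Sketch: read "hex=text" lines into a map, tolerating both
// Windows (CRLF) and Unix (LF) line endings.
std::map<std::string, std::string> LoadTable(const char* path)
{
    std::map<std::string, std::string> table;
    std::ifstream file(path);
    std::string line;
    while (std::getline(file, line))            // splits on '\n'
    {
        if (!line.empty() && line[line.size() - 1] == '\r')
            line.erase(line.size() - 1);        // drop the CR half of CRLF
        std::string::size_type eq = line.find('=');
        if (line.empty() || eq == std::string::npos)
            continue;                           // blank or malformed line
        table[line.substr(0, eq)] = line.substr(eq + 1);
    }
    return table;
}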

Regardless, how do you MAKE your table to begin with in XML? I think making an XML table would be much more difficult than it is with simple text. Making one manually would be many times more work. You could use a table maker, but no such tool exists yet to do the job.


Quote:
One advantage of XML is optional attributes. You could have stuff like bookmarks if you want, and they'd be completely optional. They'd be parsed alongside the rest, the XML libraries will report whether they are present or not, and you can proceed in your program accordingly. Or you can just ignore them if you don't want to support them and they aren't harming you. In a raw text format you'd have to deal with them during parsing whether you want to support them or not, and they'd indeed quickly become unwanted baggage.


That certainly makes sense. Then you don't need much of a standard because it could be custom expanded by any utility (like it already is), but unsupported features don't get in the way and are just ignored.

I'm not sure I like that direction because in the end, don't we still end up with tables that aren't going to be very compatible between programs? Have we really done much then? I'm not sure. I guess we still have a base standard.

Next, does a table file really need that much of a data structure? Most entries are in dictionary form: Term A=Term B. Not much more to it. There are our control entries, but I'm approaching them in such a generic way that we have very few and don't WANT to know much about them.

Title: Re: 'Standard' Table File Format
Post by KaioShin on Jun 4th, 2010 at 1:03pm
I see where you're coming from concerning creating the table file. If it's not tool assisted it would be quite a bother...

Alright, let's stick to plain text then.

Title: Re: 'Standard' Table File Format
Post by KingMike on Jun 5th, 2010 at 1:11pm
Possibly an entry specifying a dictionary entry?
Seems kinda wasteful for the table maker to have to type out
FF00=Entry1
FF01=Entry2
FF02=Entry3
when the program could probably look up the entry.
I'm thinking a value to specify like
FF=Substring,Format,NumberOfBytesInIndexValue
where format can specify if the dictionary table is constant length or not, and also how many bytes to be read to find the index (in the above example, 1 byte).
If constant length, provide the length and the value of the padding byte.
If not constant length, provide the address of the start of the table, and the termination byte value (and then the dumper can look it up).
Maybe we could specify the hex value of the initial entry in the table, too.
Or Pascal format strings (I think that's what they're called, it's when the first byte is the string length)
(in my own program, I thought of being able to find values by pointer-table entry, but then realized it's most likely the pointers would be in sequential value anyway)

Yeah, dictionary MIGHT be something for a custom dumper, but I think it's a common enough practice that it might be worth including in a standard table format.

Title: Re: 'Standard' Table File Format
Post by Nightcrawler on Jun 5th, 2010 at 5:57pm
I'm not sure if I'm following you. Can you provide an example?

In general, the table file certainly shouldn't contain any ROM addresses, or ROM information of any kind. I'd recommend a table-making utility to make generating the dictionary table entries easier.

You have to be careful with trying to turn the table file into a configuration file for a dumper. That's really not what it should be in my opinion.

Title: Re: 'Standard' Table File Format
Post by Nightcrawler on Jun 11th, 2010 at 2:39pm
See the first post. THE FIRST DRAFT IS UP!

The document is a) a reference to the file format, b) an explanation for newcomers, and c) a helpful reference for programmers. I don't think we've ever had anything like this on the subject before!

Good thing I thrive on pain! This didn't help my elbow issues any! :P



KingMike, I've been thinking about your dictionary idea. Since you can just make a table and dump the dictionary from a game, I thought the best way to handle dictionaries is a dumper with a special mode to dump a dictionary so it can be plugged directly into a table file!

I believe I will try to add this feature to my Generic Dumper.  Should be no need to modify the file format for this. Instead, you'd just dump the dictionary and copy/paste to your new table. What do you think about THAT? :)

Title: Re: 'Standard' Table File Format
Post by Next_Gen_Cowboy on Jul 9th, 2010 at 3:18pm
Excellent! That's all I have to add, you must have been going full throttle for a while!  

Title: Re: 'Standard' Table File Format
Post by Gil Galad on Jul 14th, 2010 at 6:29am
Actually, I liked the Thingy table format the best. However, QBasic just doesn't cut it for me anymore. I can't get EUC or SJIS to display as Japanese characters. I could in Windows 98 by downloading a viewer from the NJStar site.

I talked to Bongo about supporting Thingy tables in WindHex32. I'm getting the impression that some of the features are difficult to code. I disagree, though, that some of the features of that table file format are not needed.

For example, the table marks for dakuten and handakuten can reduce your table file size and the time to make the table file. Thingy also has the ability to modify the byte after or before. These modifier tiles are commonly found in NES/FC/FDS games. Maybe other systems too.

I do agree that the bookmarks don't need to be in a table file specifically for dumping the text in a generic dumper. However, I still find it useful to use bookmarks in a hex editor so that I can easily view and jump to various sections of the ROM.

Title: Re: 'Standard' Table File Format
Post by Nightcrawler on Jul 14th, 2010 at 9:44am

Gil Galad wrote on Jul 14th, 2010 at 6:29am:
Actually, I liked the Thingy table format the best. However, QBasic just doesn't cut it for me anymore. I can't get EUC or SJIS to display as Japanese characters. I could in Windows 98 by downloading a viewer from the NJStar site.


If you want to talk about Thingy specifically, is there anything you didn't specifically mention below? Generally, everything applicable from Thingy made it in, and better implemented at that in most cases.


Quote:
I talked to Bongo about supporting Thingy tables in WindHex32. I'm getting the impression that some of the features are difficult to code. I disagree, though, that some of the features of that table file format are not needed.


It's not really about whether they're needed or useful, but rather whether they belong and can fit within the goals we need to meet. I've gone into detail on the two specific issues in question below.



Quote:
For example, the table marks for dakuten and handakuten can reduce your table file size and the time to make the table file. Thingy also has the ability to modify the byte after or before. These modifier tiles are commonly found in NES/FC/FDS games. Maybe other systems too.


This is bad on many levels. It ruins all abstraction. It ruins language independence. It requires any utility utilizing a table file to be language- and character-aware. Difficulty of implementation increases greatly. This really goes against much of what we need to accomplish here. The table file's purpose is to map hex to text and vice versa. This dakuten/handakuten handling forces the actual conversion onto the utility. Right now, we have complete isolation. All conversion is done from the table. The utility is abstracted and doesn't need to know any language or character information beyond the initial table parse special characters.

I have a very strong disagreement with doing anything of the sort. A way to handle this situation while maintaining this abstraction level and no character dependency would be welcome. If it requires a little extra table work, that's much better than the consequences of losing abstraction and utility character independence. I just threw out some possible alternatives. Really, this is a very specific case for a specific language and specific console. So, the fact that it could be done with what we have was enough for me.



Quote:
I do agree that the bookmarks don't need to be in a table file specifically for dumping the text in a generic dumper. However, I still find it useful to use bookmarks in a hex editor so that I can easily view and jump to various sections of the ROM.


It's not a matter of whether it's useful, but whether it belongs. Bookmarks are specific program settings for specific hex editors. They have nothing to do with mapping hex to text or vice versa.

The table file isn't a dumping ground for any old thing you might want to throw in it. If you think it is, we may as well store pointer information in it, the ROM filename, checksums, assembly text hacks, etc. Bookmarks belong in a game specific configuration file for the utility rather than in a table file.

The table file, as its name implies, serves a single purpose: a table mapping hex to text and vice versa. Does that make sense?

Title: Re: 'Standard' Table File Format
Post by Tauwasser on Aug 13th, 2010 at 3:54pm

Nightcrawler wrote on Jun 3rd, 2010 at 11:26am:
However, I will say I plan to support the most common encodings in my dumper. Probably just UTF-8, S-JIS, EUC-JP, and ASCII.


You forgot Big5 and HKSCS, both of which certainly belong among the most common encodings. If you plan on doing that (which might be a waste, because there are very few characters from these sets not in Unicode), at least think about language-tagging of some sort.

UTF-8 files should be required to have a BOM - which you don't specify in your document anywhere. ASCII is basically indistinguishable from UTF-8 without BOM. So are some of the other encodings.
Personally, I wouldn't guess file encodings or let the user specify. Things get mixed up and you will have to have support for determining if all characters in the file were representable in your destination encoding/codepage.
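
Detecting the UTF-8 BOM is only a few lines, for what it's worth. A sketch (names invented; the UTF-8 BOM is the byte sequence EF BB BF, i.e. U+FEFF encoded as UTF-8):

Code:
#include <fstream>

// Sketch: true if the file starts with the UTF-8 BOM (EF BB BF).
// The converse proves nothing: BOM-less UTF-8 that happens to
// contain only ASCII is byte-identical to an ASCII file.
bool HasUtf8Bom(const char* path)
{
    std::ifstream file(path, std::ios::binary);
    unsigned char bom[3] = { 0, 0, 0 };
    file.read(reinterpret_cast<char*>(bom), 3);
    return file.gcount() == 3
        && bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF;
}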

There are of course pros and cons to using XML over a plain text file. First and foremost, you won't have one-character control codes inside the table file.

Code:
@C0=3,C000
is just not as readable as
Code:
<array bytesFollowing="3" baseOffset="C000">C0</array>
no matter how you turn it.
You could of course compensate for this with longer control codes, like
Code:
array:C0=3,C000
for instance. This decision is obviously up to you, and good cases have been made both in favor of and against the proposal of using XML files. However, just think about the comfort of using an XML-based table editor in general. From a user perspective, it would only add comfort, and you yourself said we should go with the times. And the times favor taggable, extensible formats, not plain text files with hard-to-remember control sequences.

Next, I would quote your whole document in a spoiler, but this board doesn't seem to have spoilers. Generally, I would have preferred a TeX file for the presentation of this. Plain text documentation is somewhat behind the times, and your figures in there become pretty hard to understand.

  • Less screaming in 2.1.
  • 2.1 is missing a description of what happens when two entries collide, that is
    • 00=Five
    • 01=Six
    • 0001=Seven
    What happens in these instances? Which string will get dumped on the byte sequence 0x00 0x01?
    "FiveSix" or "Seven"? Historically it would be "Seven", but you would have to specify this.
  • 2.1 Define what happens on illegal sequences. Preferably, these should be ignored from a technical point of view.
  • Section 2 is the only section whose caption format does not use tabs throughout. Also, it would be preferable if 2.2 were "Regular entries" and 2.3 not restricted to being "formatting".
  • For escape sequences in 2.3. Don't use /r if it isn't what /r is in most programming languages. Historically, /n is newline while /r is carriage return. Defining your own escape sequences is good, but this labeling is misleading.
    Also, I do not understand the difference between /r and /n and your example is lacking.
  • Is there a reason /r and /n have spaces after them while /t has a tab after it in the table?
  • You lack documentation on how to insert / as a literal. This would ideally be through double escaping // for literal /.
    All other sequences should be invalid and should be ignored for future compatibility. Example why: TBL v1.0 doesn't support /y so some people might decide to turn this into literal / literal y. TBL v1.1 supports control sequence /y for something specific. You just broke upwards compatibility. Better to reserve all control sequences and then have nobody accidentally use them in his dumps.
  • You mix single quotes '' and double quotes "" to mean the same and different things in different parts of the document. For instance, you talk about '=' (which would be traditional notation for a character). Yet (bold mine):
    Quote:
    [One] can use script formatting values like '\n' to do something[...]
    This should be double quoted, because you're talking about the string "\n" and not the character '\n' as a newline character in the actual table file.
  • 2.3. Make a better figure, explain "//" in dumps (is this Atlas format, etc). Make examples for all control codes.
  • 2.5. Make explicit what values are allowed as "label" [commas obviously aren't; are "formatting control codes"?]. Also, you didn't give a name to the "label" part; you should. If you want to be a libertarian about the hexadecimal format, at least talk about insertion problems (the string might be representable with table entries!) and give a properly escaped sequence. If you build a reference with a bad example, many other peeps will still follow it.
  • Array entries and table switching should be combined. First off, the two are abstractly speaking the same and secondly, this seems specifically tailored to Han characters.
    Let me elaborate: Arrays can be thought of as table switching for the next N bytes and be implemented much in the same manner as table switching itself. While table switching is usually not limited by character number, it can be thought of as just being so for array entries. I therefore propose:
    • Unify the two. Drop the base offset for both, make the number of bytes optional.
    • If possible, include an option to switch back to the old table once an invalid entry is recognized (I have seen games do just this).
    • Don't limit yourself to two tables. You can have more tables than that if you implement arrays as table switching.
    • See the added benefit of not having assumed a format for the array: your format was specifically tailored for 1-byte entries, where it now can have variable-size and multi-byte entries just like tables. Of course the old idea is still possible, too.
    • Base offsets can be accomplished using different tables and don't assume an entry size either.
  • 3.1. assumes a dictionary format. While two bytes are common, there can be all sorts of dictionaries. I would personally therefore drop the bit about how to dump them, since this seems to make normative claims when it really shouldn't.
  • If you are talking about language specific problems, explain a bit about them before. Handakuten and dakuten are not common lingo for everybody. Also, it seems fairly relevant to point to this example in 2.1 or at least 2.0
  • 4.1. add Macintosh and CR only usage and possibly check if libraries can handle them so as to give good advice
  • 4.3.4 is missing a word
  • 2.3. Fix "the the"


This seems like a really good draft so far. However, you talk a whole lot about globalization efforts, yet you neglect complex script shaping.

You should also, in that regard, forbid certain sequences that UTF-8 may contain, such as LTR and RTL marks.

Consider Arabic languages. We have seen some share of these translations on RHDN some time ago. Please refer to Chapter 08 of the Unicode standard and Technical Report 09, section 3.5 Shaping to get a basic understanding of what is required.

Basically, Arabic will use at least (though I'm not sure it's limited to) four different forms per letter depending on their position inside words. There will usually be a need to implement these forms as different characters in NES ROMs, for example (since nobody will probably be writing a shaping engine on the NES anytime soon). There should be a way to select each of these forms in the table file somehow and give them distinct string representations.

Now, you may say that these situations could be handled by simply putting "alif4" into the script, for example. However, as alif and 4 don't combine, you will possibly lose ZWJ and ZWNJ (Zero Width (Non) Joiner) in this process if you write the string by hand.

The same thing would possibly hold true for some Indian scripts such as Devanagari, so it's not - generally speaking - a language specific problem. There should be some standard way to address different glyph shapes and assign them different hexadecimal values.

If not, at least a viable approach should be given how to circumvent this issue (like in the case with Japanese).

cYa,

Tauwasser


Nightcrawler: I pulled the old click Modify instead of Reply trick. I think I recovered your full post.  :-[

Title: Re: 'Standard' Table File Format
Post by KingMike on Aug 14th, 2010 at 9:50pm
As to handakuten, it seems more common for games that treat them as separate bytes to place the dakuten AFTER the main character, not before.
So, it would be like:

60=カ
607F=ガ

Title: Re: 'Standard' Table File Format
Post by Nightcrawler on Aug 17th, 2010 at 11:12am

Tauwasser wrote on Aug 13th, 2010 at 3:54pm:
You forgot Big5 and HKSCS, both of which certainly belong among the most common encodings. If you plan on doing that (which might be a waste, because there are very few characters from these sets not in Unicode), at least think about language-tagging of some sort.


They're not that common in our community, as far as I'm aware. UTF-8 is the default and primary encoding. Anything I add beyond that will be an extra gift under the tree.


Quote:
UTF-8 files should be required to have a BOM


Because the BOM is optional in many applications that work with UTF-8 text files, and 95% of the userbase won't even know what a BOM is (many don't even know what UTF-8 is), I do not agree with requiring it. One aim of the whole thing is also as much backward compatibility as possible while still moving forward. Compatibility with older ASCII tables, even if technically 'wrong', is desirable. Asking too much is a sure way to get our community to do nothing at all. Baby steps. ;)


Quote:
There are of course pros and cons to using XML over a plain text file.


There's certainly many benefits to XML. However, the aim of this was to standardize what we have, move us forward, but still maintain as much compatibility as we can with existing table files and utilities. And again, I don't think a clean break to something completely incompatible and new works in our community. If we were to ever move to XML, I'd suggest doing so with a converter built into applications that could convert from the old to the new format.


Quote:
Generally, I would have preferred a TeX file for the presentation of this. Plain text documentation is somewhat behind the times, and your figures in there become pretty hard to understand.


There's something to be said for a UTF-8 plain text document defining a file format using UTF-8 plain text.  Can't say I know much about TeX. Seems unnecessary for this document. PDF in general is a future option for bookmarks and nicer presentation. Also, there's the time factor. I probably won't want to do the rework for another format.

Good list here. Many items were addressed. A few remain for discussion.

  • Less screaming in 2.1.
  • 2.1 is missing a description of what happens when two entries collide, that is
    • 00=Five
    • 01=Six
    • 0001=Seven
    What happens in these instances? Which string will get dumped on the byte sequence 0x00 0x01?
    "FiveSix" or "Seven"? Historically it would be "Seven", but you would have to specify this.

  • 2.1 Define what happens on illegal sequences. Preferably, these should be ignored from a technical point of view.
    → Shouldn't the decision to generate error or ignore be that of the utility and not the file format?
  • Section 2 is the only section whose caption format does not use tabs throughout. Also, it would be preferable if 2.2 were "Regular entries" and 2.3 not restricted to being "formatting".
    → 1. Not sure what you mean by the caption format. 2. Normal/Regular are synonyms. Either would be OK, but I think 'normal' is slightly more appropriate here as it applies to a standard. 3. Agreed.
  • For escape sequences in 2.3. Don't use /r if it isn't what /r is in most programming languages. Historically, /n is newline while /r is carriage return. Defining your own escape sequences is good, but this labeling is misleading.
    → Agreed. It is like this for compatibility with Cartographer, ROMJuice, and Atlas. Will discuss with other utility authors.
    Also, I do not understand the difference between /r and /n and your example is lacking.
  • Is there a reason /r and /n have spaces after them while /t has a tab after it in the table?
  • You lack documentation on how to insert / as a literal. This would ideally be through double escaping // for literal /.
    All other sequences should be invalid and should be ignored for future compatibility. Example why: TBL v1.0 doesn't support /y so some people might decide to turn this into literal / literal y. TBL v1.1 supports control sequence /y for something specific. You just broke upwards compatibility. Better to reserve all control sequences and then have nobody accidentally use them in his dumps.

  • You mix single quotes '' and double quotes "" to mean the same and different things in different parts of the document. For instance, you talk about '=' (which would be traditional notation for a character). Yet (bold mine):
    Quote:
    [One] can use script formatting values like '\n' to do something[...]
    This should be double quoted, because you're talking about the string "\n" and not the character '\n' as a newline character in the actual table file.

  • 2.3. Make a better figure, explain "//" in dumps (is this Atlas format, etc). Make examples for all control codes.
  • 2.5. Make explicit what values are allowed as "label" [commas obviously aren't; are "formatting control codes"?]. Also, you didn't give a name to the "label" part; you should. If you want to be a libertarian about the hexadecimal format, at least talk about insertion problems (the string might be representable with table entries!) and give a properly escaped sequence. If you build a reference with a bad example, many other peeps will still follow it.
    → Good point on being representable by table entries. Will discuss.
  • Array entries and table switching should be combined. First off, the two are abstractly speaking the same and secondly, this seems specifically tailored to Han characters.
    Let me elaborate: Arrays can be thought of as table switching for the next N bytes and be implemented much in the same manner as table switching itself. While table switching is usually not limited by character number, it can be thought of as just being so for array entries. I therefore propose:
    • Unify the two. Drop the base offset for both, make the number of bytes optional.
    • If possible, include an option to switch back to the old table once an invalid entry is recognized (I have seen games do just this).
    • Don't limit yourself to two tables. You can have more tables than that if you implement arrays as table switching.
    • See the added benefit of not having assumed a format for the array: your format was specifically tailored for 1-byte entries, where it now can have variable-size and multi-byte entries just like tables. Of course the old idea is still possible, too.
    • Base offsets can be accomplished using different tables and don't assume an entry size either.
    → I like it; however, I think there is difficulty in implementation, or at least more to define. What if you have two table switches in the same table? One for a kanji array, one for a hiragana/katakana switch. How do you define that? How do you see the syntax for those cases?
  • 3.1. assumes a dictionary format. While two bytes are common, there can be all sorts of dictionaries. I would personally therefore drop the bit about how to dump them, since this seems to make normative claims when it really shouldn't.
  • If you are talking about language specific problems, explain a bit about them before. Handakuten and dakuten are not common lingo for everybody. Also, it seems fairly relevant to point to this example in 2.1 or at least 2.0
  • 4.1. add Macintosh and CR only usage and possibly check if libraries can handle them so as to give good advice
  • 4.3.4 is missing a word
  • 2.3. Fix "the the"



Quote:
This seems like a really good draft so far. However, you talk a whole lot about globalization efforts, yet you neglect complex script shaping.


I really don't know enough about Arabic languages, script shaping, or the UTF-8 intricacies associated with them. I also don't intend to take up that course of study. If you want to write up something appropriate that should be included in the document, I can look at it for potential inclusion.

I would probably say I wouldn't like to require any more work from utility creators for special support of Arabic languages. We're already asking a lot, and it's already questionable whether anyone will ever adopt this. And I'd certainly like to keep the abstraction level we have of never having to look at individual text characters after the initial table parse (even then it's limited).

Title: Re: 'Standard' Table File Format
Post by Nightcrawler on Aug 17th, 2010 at 11:13am

KingMike wrote on Aug 14th, 2010 at 9:50pm:
As to handakuten, it seems more common for games that treat them as separate bytes to place the dakuten AFTER the main character, not before.
So, it would be like:

60=カ
607F=ガ


Handled the same way, I think. I updated the document to reflect that.

Title: Re: 'Standard' Table File Format
Post by DaMarsMan on Aug 17th, 2010 at 2:44pm
Hmmm I don't have many thoughts but here they are.

2.1 Encoding: UTF8 only. Moving away from the older utilities is a must. We should push the best utilities to update their source (WindHex and others). We can't let old standards hold back the community!!!! I think we both agree here.   ;D

2.2 Normal Entries: I would include something about priorities in here. (I don't know if I missed it) I know it's up to the inserter but I think Atlas does a good job with the way it does it and it is probably worth mentioning. For an explanation see this thread.
http://www.romhacking.net/forum/index.php/topic,8108.0.html

2.3 Control Codes: I don't get what is going on with "\r" here. Shouldn't someone who wanted comments after just use "\n//"?

FE=<linebreak>\n//
/FF=<end>\n\n\n//

Shouldn't this produce the same thing?

2.7 Dual Table Files: I agree with Tauwasser here. Let's not limit it to two. I've had cases where I needed multiple table files.

What if we had another table format, maybe "tbp", that was basically a table pack: one file containing multiple tables. You could use something like "TABLE=English" or something to divide them up. This could be a cool feature to load one table pack into a hex editor and flip between them. Or, load one table file into an Atlas script and jump between different parts of the table (can be useful for inserting original Japanese on untranslated parts). It's just an idea to extend it a bit.

That's really all I got. I don't think we really need documentation for other languages. The more complex this standard gets, the less likely people will want to implement it into programs. I like the part about the Japanese because the majority of games come from Japan and it makes sense. Arabic games and games with strange encoding should probably have custom inserters.

Title: Re: 'Standard' Table File Format
Post by Nightcrawler on Aug 17th, 2010 at 5:38pm

DaMarsMan wrote on Aug 17th, 2010 at 2:44pm:
Hmmm I don't have many thoughts but here they are.

2.1 Encoding: UTF8 only. Moving away from the older utilities is a must. We should push the best utilities to update their source (WindHex and others). We can't let old standards hold back the community!!!! I think we both agree here.   ;D

2.2 Normal Entries: I would include something about priorities in here. (I don't know if I missed it) I know it's up to the inserter but I think Atlas does a good job with the way it does it and it is probably worth mentioning. For an explanation see this thread.
http://www.romhacking.net/forum/index.php/topic,8108.0.html


Agreed on both. I'm not sure I understand Klarth's comment in the topic you provided. But the longest entry will take preference and handle that situation. It's the same for the hex side.

Table:
00=Five
01=Six
0001=Seven

If a byte sequence 0x00 0x01 is encountered, the string "Seven" should be mapped as the result and not any other combination regardless of table order.

Table:
12=Five
13=Six
0001=FiveSix

If the text 'FiveSix' is encountered, it will map as byte sequence $00 $01 regardless of the order it appears in the table.

That's the desired way functionally, my preference, and the way I understand Atlas is supposed to do it based on the code.
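
For illustration, a rough sketch of how a dumper might implement that longest-match rule (all names invented; byte sequences stored as raw strings, and maxKeyLen would be the byte length of the longest hex key in the loaded table):

Code:
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Sketch: hex -> text with longest-key preference, so the byte
// sequence 0x00 0x01 maps to "Seven" regardless of table order.
std::string DumpText(const std::vector<unsigned char>& data,
                     const std::map<std::string, std::string>& table,
                     size_t maxKeyLen)
{
    std::string out;
    size_t pos = 0;
    while (pos < data.size())
    {
        bool matched = false;
        size_t longest = std::min(maxKeyLen, data.size() - pos);
        for (size_t len = longest; len > 0 && !matched; --len)
        {
            std::string key(data.begin() + pos, data.begin() + pos + len);
            std::map<std::string, std::string>::const_iterator it = table.find(key);
            if (it != table.end())
            {
                out += it->second;
                pos += len;
                matched = true;
            }
        }
        if (!matched)
            ++pos;  // no entry; a real dumper would emit the raw hex here
    }
    return out;
}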


Quote:
2.3 Control Codes: I don't get what is going on with "\r" here. Shouldn't someone who wanted comments after just use "\n//"?

FE=<linebreak>\n//
/FF=<end>\n\n\n//

Shouldn't this produce the same thing?


I would think so, yes. Blame that on Cartographer. That's what it does. The redundancy didn't dawn on me. That should simplify escape codes to nothing but line breaks with the standard "\n" value.


Quote:
2.7 Dual Table Files: I agree with Tauwasser here. Let's not limit it to two. I've had cases where I needed multiple table files.

What if we had another table format, maybe "tbp", that was basically a table pack: one file containing multiple tables. You could use something like "TABLE=English" or something to divide them up. This could be a cool feature to load one table pack into a hex editor and flip between them. Or, load one table file into an Atlas script and jump between different parts of the table (can be useful for inserting original Japanese on untranslated parts). It's just an idea to extend it a bit.

That's really all I got. I don't think we really need documentation for other languages. The more complex this standard gets, the less likely people will want to implement it into programs. I like the part about the Japanese because the majority of games come from Japan and it makes sense. Arabic games and games with strange encoding should probably have custom inserters.


I'd still like to hear the syntax for this table switch idea and details on how it would be implemented and operate, based on my inline questions on the list. We don't want to get carried away and don't want to force a complicated implementation on the programmer.

Title: Re: 'Standard' Table File Format
Post by DaMarsMan on Aug 17th, 2010 at 5:51pm

Nightcrawler wrote on Aug 17th, 2010 at 5:38pm:
Agreed on both. I'm not sure I understand Klarth's comment in the topic you provided. But the longest entry will take preference and handle that situation. It's the same for the hex side.

Table:
00=Five
01=Six
0001=Seven

If a byte sequence 0x00 0x01 is encountered, the string "Seven" should be mapped as the result and not any other combination regardless of table order.

Table:
12=Five
13=Six
0001=FiveSix

If the text 'FiveSix' is encountered, it will map as byte sequence $00 $01 regardless of the order it appears in the table.

That's the desired way functionally, my preference, and the way I understand Atlas is supposed to do it based on the code


This isn't really what I was getting at... Take a look at this scenario.

12=Five
13=Six
00=FiveSix

Here we don't have a longest hex string... Should this produce 1213 or 00? I believe that, according to that thread, Atlas would output it as 1213 because of the order of the table.


Title: Re: 'Standard' Table File Format
Post by Tauwasser on Aug 18th, 2010 at 7:45am

Nightcrawler wrote on Aug 17th, 2010 at 11:12am:
Because the BOM is optional in many applications that work with UTF-8 text files, and 95% of the userbase won't even know what a BOM is (many don't even know what UTF-8 is), I do not agree with requiring it.


While this does preserve ASCII compatibility (as ASCII can be interpreted as UTF-8 in ANSI format), it shouldn't be forbidden to use a BOM, and one would also immensely help with identifying the encoding in case you want to support other encodings. At least UTFs are uniquely identifiable with their BOMs.


Quote:
There's certainly many benefits to XML. However, the aim of this was to standardize what we have, move us forward, but still maintain as much compatibility as we can with existing table files and utilities. And again, I don't think a clean break to something completely incompatible and new works in our community. If we were to ever move to XML, I'd suggest doing so with a converter built into applications that could convert from the old to the new format.


Did you at least think about my proposal of allowing longer sequences separated by a colon before the hexadecimal? Just saying that I do prefer "linebreak:FF" instead of "\FF".


Quote:
There's something to be said for a UTF-8 plain text document defining a file format using UTF-8 plain text.


This isn't the issue. Of course a plain-text-like presentation is good and can be kept with LaTeX as well.


Quote:
Seems unnecessary for this document. PDF in general is a future option for bookmarks and nicer presentation. Also, there's the time factor. I probably won't want to do the rework for another format.


Your figures and explanations would benefit greatly from this. So it doesn't seem unnecessary to me.


Quote:
  • 2.1 Define what happens on illegal sequences. Preferably, these should be ignored from a technical point of view.
    → Shouldn't the decision to generate error or ignore be that of the utility and not the file format?


You define what is supposed to happen for the sequences you do talk about, so why can't you make explicit which of these should happen if a control sequence such as "\<byte>" is encountered more than once?
You describe how tools should behave in there anyway, so why not make a standard that tells tool makers how to handle incorrect data in a standard fashion, so that a reference implementation will always be what users get from every tool.
This does not mean that tool makers cannot give the user dialogs to choose what he wants or alter behavior. It just means that when the user presses the "dump it anyway biatch" button, he'll get what you're talking about in the document.


Quote:
  • Section 2 is the only section whose caption format does not use tabs throughout.
    → 1. Not sure what you mean by the caption format.



(Done with Paint.NET for your Lulz)


Quote:
  • For escape sequences in 2.3. Don't use /r if it isn't what /r is in most programming languages. Historically, /n is newline while /r is carriage return. Defining your own escape sequences is good, but this labeling is misleading.
    → Agreed. It is like this for compatibility with Cartographer, ROMJuice, and Atlas. Will discuss with other utility authors.



Quote:
Quote:
2.3 Control Codes: I don't get what is going on with "\r" here. Shouldn't someone who wanted comments after just use "\n//"?

FE=<linebreak>\n//
/FF=<end>\n\n\n//

Shouldn't this produce the same thing?


I would think so, yes. Blame that on Cartographer. That's what it does. The redundancy didn't dawn on me. That should simplify escape codes to nothing but line breaks with the standard "\n" value.

I believe this is dealt with.


Quote:
  • Array entries and table switching should be combined. First off, the two are abstractly speaking the same and secondly, this seems specifically tailored to Han characters.
    Let me elaborate: Arrays can be thought of as table switching for the next N bytes and be implemented much in the same manner as table switching itself. While table switching is usually not limited by character number, it can be thought of as just being so for array entries. I therefore propose:
    • Unify the two. Drop the base offset for both, make the number of bytes optional.
    • If possible, include an option to switch back to the old table once an invalid entry is recognized (I have seen games do just this).
    • Don't limit yourself to two tables. You can have more tables than that if you implement arrays as table switching.
    • See the added benefit of not having assumed a format for the array: your format was specifically tailored for 1-byte entries, where it now can have variable-size and multi-byte entries just like tables. Of course the old idea is still possible, too.
    • Base offsets can be accomplished using different tables and don't assume an entry size either.
    → I like it; however, I think there is difficulty in implementation, or at least more to define. What if you have two table switches in the same table? One for a kanji array, one for a hiragana/katakana switch. How do you define that? How do you see the syntax for those cases?


You are right, of course. There is more to define. I like the following approach:

Have each table have a unique (within the set of used tables) id number.


Code:
id:TBL001


You would then only need to associate each table with this ID and on switch commands tell which table to switch to.

Within TBL001:

Code:
switchTable:F8=TBL002


Within TBL002:

Code:
switchTable:F8=TBL001


(elaborating on your Hiragana/Katakana example). As you see, this would be pretty redundant (albeit doable) for two tables only. So I suggest the table code could be optional for the simplified case of two tables. (It can be disambiguated at runtime by the dumper quite easily.)

If you load more tables without markings, the dumper might decide what to do and for instance, prompt the user into which table each code should switch.

For the kanji arrays, this would simplify to the following (again, your example elaborated upon):

In TBL001:

Code:
switchTable:C0=KANJI001
switchTable:C1=3, KANJI002


In KANJI001:

Code:
switchTable:XX=TBL001


In KANJI002 you don't need a code, since it changes back after 3 table matches (not bytes) regardless (to the last used table). This would also mean that you can set up circular table changing, TBL001 --> KANJI001 --> KANJI002 --> KANJI001 --> TBL001 --> KANJI002 --> KANJI001 etc. if you should wish for that.

The kanji tables themselves can be easily computed from the current tables by adding your offset of 0xC000 to their entries.
Notice also how there is no need for offsets of any kind.
To further simplify this process, one might designate a length of 0 matched table entries to mean "change back on first table entry not found in table".

In HIRA:


Code:
00=あ
01=い
02=う
03=HIRO
switchTable:F8=0,KATA
switchTable:F9=0,KANJI


In KATA:

Code:
00=ア
01=イ
02=ウ
switchTable:F8=0,HIRA
switchTable:F9=0,KANJI


In KANJI:


Code:
00=亜
01=意


You would start in table HIRA (By load order for instance, or by designation by user. This is a choice of the tool.)

0xF8 0x00 0x01 0x02 0xF9 0x01 0x03

HIRA --> KATA --> ア --> イ --> ウ --> KANJI --> 意 --> 0x03 fallback to KATA --> 0x03 fallback to HIRA --> HIRO.

This would make for a sound table switching routine that is a fairly easy-to-implement data table search in most programming languages (including .NET and C++ with STL).
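
A rough sketch of that search (names invented for illustration; single-byte keys only for brevity, with matches == 0 meaning "stay until the first unmatched entry" as described above):

Code:
#include <map>
#include <stack>
#include <string>
#include <vector>

// Sketch: table switching with fallback on unmatched bytes.
struct Table
{
    struct Switch { const Table* target; int matches; }; // matches == 0: until first miss
    std::map<unsigned char, std::string> entries;        // hex -> text
    std::map<unsigned char, Switch> switches;            // hex -> switchTable codes
};

std::string Dump(const std::vector<unsigned char>& data, const Table* start)
{
    struct Frame { const Table* table; int remaining; }; // remaining < 0: unlimited
    std::stack<Frame> previous;
    Frame cur = { start, -1 };
    std::string out;

    for (size_t i = 0; i < data.size(); ++i)
    {
        unsigned char b = data[i];
        for (;;)
        {
            std::map<unsigned char, Table::Switch>::const_iterator sw =
                cur.table->switches.find(b);
            if (sw != cur.table->switches.end())
            {
                previous.push(cur);                      // remember where we came from
                Frame next = { sw->second.target,
                               sw->second.matches ? sw->second.matches : -1 };
                cur = next;
                break;
            }
            std::map<unsigned char, std::string>::const_iterator e =
                cur.table->entries.find(b);
            if (e != cur.table->entries.end())
            {
                out += e->second;
                if (cur.remaining > 0 && --cur.remaining == 0 && !previous.empty())
                {
                    cur = previous.top();                // N matches used up: switch back
                    previous.pop();
                }
                break;
            }
            if (previous.empty())
                break;                                   // unmatched in root table: skip byte
            cur = previous.top();                        // miss: fall back, retry same byte
            previous.pop();
        }
    }
    return out;
}

Fed the 0xF8 0x00 0x01 0x02 0xF9 0x01 0x03 example above with HIRA as the start table, this walks exactly the chain described and outputs アイウ意HIRO.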


Quote:
I really don't know enough about Arabic languages, script shaping, or the UTF-8 intricacies associated with them. I also don't intend to take up that course of study. If you want to write up something appropriate that should be included in the document, I can look at it for potential inclusion.


Sadly, I'm not sure myself. I just know that traditional methods don't work quite well and it needs a lot of custom tools to produce the right output in the end, mostly replacing stuff with numbers etc. while propagating shaping differences along... It's really unsatisfactory.

Also, this could potentially be used for kerning pairs in VWFs. At least I have implemented a few myself, but usually they will be explicit kerning pairs, that is, they use two different byte representations for the same gfx data. This could be made available in a table format, I think.

So "VA" might produce different output than "VX" based on kerning - just like in so many modern fonts :D

cYa,

Tauwasser

Title: Re: 'Standard' Table File Format
Post by Nightcrawler on Aug 18th, 2010 at 7:48am
What I said is all still true. That's not a different scenario. When mapping in the direction of text to hex, you have a longest TEXT string key. When mapping in the direction of hex to text, you have a longest HEX string key.

Table order is immaterial. Using your example, the string "FiveSix" would be output as $00.

Make sense?

Atlas is supposed to work the same way with 'longest' keys according to the source code. See Table.h/Table.cpp.
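
For completeness, a sketch of the insert side of the same rule (names invented; hex values kept as strings purely for readability, where a real inserter would emit bytes):

Code:
#include <map>
#include <string>
#include <vector>

// Sketch: text -> hex with longest-TEXT-key preference, so "FiveSix"
// encodes as $00 rather than $12 $13, regardless of table order.
std::vector<std::string> InsertText(const std::string& script,
                                    const std::map<std::string, std::string>& table)
{
    std::vector<std::string> out;
    size_t pos = 0;
    while (pos < script.size())
    {
        size_t bestLen = 0;
        const std::string* bestHex = 0;
        for (std::map<std::string, std::string>::const_iterator it = table.begin();
             it != table.end(); ++it)
        {
            const std::string& key = it->first;   // longest matching text key wins
            if (key.size() > bestLen && script.compare(pos, key.size(), key) == 0)
            {
                bestLen = key.size();
                bestHex = &it->second;
            }
        }
        if (bestHex) { out.push_back(*bestHex); pos += bestLen; }
        else ++pos;  // unmappable character; a real inserter would raise an error
    }
    return out;
}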

Title: Re: 'Standard' Table File Format
Post by DaMarsMan on Aug 18th, 2010 at 4:26pm
Okay gotcha.  :D

Title: Re: 'Standard' Table File Format
Post by Gil Galad on Aug 21st, 2010 at 5:26am
I just read through all this. This new format could get pretty complicated, and I advise against that. But I guess some things would be needed to handle many different PC platforms and languages.

I just don't really have much to say about it right now, though. Perhaps I'll know more once this theory is put into practice and I can see how it works.

All the encoding details are slightly confusing to me. Before, I would just use SJIS or EUC and not worry about anything else, because it's simple and practical (at least for me).

I think that Cartographer had a good idea for table file switching. You have different sections for dumping various areas of the ROM and, if I remember correctly, you can assign different table files to each section you want to dump.


Title: Re: 'Standard' Table File Format
Post by Nightcrawler on Aug 23rd, 2010 at 11:36am

Tauwasser wrote on Aug 18th, 2010 at 7:45am:
While this does preserve ASCII compatibility (as ASCII can be interpreted as UTF-8 in ANSI format), it shouldn't be forbidden to use a BOM, and one would also immensely help with identifying the encoding in case you want to support other encodings. At least UTFs are uniquely identifiable with their BOMs.


Absolutely. BOM should be allowed and encouraged. It's just not required. I will add a line about this.


Quote:
Did you at least think about my proposal of allowing longer sequences separated by a colon before the hexadecimal? Just saying that I do prefer "linebreak:FF" instead of "\FF".


Nothing wrong with the idea. I brought it up with Klarth, but I think we're going to end up keeping single special character parsing for ease of programming implementation and back compatibility.


Quote:
Your figures and explanations would benefit greatly from this. So it doesn't seem unnecessary to me.


Klarth said he may pretty it up into a PDF. I just don't have the motivation now... Maybe motivation will come back after some time has passed, but document formatting isn't fun for me.


Quote:
You define what is supposed to happen for the sequences you do talk about, so why can't you make explicit which of these should happen if a control sequence such as "\<byte>" is encountered more than once?
You describe how tools should behave in there anyway, so why not make a standard that tells tool makers how to handle incorrect data in a standard fashion, so that a reference implementation will always be what users get from every tool.
This does not mean that tool makers cannot give the user dialogs to choose what he wants or alter behavior. It just means that when the user presses the "dump it anyway biatch" button, he'll get what you're talking about in the document.


I see your point. I've run it by Klarth. So far duplicate entries, empty entries, and invalid syntax should generate an error.


Quote:
(Done with Paint.NET for your Lulz)


Got it. Will fix in next revision. ;D


Quote:
I believe this is dealt with.

Yes, but because we only have one now ("\n"), we may 'cheap out' on this and not reserve all escape sequences, or allow a literal "\n" to be in the script, for ease of implementation. It's 'wrong' to do this, but requiring full escape code parsing and handling just for this is probably too much to ask of the programmer. We're probably already pushing our luck with what we have if anyone outside my small group is going to adopt this.


Quote:
You are right, of course. There is more to define. I like the following approach:


I like it. I think it's a pretty powerful feature that could cover different implementations of kanji/han, handakuten/dakuten, hiragana, katakana, and even dictionaries. There's a relatively low trade-off in programming complexity; however, it may still be a hard sell. I've pointed Klarth to your example to see if I can get him to agree. (He didn't like the Kanji Array entry to begin with.)


Quote:
Also, this could potentially be used for kerning pairs in VWFs. At least I have implemented a few myself, but usually they will be explicit kerning pairs, that is they use two different byte representations for the same gfx data. This could be made available in a table format, I think.

So "VA" might produce different output than "VX" based on kerning - just like in so many modern fonts :D


You can already handle this with the table format using explicit pairs like 12="VA", right? If so, that's probably good enough. The idea can be put in a locker until we're ready for the next generation table format in which more complicated scenarios would be on the table.

Title: Re: 'Standard' Table File Format
Post by Nightcrawler on Sep 16th, 2010 at 2:57pm
OK, a new draft is up! I've gone through everything here to date, including my conversations with Klarth, and updated accordingly. It probably needs a bit of editing, but content-wise it's all there. The only other things I may add are specifying what to do when an entry is not found and how hex output should look.

Some items addressed:

  • Byte Order Mark (BOM)
  • Hex/Text Collisions
  • Illegal Sequences/Syntax Error
  • Table Switching Section
  • Edited common situations to include table switching
  • Edited format control to "\n" only.


I admit, it's starting to get a bit unruly in straight text format. It's hard to keep consistency and readability. I will likely end up slapping it in Word and making a PDF, although it would be nice if someone else would do that part and pretty it up.

So, any other thoughts or suggestions?

Title: Re: 'Standard' Table File Format
Post by DaMarsMan on Sep 20th, 2010 at 9:53am
Okay... I've thought about the \n and here is my concern. Let's say you are dumping Japanese text and you need it commented.


Code:
//Japanese text here.<line>
//Japanese text here.<end>


You could do something like:
FE=<line>\n//

However, your text would output like this...



Code:
Japanese text here.<line>
//Japanese text here.<end>


Keep in mind on the end tag you could do...

FF=<end>\n\n//

That would fix it for every single entry besides the first. It's not too much trouble to go in and add the first // after the dump. Maybe you want something like this though...

FE=<line>\n\c{//}
FF=<end>\n\n\c{//}

Here I propose a comment system that says to add a comment to the beginning of that line for dumping and allows the user to specify which comment style to use.

I believe you have discussed before leaving this up to the actual script dumper. That is certainly an option, and I can see how this sort of thing can cause a problem. However, if you are leaving it up to the script dumper, maybe \n should be removed too. Where should the line be drawn?

Title: Re: 'Standard' Table File Format
Post by Nightcrawler on Sep 21st, 2010 at 10:37am
Yes, I don't believe the table file needs to have any concept of what a comment system is. Keep the abstraction. It's also an extremely specific post-processing behavior that would occur after table mapping. With your idea, you'd dump and hit an end token; then, depending on what it was, you'd have to go back to the beginning of the string and comment each line accordingly. I think everything we have now can be done in the mapping stage without interrupting forward flow.

I would agree with removing it entirely, as controls don't necessarily belong in the table file either, by pure definition. However, a line break is really just a character that can be used in any table line. It is arguably part of the mapping (for dumping anyway). Also, we have charted a bit into gray area with our table switching, linked entries, and \n in order to provide a high-level, standard, simple solution to things that are in every script. So we have encroached on how things should be dumped or inserted with this, but the benefits of standardizing these things and the flexibility of the solutions presented outweigh the cons.

With that said, what we have does keep a very high abstraction level regardless, and I would like to be absolute about keeping it.


Solution:

I ran into this issue developing my utility that implements this standard. I believe the solution to this is extremely simple. Any dumper worth anything would have some sort of header or template ability. Mine is user defined (allowing use of some variables) and would be output at the top of all dumped files.

Example:

Quote:
//Game Name:    GameNameHere
//Source File: $file
//Block:     $block
//Block Range:   $start - $stop
//$text


So, as you can see, that takes care of the issue.
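
For illustration, a rough sketch of how a dumper might expand such a template. The helper and the exact variable semantics are my assumption; $text would be streamed out by the dumper itself afterward.

Code:
Module HeaderDemo
    'Hypothetical helper: expands the $-variables from the template above.
    Function ExpandHeader(template As String, file As String, block As String, startAddr As Integer, stopAddr As Integer) As String
        Dim header As String = template
        header = header.Replace("$file", file)
        header = header.Replace("$block", block)
        header = header.Replace("$start", "$" & startAddr.ToString("X6"))
        header = header.Replace("$stop", "$" & stopAddr.ToString("X6"))
        Return header
    End Function

    Sub Main()
        Dim template As String = "//Source File: $file, Block: $block, Range: $start - $stop"
        Console.WriteLine(ExpandHeader(template, "game.sfc", "Block 1", &H18000, &H18FFF))
    End Sub
End Module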

Even a 'cheap' dumper could just output a single comment character or no comment character, based on an option, to be compatible.

It seems like you'd either go in that direction or go in the direction of removing it entirely. However, I'd rather not remove it entirely, as line breaks can be used in any table entry, control code, end token, etc. It is part of the mapping if you look at it like that.

Title: Re: 'Standard' Table File Format
Post by DaMarsMan on Sep 22nd, 2010 at 10:52am
I can see what you mean about making an exception for something that is almost always needed.

I would say that the best approach would probably be to have dump configuration file standards. I can see how you would have a problem with something like this though...

FE=\n<line>\n//

In this case the dump controls have to be mixed in with the table to get the proper output. Maybe if there were a dump configuration file you could just have an overwrite table character function for the control characters. With our current method, viewing these controls in a hex editor can get kind of nasty when every instance of FE is shown as the string above... An external, separate configuration file could solve some of these issues.

Title: Re: 'Standard' Table File Format
Post by Nightcrawler on Sep 22nd, 2010 at 12:02pm

DaMarsMan wrote on Sep 22nd, 2010 at 10:52am:
I can see what you mean about making an exception for something that is almost always needed.


Is it really even an exception? "\n" really represents a character that can be part of any string. It is part of the map. The exception is that it can be ignored, such as during insertion, where it's not necessary. It really depends on whether you view it as a control or as a character. I can certainly see both sides, but I view it more as a character myself.


Quote:
I would say that the best approach would probably be to have dump configuration file standards. I can see how you would have a problem with something like this though...

FE=\n<line>\n//

In this case the dump controls have to be mixed in with the table to get the proper output. Maybe if there were a dump configuration file you could just have an overwrite table character function for the control characters. With our current method, viewing these controls in a hex editor can get kind of nasty when every instance of FE is shown as the string above... An external, separate configuration file could solve some of these issues.


I don't see this as a problem. A hex editor would just ignore \n for in-editor display purposes. They do that already with line breaks and string ends, right? They don't try to actually show the line breaks, but they show up when dumped.

You can already make tables that use this character in Romjuice, Cartographer, and Klarth's Table Library. The situation has been around for a long time and has never caused an issue. I don't think it would start now. It's not changed much from what we had already.

As far as dumping and insertion standards go, Klarth and I have touched on this briefly. It's certainly another topic for another day. However, one conclusion we started to reach is that most custom dumping scenarios could be accounted for if we had a dumper with batch file and operation abilities, more robust pointer handling and/or scanning, tree structure handling, and robust table switching. We went over several scenarios where we had custom dumpers and why. It turns out we could eliminate many of those situations with just a few improvements in those areas.

I will see if I can address some of those areas in my utility. We plan to pick the conversation up when he's back in the states at some point in the future. Perhaps we can have a public discussion then.

Title: Re: 'Standard' Table File Format
Post by Nightcrawler on Sep 30th, 2010 at 11:57am
Some more changes:

Added:
2.2.1 No Entry Found Behavior for Dumping
2.2.2 No Entry Found Behavior for Insertion

Klarth thought it was worth defining/standardizing this behavior and how hex values should be output (<$XX>).

In addition, we decided on a few limitations for simplicity of implementation. Since we only have one control code, a literal "\n" cannot be used in a table entry, as it is always interpreted as a control code. A simple search and replace can be used to handle it.

Along those same lines, a raw hex sequence like "<$XX>" is always interpreted as hex, even if a normal table entry may overlap or conflict.
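
To make those two simplifications concrete, here is a minimal sketch; the regex, names, and sample strings are mine, not spec language. Rule one is shown dump-side, rule two insert-side.

Code:
Imports System.Text.RegularExpressions

Module SimplificationDemo
    'Matches a raw hex token such as <$1F> at the current position.
    Private ReadOnly rawHex As New Regex("^<\$([0-9A-Fa-f]{2})>")

    Sub Main()
        'Rule 1: "\n" is always the control code, so a plain
        'search and replace turns it into a real line break when dumping.
        Console.WriteLine("<line>\n".Replace("\n", Environment.NewLine))

        'Rule 2: on insertion a raw hex token always wins, even if a
        'table entry happens to overlap it.
        Dim script As String = "A<$1F>B"
        Dim pos As Integer = 0
        Do While pos < script.Length
            Dim m As Match = rawHex.Match(script.Substring(pos))
            If m.Success Then
                Console.WriteLine("insert raw byte $" & m.Groups(1).Value)
                pos += m.Length
            Else
                'Normal longest-prefix table matching would go here.
                Console.WriteLine("table-match '" & script(pos) & "'")
                pos += 1
            End If
        Loop
    End Sub
End Module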

It's probably not the 'right' thing to do, but we've reached a point where we're not interested in further complexity or in-depth modification of existing tools/libraries such as Atlas and Klarth's table library.

What we have already moves us forward, and a compatible dumper and inserter should be a good stopgap for years to come until something new and better is created with appropriate tools (such as new XML formats for tables and scripts, etc.).

My aim here was threefold: 1) to pull together everything we had and standardize it; 2) to improve and give us more; 3) to remain somewhat compatible so we can actually have utilities that use this sooner rather than later.

I think I've about reached those goals. We have much improved features. We will have tangible tools and libraries. We will have my dumper. Klarth has agreed to update Atlas as well as his public C++ table library to use the standard. I can probably nag RedComet into updating Cartographer to use the revised version of Klarth's table library too.

With that said, that's probably enough to hold us over a long while knowing how this community operates. :)

I would say the document is in final review for content. Once content is finalized, I will see about putting it into a PDF, or see if Klarth will. Then we just wait for the revision of tools. It will be awhile yet, but we're getting there.

Title: Re: 'Standard' Table File Format
Post by Tauwasser on Oct 11th, 2010 at 10:36pm

Quote:
1.2.
each line indicates what one hexadecimal binary number/s
equates to in text form


This is somewhat diffuse. I suggest "each line indicates what string of text a sequence of hexadecimal bytes equates to."


Quote:
2.1.3.
0001=Seven

           If a byte sequence 0x00 0x01 is encountered


First off, this is in the wrong area. I'd put 2.1.3.-2.1.5. under 2.2., because they directly relate to normal entries, whereas their relation to encoding is pretty much non-existent, except when you mix file encoding and file syntax together, which you shouldn't.
Secondly, the byte order is at no point made explicit. Of course, we're always talking Big Endian for table entries, but that's only clear from examples, not from a direct statement.


Quote:
2.1.4
Text COllisions:


Typo.


Quote:
2.1.3.
When hex values overlap, the largest hex value should always be used.


"Longest hex sequence." The largest value would be ambiguous, cf.
    00=5
    EE=6
    00ED=7


EE is larger than ED. Of course this is not what is meant at all by your statement.


Quote:
2.1.4.
When text values overlap, the largest text value should always be used.


"When text values overlap the entry that represents the longest prefix for the current string of text shall be used."


Quote:
2.2.
     You can also do multi-byte entries like this:
     You can multi-character entries:
     You can combine the two:


    Multi-byte entries look like this:
    Multi-character entries look like this:
    A combination of the two looks like this:



Quote:
2.2.
2.2.1 No Entry Found Behavior for Dumping


"No-Entry-Found Behavior for Dumping" See Hyphen - Compound Modifiers.


Quote:
2.2.
Expected behavior in the event no table match is found for a given hex
value is to output the raw value in the following manner:


"no table entry is found to match a given byte [sequence]"

I think you mean this to be parsed byte-wise, yet a "hex value" that cannot be found could also be "998877". Is the expected behavior to output this as "<$998877>"? This also skirts the subject of endianness.
I would mention "byte sequence" above and then explain that each byte is printed separately; however, defining this per byte is also a resolution.


Quote:
2.2.
Note: In the event the "<$XX>" string may overlap or conflict with valid
table entries, hex value insertion should take precedent.


Either it overlaps or it doesn't. There are no entries that may overlap. Also, weak language again.


Quote:
2.2.
Expected behavior in the event no table match is found for a given text
value is to ignore the character and make no hex insertion.


"no table entry is found to match a given text sequence"


Quote:
2.3.
These codes are used by dumpers only and will be ignored by inserters.


Move this up to the start, right after "There are a set of formatting control characters you can use in any of
your table entries to control the formatting and output of your script.". So it will actually be read in context and not get lost in the example.


Quote:
2.3.
there can be no literal representation
of the control code character sequence


"there can be no literal representation of control code character sequences"


Quote:
2.3.
There are a set of formatting control characters you can use in any of
your table entries to control the formatting and output of your script.


"There is a set of formatting control characters any table entry may use to control the formatting and output of the script."


Quote:
2.3.
For flexibility purposes, you can use script formatting values like "\n" to
do something like this for line break and end string control codes:

This will produce something like this at the end of a string:


"For flexibility purposes, control codes like "\n" can be used to achieve effects like the following for line breaks and end string entries:"

"This will produce the following at the end of a string:"


Quote:
2.3.
"Commenting characters"


"Comment delimiters" (w/o quotes)


Quote:
2.4.
In actual text output, your line breaks would still be
controlled via "\n".


-your
Unnecessary and distracts from statement IMO.


Quote:
2.4.
The only requirement is end token hex values must be preceded by the '/' [...]

You may have as many end token entries as you need.


"End token entries must be preceded by a "/" [character]."
"There may be an unlimited number of end tokens."


Quote:
2.4.
A typically string end token might look like this in your table:
You can use any combination of formatting controls and text
representation. This allows for nearly any variation of a string end you want.


"typical"
"Any combination of formatting control codes and text representation may be used. This allows for nearly all variation of string ends."


Quote:
2.4.
In some cases, such as fixed length inserting or other situations, there
may be instances where no actual hex value should be dumped or inserted
when the string end is reached. The following format is acceptable in these
situations.


How can no actual hex value be dumped? If I don't want a particular hex value to be included, there surely has to be a way. Not setting a string for any entry is forbidden per 2.1.5., though.


Quote:
2.5.
[I]f you want to print 2 following bytes after a certain
control code is read [...]

           $0500=<Color>,1


Example is flawed. Also notice typo "If". Also reword without "you".


Quote:
2.6.
Multiple table files is a flexible [...] and dictionary.


"Support of [m]ultiple ... and dictionaries."


Quote:
2.6.
The "trigger" hex value must be preceded by the '@' character.


Seems like it is supposed to be preceded by the '!' character, actually.


Quote:
2.6.
NumberOfTableMatches is the number of table matches to match before falling back to the previous
table. Setting this value to '0' indicates indefinite matching should be done in the new table until an entry is not found in the new table.


"NumberOfTableMatches is the non-negative
number of table matches to match before falling back to the previous table. Setting this value to '0' indicates indefinite matching should be done in the new table until no matching entry is not found in the table that was switched in."

I also highly recommend stating that TableID may not contain "," and possibly should have a minimum length of some number of characters, so clashes will occur less often.
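
For what it's worth, a quick sketch of why the comma matters, assuming the draft's "!hexvalue=TableID,NumberOfTableMatches" shape for switch entries (the regex is mine): TableID is everything between '=' and the last comma, so a comma inside the ID would make the split ambiguous.

Code:
Imports System.Text.RegularExpressions

Module SwitchEntryDemo
    Sub Main()
        'Trigger hex, then TableID (no commas allowed), then the match count.
        Dim switchEntry As New Regex("^!([0-9A-Fa-f]+)=([^,]+),(\d+)$")
        Dim m As Match = switchEntry.Match("!F2=KATA,3")
        If m.Success Then
            Console.WriteLine("trigger=" & m.Groups(1).Value & " id=" & m.Groups(2).Value & " matches=" & m.Groups(3).Value)
        End If
    End Sub
End Module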


Quote:
2.6.
Let's Assume we start with table HIRA.


"assume"


Quote:
HIRA --> KATA --> ア --> イ --> ウ --> KANJI --> 意 --> 0x03 fallback to KATA --> 0x03 fallback to HIRA --> HIRO.


I am aware that this is my own example, however, HIRO is confusable with HIRA on a quick read-through. I suggest you change this to "<Playername>" or some other example that cannot be easily confused.


Quote:
2.6.


I would like to stress the round-trip case some more.

As I had it in mind while designing this, a table change wouldn't automatically invalidate the old table change. So changing from a table 1 that is limited to a number of matches (like 3) to another table 2 that is limited, too (like 5), while not being really feasible, would be possible.

However the match would itself count as a match in table 1. I realize this is a borderline crazy case, but I think that this is the only way all cases can be accounted for.

I think epistemic problems only arise when changing from table 0 to limited table 1 and then in there change to a limited table 0 again. Practically, it's not easy to tell one way or the other, so I'd think this would stay the same as the other cases for ease of implementation. If some crazy game out there does something like that, it's too bad.


Quote:
3.0 How to Handle Common Situations


At least include a little prelude. "This section exemplifies how to handle common problems with..." or some-such.


Quote:
3.1.


In 1.2. you referred to the whole line as an entry, yet, now you only mark the right-hand side to be an entry. I suggest a cleaner explanation or maybe "DictEntry1" etc. In any case, reference to the table file itself and the example dictionary should be clarified.


Quote:
3.1.
In this case, every time hex code 0x30 is encountered, the table will
switch for one math


"match"


Quote:
3.2.
"Normal Entries"


It's not in quotes anywhere else.


Quote:
3.2. & 3.3.
See section 2.7      Table Switching for more information.


It's 2.6.


Quote:
3.2.
special characters


Believe it or not, some people actually consider these to be not special at all and would feel affronted.


Quote:
3.4.
This is a Japanese language specific issue. In short, Hiragana/Katakana
represent the same written syllables in two forms (one for transcribing
foreign words).


"(one for transcribing foreign words)" is not factually accurate and should be cut. Also, it should be "Japanese-language specific", see Hyphen - Compound Modifiers.


Quote:
3.4.
Hiragana/Katakana mark


Those are not marks, suggest "switching".


Quote:
4.1.
It's best practice when
reading the table file, as well as dealing with escape codes (\n) to
handle both types.


Those are actually three types.


Quote:
4.1.
Use encoding aware text/string processing
functions.


"Encoding-aware text/string-processing functions".


Quote:
4.2.
Please be aware some features of the table file format should be treated
different depending upon whether the task is dumping or inserting.


"differently"


Quote:
4.2.
    1.
    2.
    3.
    4.


You didn't use numbered lists for any other section. And here, it's not even required to rank these items.


Quote:
4.2. 3 & 4
Linked Entries are only needed for dumping purposes. They act as a
normal entry for inserting. Inserting would flow through the stream as
normal inserting seeing the hex values in the script.


So it wouldn't catch that "<Color>" means 0x05 0x00? That doesn't sound right...


Quote:
4.3. 2
EndTokens


First time this is written like that.


Quote:
4.3. 3 & 4
Array entries


This is the first time this term is used. You talked briefly about array encoding, but I think this is a remnant of the arrays in the former draft.



The indentation of all chapters after 1 does not follow the same rules as chapter 1. Chapter 2 uses irregular indentation for 2.3. and 2.4.

In general I have to say your language is not very normative. There are shoulds everywhere. I would recommend changing most of those to shalls and musts.

Also, you overuse "hex value". A value is something that has to be measured in units. Yet you actually talk about those units (bytes, etc.) while saying "value". IMO "hex value" does not qualify for what you're trying to say; it's a sequence of hexadecimal bytes, etc. Sequence as in: they are stored sequentially in the file (indeed, I know many specs that make this very explicit by explaining BE and LE all over again).

As I have shown above, I would personally eliminate all usage of "you".

cYa,

Tauwasser

[edit]No italics everywhere anymore.[/edit]

Title: Re: 'Standard' Table File Format
Post by Nightcrawler on Oct 25th, 2010 at 3:22pm
Thank you for that painstaking edit. It certainly needed it, and the document has definitely increased in quality as a result. :)

I agreed with and implemented nearly all changes. I will try to be more cognizant of my passive language going forward. I left a few 'shoulds' and 'yous' in sections 3 and 4 for suggestions, as they are not absolute.


A few other questions:


Quote:
*2.2.1
Expected behavior in the event no table match is found for a given hex
sequence, when reduced to a single byte, is to output the raw value in the following manner:


I am unsure how to express what I want to say here. Say the longest hex sequence in the table is 3 bytes, and the next 3 hex bytes are $9b7733. If $9b7733 is not found, you search for $9b77. If $9b77 is not found, you search for $9b. If $9b is not found, you finally output <$9b>. On the next iteration you start searching from $7733XX. That's how dumping should work. I'm not sure how to express that here.
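
Maybe a sketch says it better than prose; the table contents here are illustrative only.

Code:
Imports System.Collections.Generic

Module DumpDemo
    Sub Main()
        'Longest entry in this table is 3 bytes, as in the example.
        Dim table As New Dictionary(Of String, String) From {{"9B77AA", "<dict>"}, {"77", "a"}, {"33", "b"}}
        Dim data As Byte() = {&H9B, &H77, &H33}
        Dim maxLen As Integer = 3
        Dim pos As Integer = 0
        Do While pos < data.Length
            Dim matched As Boolean = False
            'Try the longest sequence first, shortening one byte at a time.
            For length As Integer = Math.Min(maxLen, data.Length - pos) To 1 Step -1
                Dim key As String = ""
                For i As Integer = 0 To length - 1
                    key &= data(pos + i).ToString("X2")
                Next
                If table.ContainsKey(key) Then
                    Console.Write(table(key))
                    pos += length
                    matched = True
                    Exit For
                End If
            Next
            If Not matched Then
                'Only a one-byte miss: an entry may still match starting at
                'the very next byte, so emit one raw byte and move on.
                Console.Write("<$" & data(pos).ToString("X2") & ">")
                pos += 1
            End If
        Loop
        Console.WriteLine() 'prints <$9B>ab
    End Sub
End Module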


Quote:
2.4
An end token can be defined having no actual hex sequence associated
with it. Insertion utilities can use these as indication of end of
string for pointer operations, but not insert an actual hex
representation, such as in the case of fixed length strings. The
following format is acceptable in these situations.


I'm unsure how to better express this functionality. See Atlas 1.1 documentation Page 11 and Cartographer's readme artificial control code feature. There are cases where it's desirable to have an end token dumped where no end token hex sequence is available. During insertion, the end token indicates end of string for pointer calculations, yet no actual hex sequence needs to be inserted.


Quote:
2.6

Tau, you mentioned no ',' in the table id string. Why not? All characters should be valid here. The only requirement is it starts with '!'.


Quote:
2.6
I would like to stress the round-trip case some more.

As I had it in mind while designing this, a table change wouldn't automatically invalidate the old table change. So changing from a table 1 that is limited to a number of matches (like 3) to another table 2 that is limited, too (like 5), while not being really feasible, would be possible.

However the match would itself count as a match in table 1. I realize this is a borderline crazy case, but I think that this is the only way all cases can be accounted for.

I think epistemic problems only arise when changing from table 0 to limited table 1 and then in there change to a limited table 0 again. Practically, it's not easy to tell one way or the other, so I'd think this would stay the same as the other cases for ease of implementation. If some crazy game out there does something like that, it's too bad.


I'm not following this example. Please explain.

Title: Re: 'Standard' Table File Format
Post by Tauwasser on Oct 27th, 2010 at 1:06pm

Nightcrawler wrote on Oct 25th, 2010 at 3:22pm:
I am unsure how to express what I want to say here. Say the longest hex sequence in the table is 3 bytes, and the next 3 hex bytes are $9b7733. If $9b7733 is not found, you search for $9b77. If $9b77 is not found, you search for $9b. If $9b is not found, you finally output <$9b>. On the next iteration you start searching from $7733XX. That's how dumping should work. I'm not sure how to express that here.

I know what you want to say here. I would mention that only one-byte misses are ever possible, because finding no match for any entry at offset n does not imply there is no entry that fits offset n+1.


Quote:
In the event no table match is found for a given hex sequence, the first byte of the sequence must be output as a raw value in the following manner:
[...]
Note: This directly follows as there might be a matching entry for the hex sequence starting at its second byte.

The search paradigm of taking the longest match to a hex sequence is dealt with in 2.2.3.


Nightcrawler wrote on Oct 25th, 2010 at 3:22pm:
I'm unsure how to better express this functionality. See Atlas 1.1 documentation Page 11 and Cartographer's readme artificial control code feature. There are cases where it's desirable to have an end token dumped where no end token hex sequence is available. During insertion, the end token indicates end of string for pointer calculations, yet no actual hex sequence needs to be inserted.


I find the current wording not too bad. It explains why this is desirable and demonstrates it. If I had to word it, I would only cut that run-on sentence into two separate sentences:

Quote:
An end token can be defined having no actual hex sequence associated with it.
Insertion utilities can employ these tokens as end-of-string indicators, e.g. for various pointer calculations and fixed-length strings. No actual hex sequence will be inserted in these cases.
The following format is acceptable in these situations.



Nightcrawler wrote on Oct 25th, 2010 at 3:22pm:
Tau, you mentioned no ',' in the table id string. Why not? All characters should be valid here. The only requirement is it starts with '!'.

I thought of this more in the context of generalizing table changes and linked entries. You don't allow U+002C COMMA in linked entries' names ― most likely for easy parsing of the linked entry syntax:

Quote:
$hexadecimal sequence=label,decimal number



Nightcrawler wrote on Oct 25th, 2010 at 3:22pm:
I'm not following this example. Please explain.


The example is as follows: suppose we have three tables, Table 0, 1, and 2. Suppose they have table IDs as indicated below and normal entries for all hex values 0x00 through 0xBF.

Table 0 contains C0=TABLE1,3. Table 1 contains C1=TABLE2,5. Table 2 is not so important for this example, but does not change tables anymore for the sake of simplicity.

Now we have the following parsing dilemma:



The following two solutions come to mind:


I think the most rational choice out of this dilemma is solution I.

However, as I implied in my original post, there might be times when a game switches between tables inside each other and expects these changes to basically reset some sort of "table stack". That said, I have yet to see any game actually do that, so I'd say implementing solution I is pretty safe; the above example seems constructed specifically to break this implementation, as it doesn't seem sensible to me to implement table switching that way.

So I hope to have cleared that up :)

cYa,

Tauwasser

Title: Re: 'Standard' Table File Format
Post by Nightcrawler on Oct 28th, 2010 at 3:29pm
Ok. I tried to fix up 2.2.1 and 2.2.2 a bit with the wording on only one-byte/character misses being possible. I didn't want to get too wordy for fear of further over-complicating a simple matter.

I made a note about no commas in TableID. I was only thinking about the TableID line and wasn't thinking about the TableID also being used on the table switch entry line, which of course can't have a comma without complicating parsing. Doh! *headsmack*

I agree with the table switching itself being counted toward the number of matches. In fact, that's how I thought it was already implied to work. I added a small note to make it clear.
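
For implementers, the agreed behavior might be sketched as a stack of (table, remaining matches), where the switch itself costs a match in the table being left. The names and structure are mine, not spec language.

Code:
Imports System.Collections.Generic

Module TableStackDemo
    'One frame: the active table and its remaining match budget
    '(-1 stands for the indefinite case, a '0' count in the table file).
    Private Class Frame
        Public TableId As String
        Public Remaining As Integer
        Public Sub New(id As String, count As Integer)
            TableId = id
            Remaining = count
        End Sub
    End Class

    Private ReadOnly tables As New Stack(Of Frame)

    Sub Main()
        tables.Push(New Frame("HIRA", -1))
        CountMatch()                      'the switch counts as a match in HIRA (a no-op here, HIRA is indefinite)
        tables.Push(New Frame("KATA", 3)) 'switch entry: !xx=KATA,3
        For i As Integer = 1 To 3
            CountMatch()                  'three matches in KATA, then fall back
        Next
        Console.WriteLine("active: " & tables.Peek().TableId) 'HIRA again
    End Sub

    Private Sub CountMatch()
        Dim top As Frame = tables.Peek()
        If top.Remaining > 0 Then
            top.Remaining -= 1
            If top.Remaining = 0 Then tables.Pop() 'budget spent: fall back
        End If
    End Sub
End Module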


Embedded Pointers:

One new item came to the table recently as I was working on TextAngel, compatibility with Atlas, and support of this table format. The one big thing Atlas handles that my dumper doesn't is embedded pointers. Maybe this doesn't belong, but it's very similar to linked entries and it could possibly be very useful. I thought it was worth thought and consideration if nothing else.

It's kind of like a 'linked entry' where a control code is encountered and one or more placeholders for embedded pointers would follow. I can see this defined in a similar manner. Say:

FC=<yes/no>,2

When the <yes/no> control is encountered, we know there will be two embedded pointers afterward and output "<yes/no>#EMBSET(0)#EMBSET(1)" to the script file. That would get us the placeholders (the #EMBSET commands). However, I can't think of a sensible way to define to the dumper how to figure out when to write to the placeholders (to get the #EMBWRITE commands). Atlas examples suggest you would just use the next end tokens. However, when you hit an end token, how do you know whether an embedded pointer should be written or a normal pointer in the larger table? It seems like you just write to the embedded table until it's exhausted (no more defined #EMBSETs to write to) and then fall back to the main pointer table.

I don't know how flexible that would be. I think I recall seeing games with a similar yes/no embedded setup, but different behavior.

Anyway, this might be something to consider trying to define for the table format? It might be a large leap forward as no available dumpers handle this in any capacity that I know of. It would be a dumping issue only. We obviously don't want to get into pointer specific information in the table file though. I was trying to keep it abstracted, however it seems the number of pointers and number of bytes will come into play. I'm not sure how it could be done otherwise.
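
To make the idea concrete, a sketch of the dumping side only. The entry format and the #EMBSET/#EMBWRITE commands are taken from the description above; everything else is guesswork.

Code:
Imports System.Text

Module EmbeddedPointerSketch
    Sub Main()
        'FC=<yes/no>,2 from the example: on the control code, emit the
        'label plus one #EMBSET placeholder per embedded pointer.
        Dim slots As Integer = 2
        Dim line As New StringBuilder("<yes/no>")
        For i As Integer = 0 To slots - 1
            line.Append("#EMBSET(" & i & ")")
        Next
        Console.WriteLine(line.ToString())

        'Guessed rule: each end token writes the next placeholder until the
        'embedded table is exhausted, then normal pointer handling resumes.
        Dim nextSlot As Integer = 0
        For endToken As Integer = 1 To 3
            If nextSlot < slots Then
                Console.WriteLine("#EMBWRITE(" & nextSlot & ")")
                nextSlot += 1
            Else
                Console.WriteLine("(fall back to the main pointer table)")
            End If
        Next
    End Sub
End Module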

Thoughts?

Title: Re: 'Standard' Table File Format
Post by Tauwasser on Nov 1st, 2010 at 9:31am

Nightcrawler wrote on Oct 28th, 2010 at 3:29pm:
Anyway, this might be something to consider trying to define for the table format? It might be a large leap forward as no available dumpers handle this in any capacity that I know of. It would be a dumping issue only. We obviously don't want to get into pointer specific information in the table file though. I was trying to keep it abstracted, however it seems the number of pointers and number of bytes will come into play. I'm not sure how it could be done otherwise.

I think this is a big minus. It's very specific to add pointer information and it might also be console specific. I have personally never encountered a game that uses this format on the GBC, because most separate control flow from the text shown.
Maybe the programming paradigm was different back in SNES days?

Also, personally I think defining this as a linked entry and then going over it with RegEx or a Beanshell script is much more flexible and accessible.

This really boils down to your perception of a table file.


Quote:
[A table file]'s sole purpose is to act as a hex to text and text to hex encoding file. Basically, this means it's a table to turn binary data into text and vice versa.

I think including pointers in there does not match this definition anymore. It would not only translate text to hex and vice versa, it would also need to translate logic embedded in the text.

To reiterate, personally, I think we'd be better off letting those things be a linked entry and then formatting files with regular expressions/Beanshell scripts. That way you
  • have no hassle with including pointer formats and lengths in the table file, and
  • can leave the further processing up to the user via much more dynamic processes including scriptable conversions.

For instance, Beanshell is basically Java. So you can use its RegEx capabilities to capture groups, modify the data contained in there however you like, put the string in a StringBuilder, and output that.
Using specific ATLAS commands will only glue the user to having to use ATLAS for insertion or somehow get the data that was replaced by EMBSET back if he cares for literal data.

EDIT: Is there any particular reason why entries with blanks on the right-hand side are still illegal?

I think they're perfectly acceptable, for dumping and for inserting.

cYa,

Tauwasser

[edit]Comment added.[/edit]

Title: Re: 'Standard' Table File Format
Post by Nightcrawler on Nov 5th, 2010 at 11:02am

Tauwasser wrote on Nov 1st, 2010 at 9:31am:
I think this is a big minus. It's very specific to add pointer information and it might also be console specific. I have personally never encountered a game that uses this format on the GBC, because most separate control flow from the text shown.
Maybe the programming paradigm was different back in SNES days?

Also, personally I think defining this as a linked entry and then going over it with RegEx or a Beanshell script is much more flexible and accessible.

This really boils down to your perception of a table file.


Yes. I agree. I have given this more thought. It doesn't belong shoe-horned into the table. Even if the definition is stretched, there's no way to abstract the details enough, and adding anything Atlas-specific (or any utility-specific) is not desirable. It is indeed better handled with post-processing.


Quote:
EDIT: Is there any particular reason why entries with blanks on the right-hand side are still illegal?

I think they're perfectly acceptable, for dumping and for inserting.


They are illegal because they invalidate the map. You cannot map something to nothing or nothing to something.

$FE=
$FC=

What does this example say? Hex sequences FE and FC map to 'nothing' for dumping. And we're saying 'nothing' maps to both FE and FC for insertion. This is an illogical fallacy. It's definitely illegal for insertion direction. It has two of the same text sequences mapping to different hex sequences. If it's not part of the map, why is it in the table file? Klarth and I thought both invalid map sequences like this, and unrecognized lines in general, should be illegal and generate an error.

At best, it could be valid for the dumping direction only, depending on your perception of 'nothing'. I suppose it's the only way to skip processing of a particular hex sequence altogether and output nothing at all. Actually, that may have been my original intention: to make it invalid in one direction only (not totally invalid like it says).


With that explanation, do you still think they are perfectly acceptable? If so, why?

Title: Re: 'Standard' Table File Format
Post by Tauwasser on Nov 7th, 2010 at 12:13pm

Nightcrawler wrote on Nov 5th, 2010 at 11:02am:
They are illegal because they invalidate the map. You cannot map something to nothing or nothing to something.

First off, they do not map "something to nothing". They map a hex sequence to the empty string and thus do not invalidate the map. Secondly, I have seen games that do this.
Now I will pull a Pokémon example out of my hat again, but 0x35 does exactly this in Gold/Silver. They are simply skipped.
What purpose does this serve?
  • You can implement variable-width player names without much hassle. Many NES games, for instance, tend to print a fixed 5 characters for the name. This results in redundant whitespace.
    However, if the name is stored with ignorable characters to fill the five character limit, all printing will be correct the first time around, without much legwork.
  • You can programmatically fix British English to be American English for "-our/-or" cases like colour vs. color.
    This way, you don't have to QC the assembly again, because the script will have the same size in bytes (and indeed, the replacement could even be made in a compiled image).


Nightcrawler wrote on Nov 5th, 2010 at 11:02am:
This is an illogical fallacy.

That one made me snicker :)


Nightcrawler wrote on Nov 5th, 2010 at 11:02am:
It's definitely illegal for insertion direction.

As far as I am concerned, the insertion problem is about finding a match for the longest prefix of size ≧ 1 (in characters) of a given text sequence at a given position [in the text] in the table file and inserting its [the match's] hexadecimal sequence into the rom at a specific position.
Per definitionem, the empty string has a size of 0, so it is not included to find a map for it in the insertion process.
This is even evident by the current wording of the spec:


Quote:
Expected behavior in the event no table entry is found to match a given text sequence is to ignore the character and make no hex insertion.

You want to ignore a character, so the empty string could not have been included in the search for a map in text ⇒ hex direction to begin with, as it contains no characters.
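
Stated as code: the insertion loop only ever considers prefixes of length ≥ 1, so an empty-string entry is never even a candidate. A sketch with an illustrative table:

Code:
Imports System.Collections.Generic

Module InsertDemo
    Sub Main()
        Dim table As New Dictionary(Of String, String) From {{"AB", "00"}, {"A", "01"}}
        Dim text As String = "ABA"
        Dim maxLen As Integer = 2
        Dim pos As Integer = 0
        Do While pos < text.Length
            Dim matched As Boolean = False
            'Longest prefix first; the loop floor of 1 keeps the empty
            'string out of the search entirely.
            For length As Integer = Math.Min(maxLen, text.Length - pos) To 1 Step -1
                Dim prefix As String = text.Substring(pos, length)
                If table.ContainsKey(prefix) Then
                    Console.Write(table(prefix) & " ")
                    pos += length
                    matched = True
                    Exit For
                End If
            Next
            If Not matched Then pos += 1 'no match: ignore the character
        Loop
        Console.WriteLine() 'prints 00 01
    End Sub
End Module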


Nightcrawler wrote on Nov 5th, 2010 at 11:02am:
If it's not part of the map, why is it in the table file?


How does this invalidate a map? It's still a map, just not an injective one.


Nightcrawler wrote on Nov 5th, 2010 at 11:02am:
With that explanation, do you still think they are perfectly acceptable? If so, why?


I hope I have argued my point of view and why I think it's perfectly acceptable.


I just noticed, that the doc still has "No Entry Found" instead of "No-Entry-Found" in most places. It's even inconsistent with the index. Indeed, it says "[...] Insertion" in the index while it says "[...] Inserting" in the body.
And yes, I am pedantic :) However, I'm just trying to help improve the spec and get common cases "under the lid" here.

cYa,

Tauwasser

[edit]Yeah, I meant injective, not surjective. Sorry about that. I think that part was clear though.[/edit]

Title: Re: 'Standard' Table File Format
Post by Nightcrawler on Nov 9th, 2010 at 1:52pm

Tauwasser wrote on Nov 7th, 2010 at 12:13pm:
First off, they do not map "something to nothing". They map a hex sequence to the empty string and thus do not invalidate the map. Secondly, I have seen games that do this.


I figured that's what your perception of 'nothing' would be when I wrote my last response. This is only valid in a limited sense. An empty string is a text sequence like any other. Let's look at this:

35=
7B=

That's fine for dumping. However, you can have only ONE of those type of values without issue for insertion. Otherwise, if the table is loaded in the insertion direction, you have an (illegal) duplicate where the same text sequence maps to different hex sequences.  You can't have duplicates in a valid hash map.

See the issue there? You can't use that same table for insertion.

The opposite counterpart:
=test

Besides the same issues as in the other case, this is illegal because an 'empty' hex sequence is malformed and not allowed. Also, we did define that when no insertion match for a text sequence can be made, it is ignored, which gives the same desired end result.



Quote:
As far as I am concerned, the insertion problem is about finding a match for the longest prefix of size ≧ 1 (in characters) of a given text sequence at a given position [in the text] in the table file and inserting its [the match's] hexadecimal sequence into the rom at a specific position.
Per definitionem, the empty string has a size of 0, so it is not included to find a map for it in the insertion process.


Right. You just reiterated my point. You just said it is not included in the map. It's illegal for the map in one direction, and anything illegal for the map generates error.

You seem to suggest making special exception during processing to ignore that type of line for insertion, while I believe no special exception should be made, and it should generate an error when loaded in that direction, because it is not part of the map.


Quote:
I hope I have argued my point of view and why I think its perfectly acceptable.


I understand your point of view; however, I believe it is valid in one direction only (or in both directions if there is only one such entry of this type). In the cases where it is not valid, I would expect an error. You can still do what you want, you just can't load the same table in the dumping and inserting direction when they have duplicates like that. In fact, the same concept is applied to any duplication in hex or text sequences, which I forgot to include in the document.

In summary, what I'm saying is I would propose this amendment to make it clearer:


Quote:
2.2.5 Illegal Sequences:
     
           Duplicate Entry:
           
                 00=test
                 00=test
                 
                 Full duplicate entries are not allowed and shall generate an error.
                 
                 00=test
                 01=test
                 
                 Duplicate text sequences are not allowed when the table is loaded in the inserting direction.
                 The same text sequence cannot map to multiple hex sequences for inserting.
                 
                 00=test
                 00=test2
                 
                 Duplicate hex sequences are not allowed when the table is loaded in the dumping direction.
                 The same hex sequence cannot map to multiple text sequences for dumping.
                 
           Blank or Empty entries:
           
                 00=
                 01=
                 
                 Multiple blank text sequences are not allowed when the table is loaded in the inserting direction.
                 The same 'empty string' text sequence cannot map to multiple hex sequences for inserting.
                 
                 =test
                 
                 A 'blank' hex sequence is not allowed.


What do you think about that?
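
To make the intent concrete, here is a sketch of loading the same entry list in both directions under that amendment; the structure and names are mine.

Code:
Imports System.Collections.Generic

Module ValidateDemo
    Sub Main()
        'Two entries sharing the same text sequence, as in the amendment.
        Dim entries() As String = {"00=test", "01=test"}

        'Dumping direction: duplicate text is fine; duplicate hex is an error.
        Dim dump As New Dictionary(Of String, String)
        For Each entry As String In entries
            Dim parts() As String = entry.Split("="c)
            If dump.ContainsKey(parts(0)) Then
                Console.WriteLine("error: duplicate hex sequence " & parts(0))
            Else
                dump(parts(0)) = parts(1)
            End If
        Next

        'Inserting direction: the same text (even the empty string) mapping
        'to two hex sequences shall generate an error.
        Dim insert As New Dictionary(Of String, String)
        For Each entry As String In entries
            Dim parts() As String = entry.Split("="c)
            If insert.ContainsKey(parts(1)) Then
                Console.WriteLine("error: duplicate text sequence '" & parts(1) & "'")
            Else
                insert(parts(1)) = parts(0)
            End If
        Next
    End Sub
End Module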


Quote:
I just noticed, that the doc still has "No Entry Found" instead of "No-Entry-Found" in most places. It's even inconsistent with the index. Indeed, it says "[...] Insertion" in the index while it says "[...] Inserting" in the body.
And yes, I am pedantic :) However, I'm just trying to help improve the spec and get common cases "under the lid" here.


Corrected in my working copy. I don't mind the feedback. This document isn't that popular, so few are interested in reading it over in detail.

Title: Re: 'Standard' Table File Format
Post by Tauwasser on Nov 10th, 2010 at 10:05pm

Nightcrawler wrote on Nov 9th, 2010 at 1:52pm:
Let's look at this:

35=
7B=

That's fine for dumping. However, you can have only ONE of those type of values without issue for insertion.


See below.


Nightcrawler wrote on Nov 9th, 2010 at 1:52pm:
You can't have duplicates in a valid hash map.


I argue very much against this point. It would just be a map to a string array, which is perfectly valid.

Of course, using a naïve implementation would also not break the hash map. Most languages I have used replace values when another entry with the same key is "added".
So then the insertion process would be determined by chance:


Code:
3B=Test
7B=Test


It might insert 0x3B or 0x7B, depending on the programming language, the order the table is read in, and the will of the programmer. However, from a practical point of view, since both of these hex sequences map to exactly the same "Test" in reverse lookup, there is no problem using only one of these values for all the occurrences of "Test". If there were a problem, the hex sequences could not mean the same "Test" under all circumstances and are therefore not to be considered the same (so a premise is invalidated).

More problematic would be the following:


Code:
3B=Test
7B7B7B7B7B=Test


This would greatly increase the script size upon insertion; however, it would not break the hash map with the naïve implementation above.

A clever implementation might implement a map from strings to a hex sequence array and use the shortest sequence available or alternatively do some simple math while reading the table (or when "adding" doesn't replace the value, but throws an exception) like so:


Code:
        'Assumes file-level Imports System.Text.RegularExpressions and Imports System.IO;
        'reader is an open StreamReader and the two dictionaries are class-level fields.
        'Regular expression for matching
        Dim regObj As Regex = New Regex("^([a-fA-F0-9]{2,})=(.*)$")
        Dim matchObj As Match = Nothing
        'ReadLine, init to != Nothing
        Dim readLine As String = ""
        Dim hexGroup As String = Nothing
        Dim textGroup As String = Nothing

        'Prep hashtables, case-sensitive
        hexTextHashTable = New Dictionary(Of String, String)(StringComparer.InvariantCulture)
        textHexHashTable = New Dictionary(Of String, String)(StringComparer.InvariantCulture)

        Do Until readLine Is Nothing

            readLine = reader.ReadLine()

            'Don't care about empty lines
            If (Not String.IsNullOrEmpty(readLine)) Then

                'Match using regEx
                matchObj = regObj.Match(readLine)
                'Get at least two groups, or skip!
                If (matchObj.Success) Then

                    'Match Group 0 is entire regex Match
                    'Match Group 1 is hexadecimal
                    'Match Group 2 is text

                    'Make sure we got an even number of hex digits
                    hexGroup = matchObj.Groups(1).Value.ToUpperInvariant
                    textGroup = matchObj.Groups(2).Value
                    If hexGroup.Length Mod 2 <> 0 Then Continue Do

                    'Add to tables (Add throws on a duplicate hex key, which the spec forbids anyway)
                    hexTextHashTable.Add(hexGroup, textGroup)

                    ' If (not (is in table)) OR (is in table, but value in table is a longer hex sequence)
                    If (Not textHexHashTable.ContainsKey(textGroup) OrElse textHexHashTable(textGroup).ToString().Length > hexGroup.Length) Then

                        'This will not throw an exception when key textGroup is already in dictionary.
                        textHexHashTable(textGroup) = hexGroup

                    End If

                End If

            End If

        Loop


This is a working Unicode-supporting table class I actually wrote some days ago. Notice that while I'm using Dictionary(Of String, String), I might as well use Hashtable. Hashtable is not typesafe in VB.NET; Dictionary is. The syntax is exactly the same for both, meaning it won't throw exceptions either.


Nightcrawler wrote on Nov 9th, 2010 at 1:52pm:

Tauwasser wrote on Nov 7th, 2010 at 12:13pm:
As far as I am concerned, the insertion problem is about finding a match for the longest prefix of size ≧ 1 (in characters) of a given text sequence at a given position [in the text] in the table file and inserting its [the match's] hexadecimal sequence into the rom at a specific position.
Per definitionem, the empty string has a size of 0, so it is not included to find a map for it in the insertion process.


Right. You just reiterated my point. You just said it is not included in the map.


Nope. I just said it is not included in the search! It can be included in the map. The point here is, you never look for the empty string in the first place, because it is empty.
Therefore, it doesn't matter whether it is included in a map or not, because you will never look for it, much like you never look for a text match for the hex sequence of length zero.
Therefore the special case of having a non-injective set of table entries for the empty string text sequence coincides with the case of having any non-injective set of table entries (on the text side).


Nightcrawler wrote on Nov 9th, 2010 at 1:52pm:
Otherwise, if the table is loaded in the insertion direction, you have an (illegal) duplicate where the same text sequence maps to different hex sequences.
[...]
It's illegal for the map in one direction[...].


Just to reiterate, it is not illegal. There is nothing per se illegal about a map that is not injective in this case.


Nightcrawler wrote on Nov 9th, 2010 at 1:52pm:
You seem to suggest making special exception during processing to ignore that type of line for insertion, while I believe no special exception should be made, and it should generate an error when loaded in that direction[...].


I suggest that the empty string does not impose the need of special exceptions for insertion because it was never included in the insertion problem to begin with.
I would, however, suggest that special exceptions can be made for non-injective entries in general, like I have shown above.
One basically does not need to care about these cases at all in most programming languages and the creation of a hashmap will work. This comes with the trade-off that the length of the mapped hex sequence for any string might not be the shortest in the table.
However, basically one simple line in a very basic implementation can already take care of this scenario without the need to impose somewhat complex algorithms and heuristics (as are needed for an optimal insertion with multiple tables for instance).


Nightcrawler wrote on Nov 9th, 2010 at 1:52pm:
I believe it is valid in one direction only (or in both directions if there is only one such entry of this type).

You come back to non-injective maps being "illegal" over and over again. Thereby, you exclude a very common use case! Off of the top of my hat, RHDN sometime last week: Here. Granted, the user will probably not need to ever import the Japanese script back into the game, however, you want to make him dump the script with "物1" and "物2" just to search-and-replace this then in all his script files? C'mon!
Also, the ease with which I could come up with a real-life example should frighten you, because it could mean a v1.1 of the spec rather soon after its release.


Nightcrawler wrote on Nov 9th, 2010 at 1:52pm:
[Y]ou just can't load the same table in the dumping and inserting direction when they have duplicates like that.

Well, I have shown above that already a very simple implementation can handle this problem ― and it isn't the most simple one, which I have also mentioned: for a by-chance implementation, you would not even need to check if a given key is already in the hashmap. You just add it to the hashmap and the programming language's standard implementation will silently overwrite any duplicate. It doesn't get any easier than that from a practical point of view.


Nightcrawler wrote on Nov 9th, 2010 at 1:52pm:
In fact, the same concept is applied to any duplication in hex [...] sequences [...].

The hex sequence cannot have duplicates per definitionem, I'm not arguing those cases.

In all actuality, I thought that was what you meant with your example:

Quote:
Duplicate Entry:
           
                 00=test
                 00=test
                 
                 Duplicate entries of any kind are not allowed and shall generate error.


Now over the course of this dialogue it seems to surface that this paragraph is not only talking about duplicate hex sequences ― how I interpreted it before ― but actually also about duplicate text sequences as well, which I had not assumed.


Nightcrawler wrote on Nov 9th, 2010 at 1:52pm:
What do you think about that?

I still find exclusion of non-injective tables, as well as of text sequences containing only the empty string, to be an unnecessary limitation of this spec and therefore to be null and invalid.


I can see the following use cases per entry of a table file:
  • Hex and text sequences are unique:
    There is no problem identifying corresponding pairs in any hashtable implementation. The empty string is included.
  • Hex sequence is unique, text sequence is duplicate:
    There is no ambiguity in dumping direction. The empty string is included. There is ambiguity in insertion direction. I have shown above how this can quickly and sensibly be resolved with a simple heuristic of taking the shortest hex sequence in the table for insertion. Either the text sequences are identical under all circumstances and therefore it does not matter which hex sequence is inserted, or they are not identical and the hex sequences should therefore not map to identical text sequences to begin with.
  • Hex sequence is ambiguous, text sequence is unique or ambiguous:
    Ambiguous hex sequences are forbidden per definitionem as no good case can be made for preferring one hex sequence over another hex sequence based on their text sequences. This case is therefore to be forbidden.
  • The hex sequence is the empty hex sequence, text is unique or ambiguous:
    The empty hex sequence is nonsensical, because it is always a prefix sequence of itself. Therefore there would be an unlimited number of empty hex sequences to be dumped between two bytes of the file in question. This case is therefore to be forbidden.


Before you try to make a case against the empty string out of the argument that it is always a prefix of itself (which is true): It is still not included in the insertion problem, because it has length 0 and only prefixes of length ≧ 1 are included in the insertion problem.

I hope I have shed some light onto my reasoning and why I think the non-injective-table case can be dealt with in an easy, comprehensive and logical fashion.

cYa,

Tauwasser

[edit]Edited mistake in example code. Some language refinements.[/edit]
[edit]Some more explanation regarding 物 duplicate over at RHDN.[/edit]
[edit]Yeah, I meant injective, not surjective. Sorry about that. I think that part was clear though.[/edit]

Title: Re: 'Standard' Table File Format
Post by Nightcrawler on Nov 11th, 2010 at 9:50am

Tauwasser wrote on Nov 10th, 2010 at 10:05pm:
I argue very much against this point. It would just be a map to a string array, which is perfectly valid.

Of course, using a naïve implementation would also not break the hash map. Most languages I have used replace values when another entry with the same key is "added".


Really? What languages? All the .NET languages, C++, and Java don't.

See Dictionary.Add()
"ArgumentException - An element with the same key already exists in the Dictionary<TKey, TValue>."

C++ does not either.
"Because map containers do not allow for duplicate key values, the insertion operation checks for each element inserted whether another element exists already in the container with the same key value, if so, the element is not inserted and its mapped value is not changed in any way."

Java does not either.
"An object that maps keys to values. A map cannot contain duplicate keys; each key can map to at most one value."

There are other ways you can accomplish the map and allow duplicates, but why make it more difficult, be less efficient, and likely require more code? A basic hash map is all that is needed. Nearly all implementations I know of don't like duplicate keys.


Quote:
So then the insertion process would be determined by chance:


I don't like that at all. No behavior should be undefined.



Quote:
A clever implementation might implement a map from strings to a hex sequence array and use the shortest sequence available or alternatively do some simple math while reading the table (or when "adding" doesn't replace the value, but throws an exception) like so:

This is a working Unicode-supporting table class I actually wrote some days ago. Notice that while I'm using Dictionary(Of String, String), I might as well use Hashtable. Hashtable is not typesafe in VB.NET; Dictionary is. The syntax is exactly the same for both, meaning it won't throw exceptions either.


Your code illustrates the extra processing required to avoid trying to add duplicate keys to the dictionary. My aim is to stay away from needing 'clever' implementations and head toward sheer simplicity. Most programmers in our community are struggling, self-taught ones. Why shouldn't you just be able to split the table line and stick it in the dictionary as-is for all normal entries? Obviously we have a few hoops such as checking for blank lines and validating the even hex, but why keep adding more?

Instead, I think the table should follow a hash map from a conceptual point of view. Allow duplicate values, but not duplicate keys. Keep all cases clearly defined. Why do we have to start making exceptions and adding undefined/chance behavior? I'm just not seeing the need. You can still accomplish what you want to accomplish. You just need a different table for dumping and inserting if you intend to use tables that would result in key duplication when processed in the other direction.



Quote:
You come back to non-surjective maps being "illegal" over and over again. Thereby, you exclude a very common use case! Off the top of my head, RHDN sometime last week: Here. Granted, the user will probably not need to ever import the Japanese script back into the game; however, you want to make him dump the script with "物1" and "物2" just to search-and-replace this then in all his script files? C'mon!
Also, the ease with which I could come up with a real-life example should frighten you, because it could mean a v1.1 of the spec rather soon after its release.


You misunderstand how this situation is handled. I haven't excluded any use case, including this one. There would be a difference between the table used for dumping and insertion.

Dumping Table:
01=物
02=物
03=物

There's no problem. There can be duplicate values. You can have as many as you want map to that Kanji.

Inserting Table:

02=物

For inserting, you need to have ONE entry so it is clearly defined what 物 should map to, and it would not cause any duplicate key scenario.

Does that make sense? You can do exactly what you want, you just can't dump and insert with the exact same table. That's probably already the case for many utilities. That is how these situations can be handled and still conform to the paradigm of a hash map. We do not need to break it and add additional exceptions in code.


Quote:
I hope I have shed some light onto my reasoning and why I think the non-surjective-table case can be dealt with in an easy, comprehensive and logical fashion.


I agree 100% with you that these use cases need to be accounted for. I've been around enough years to have certainly seen and used them myself. They are handled in the manner I showed above with my RHDN case handling example.

It seems our disagreement is on how they should be handled. I really want to stick with following the paradigm of a hash map, simplify implementation, and have clearly defined behavior. The only negative is requiring table modification for dumping vs. insertion in those cases. However, that modification is just having the user clarify what they actually want done in the duplicate key situation.

You'd like to see the hash table paradigm broken in the table file and the exceptions added to the code so the internal hash table never sees them. The disadvantage is the chance behavior and increased code complexity. I understand you could define the behavior, but that would further require code. It would also lock in the behavior for duplicate situations. My method lets the user define, via table modification, what should be done.

Do you agree with the advantage vs. disadvantage analysis of both?

Title: Re: 'Standard' Table File Format
Post by Tauwasser on Nov 12th, 2010 at 10:01am

Nightcrawler wrote on Nov 11th, 2010 at 9:50am:
Really? What languages? All the .NET languages, C++, and Java don't.

I already showed above that VB.NET, C++.NET and C#.NET do support this!

MSDN HashTable


Quote:
You can also use the Item property to add new elements by setting the value of a key that does not exist in the Hashtable; for example, myCollection["myNonexistentKey"] = myValue. However, if the specified key already exists in the Hashtable, setting the Item property overwrites the old value. In contrast, the Add method does not modify existing elements.

So while the add method does not do this, it works as described in my example above using the item property! No more code is needed!

For C++ see here: Operator [].


Quote:
If x matches the key of an element in the container, the function returns a reference to its mapped value.

If x does not match the key of any element in the container, the function inserts a new element with that key and returns a reference to its mapped value.

Java: HashMap#put()


Quote:
Associates the specified value with the specified key in this map. If the map previously contained a mapping for this key, the old value is replaced.

All of the main programming languages you mentioned already support this scenario!


Nightcrawler wrote on Nov 11th, 2010 at 9:50am:
There are other ways you can accomplish the map and allow duplicates, but why make it more difficult, be less efficient, and likely require more code? A basic hash map is all that is needed. Nearly all implementations I know of don't like duplicate keys.

I have just shown that nothing you just said holds. It does not require more code, it works with a simple HashMap type in all these languages, it does not duplicate keys, it's not ambiguous and it's not less efficient!


Nightcrawler wrote on Nov 11th, 2010 at 9:50am:
I don't like that at all. No behavior should be undefined.

It's not an undefined process. The length of the hex sequence is determined, as mentioned above, by how the programmer codes his routine (first and foremost) and by the order in which table entries are read in a simple implementation. There is nothing undefined about this. If the table mapping is correct, insertion will yield proper data, just not optimal data size-wise.


Nightcrawler wrote on Nov 11th, 2010 at 9:50am:
Your code illustrates the extra processing required to avoid trying to add duplicate keys to the dictionary.

No, my code shows how easy it is to select the shortest hex sequence for any given text sequence in the table!

The no-brainer simple solution is the following (which I also mentioned in my last post):


Code:
        'Requires Imports System.Text.RegularExpressions
        'Regular expression for matching table lines
        Dim regObj As Regex = New Regex("^([a-fA-F0-9]{2,})=(.*)$")
        Dim matchObj As Match = Nothing
        'ReadLine result, init to something other than Nothing
        Dim readLine As String = ""
        Dim hexGroup As String = Nothing
        Dim textGroup As String = Nothing

        'Prep hash tables, case-sensitive
        hexTextHashTable = New Dictionary(Of String, String)(StringComparer.InvariantCulture)
        textHexHashTable = New Dictionary(Of String, String)(StringComparer.InvariantCulture)

        Do Until readLine Is Nothing

            readLine = reader.ReadLine()

            'Don't care about empty lines
            If (Not String.IsNullOrEmpty(readLine)) Then

                'Match using regex
                matchObj = regObj.Match(readLine)
                'Get at least two groups, or skip!
                If (matchObj.Success) Then

                    'Match group 0 is the entire regex match
                    'Match group 1 is the hexadecimal side
                    'Match group 2 is the text side

                    'Make sure we got an even number of hex digits
                    hexGroup = matchObj.Groups(1).Value.ToUpperInvariant()
                    textGroup = matchObj.Groups(2).Value
                    If hexGroup.Length Mod 2 <> 0 Then Continue Do

                    'Add to tables: Add() throws on duplicate hex keys,
                    'the item property silently replaces duplicate text keys
                    hexTextHashTable.Add(hexGroup, textGroup)
                    textHexHashTable(textGroup) = hexGroup

                End If

            End If

        Loop

Notice how duplicate keys on the hexadecimal side will throw an exception, while they won't throw an exception on the text side. It doesn't get easier than that! And the insertion behavior is clearly not undefined:
this implementation will always insert, for any given text sequence, the hex sequence of the last occurring entry for that text sequence in the table.
Doesn't seem so undefined to me. It's just not guaranteed to be optimal!


Nightcrawler wrote on Nov 11th, 2010 at 9:50am:
Instead, I think the table should follow a hash map from a conceptual point of view. Allow duplicate values, but not duplicate keys.

Which I have been arguing for all along, because you want to disallow it. And I have gone the extra mile and have shown how programs using the standard language implementation of a hashmap (typesafe or not) can easily and swiftly process these.


Nightcrawler wrote on Nov 11th, 2010 at 9:50am:
You can still accomplish what you want to accomplish.

Exactly, and with just one table! Also, your use of 'undefined' to mean behavior that depends on the order of operations is clearly problematic!
A simple implementation (the VB.NET code in this post) will depend on the order of operations while reading the table. That is not undefined behavior and doesn't trigger undefined behavior for insertion!

However, just one if-clause will make this code optimal! See the sketch below and the code in my previous post!
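
A minimal sketch of that if-clause, using the variable names from the listing above: replace the plain item-property assignment with a length check, so the shortest hex sequence always wins.

Code:
                    'Keep only the shortest hex sequence per text sequence
                    If Not textHexHashTable.ContainsKey(textGroup) _
                        OrElse hexGroup.Length < textHexHashTable(textGroup).Length Then
                        textHexHashTable(textGroup) = hexGroup
                    End If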


Nightcrawler wrote on Nov 11th, 2010 at 9:50am:
For inserting, you need to have ONE so it is clearly defined what 物 should map to, and would not cause any duplicate key scenario.

Which is exactly what the if-clause does. Choose the optimal entry that yields the shortest hex sequence. Why can't this be automated? If somebody wants to insert with a special hex value, he can still do that. However, Joe Shmo can use his dumping table and be happy.


Nightcrawler wrote on Nov 11th, 2010 at 9:50am:
Does that make sense? You can do exactly what you want, you just can't dump and insert with the exact same table.

Why shouldn't the user be allowed to have the comfort of dumping and inserting with the same table?


Nightcrawler wrote on Nov 11th, 2010 at 9:50am:
That is how these situations can be handled and still conform to the paradigm of a hash map. We do not need to break it and add additional exceptions in code.

And my reference implementation doesn't, fancy that! And it's still a hash table! And no unnecessary code was inserted!


Nightcrawler wrote on Nov 11th, 2010 at 9:50am:
It seems our disagreement is on how they should be handled. I really want to stick with following the paradigm of a hash map, simplify implementation, and have clearly defined behavior.

Behavior is clearly defined, the languages' hashmap implementations can be used with simple code and it's a lot more comfort for the user. What have I missed?


Nightcrawler wrote on Nov 11th, 2010 at 9:50am:
The only negative is requiring table modification for dumping vs. insertion in those cases. However, that modification is just having the user clarify what they actually want done in the duplicate key situation.

For any real scenario, as I explained earlier, the keys wouldn't map to the exact same text sequence if they weren't the exact same text sequence to begin with! So I really see no point in duplicating work for the end user here instead of either choosing entries that guarantee optimal insertion lengths, or going with a simple implementation and having the user put the shortest hex sequence towards the bottom of the table if he so desires.
If the user just cares that his script is inserted and the hex sequences really map to exactly the same text sequences, everything will be in order in any case, simple implementation or not!


Nightcrawler wrote on Nov 11th, 2010 at 9:50am:
You'd like to see the hash table paradigm broken in the table file and add the exceptions to the code so the internal hash table never sees them. The disadvantage is the chance behavior and increased code complexity.

I have shown it does not require any additional code, because many standard-library hash map implementations already behave such that adding a value under an existing key simply updates the map.
The disallowed cases here should be the ones I mentioned in my last post.


Nightcrawler wrote on Nov 11th, 2010 at 9:50am:
I understand you could define the behavior, but that would further require code.

Again, it doesn't require any more code, and I have shown that. Also, we keep going back and forth between table file definition and ease of implementation.
I feel my proposition is true both to a table file definition that was always there and has been used in many programs, and to ease of implementation!


Nightcrawler wrote on Nov 11th, 2010 at 9:50am:
Do you agree with the advantage vs. disadvantage analysis of both?

Certainly not. I see no disadvantage for the user. I see the disadvantage of Joe Shmo having to create a duplicate table when the thing he most likely wants can be accounted for in one table!

Jeez, this is such a common scenario and a good case can be made for inserting the shortest hex sequence in cases of exact text sequence matches. I see this as needlessly over-complicating matters on your side.
I see this as a really simple extension to the idea of a hashmap and obviously so do major programming language APIs!
It seems you just don't want it to work, just because.

cYa,

Tauwasser

Title: Re: 'Standard' Table File Format
Post by Nightcrawler on Nov 12th, 2010 at 7:22pm

Quote:
I already showed above that VB.NET, C++.NET and C#.NET do support this!


You've shown the behavior of those implementations when you try to stick a duplicate key in there. Since hash tables do not allow duplicate keys, those implementations replace. That still does not change the fact that hash maps do not allow duplicate keys to be stored.


Quote:
I have just shown that nothing you just said holds. It does not require more code, it works with a simple HashMap type in all these language, it does not duplicate keys, it's not ambiguous and it's not less efficient!


Correct, it does not duplicate keys because, as I've said, a HashMap does not allow for duplicate keys. When you try to add one, it updates the existing entry.


Quote:
It's not an undefined process. The length of the hex sequence is determined, as mentioned above, by how the programmer codes his routine (first and foremost) and by the order in which table entries are read in a simple implementation. There is nothing undefined about this. If the table mapping is correct, insertion will yield proper data, just not optimal data size-wise.


You just said in your last post "So then the insertion process would be determined by chance:" Anything determined by chance is undefined. :o


Quote:
Notice how duplicate keys on the hexadecimal side will throw an exception, while they won't throw an exception on the text-side. It doesn't get easier than that! And the insertion behavior is clearly not undefined.
This implementation will always insert for any text sequence the appropriate hex sequence that was specified as the last occurring entry of that text sequence in the table.
Doesn't seem so undefined to me. It's just not guaranteed to be optimal!


Yes, I understand. I'm not sure I like table file order dictating what a value should be mapped to. Order should serve no purpose in a hash map. Secondly, if that last occurring entry is undesirable, you'd still have to alter your table to make it work the way you want. In that case, you still need an altered/second table, which would be no different from my solution.

Lastly, it would be undefined behavior unless it was explicitly put in the table standard that the last occurring entry takes precedence when there is a duplicate key conflict in the table. Is that what you're proposing?


Quote:
Why shouldn't the user be allowed to have the comfort of dumping and inserting with the same table?


They should be if it's possible. In this case, it's only possible if you rely on a somewhat obscure and non-intuitive (in my opinion) rule that the last occurring entry takes precedence. An end user might well wonder what happens when the same text sequence is mapped to different hex sequences. They'd have to go looking for the specification on this case. With the way I had it, it was completely clear. Only the one you want is allowed. No guesswork. If the user wants to insert, the user clearly tells it what to map to. No reliance on a special rule like the last occurrence.


Quote:
Again, it doesn't require any more code and I have shown that. Also, we keep coming and going back and forth between table file definition and ease of implementation.

I feel my proposition is true to a table file definition that was always there and has been used in many programs as well as ease of implementation!


Yes, I concede the operator functionality requires no additional code. I did not think about using the implementation in that manner. I'm glad you brought that to my attention. I am very used to using Dictionary.Add, which throws an exception.

I do go back and forth with definition and ease of implementation. It's important to me. First, I would never develop a standard I didn't feel comfortable with implementing. If I don't want to program it, I wouldn't bother working on and releasing it. Second, many in our community are struggling self-taught programmers. They're lucky to be able to program anything at all. So, if this has any shot of being adopted at all, it needs to be simple enough that some of those guys might be able to grasp it (think IPS vs. other patching formats). I fear as soon as we added table switching, though, it probably became out of reach for many of those people anyway. :( So, I am struggling to balance what's best, the definition, ease of implementation, and my own personal preference (I'm shepherding this, so I had better like it). It's a tough balancing act for sure.



P.S. I'm off to Hawaii tomorrow. I will give this further consideration and thought on the plane. I won't be able to respond further until the 25th or so.

Title: Re: 'Standard' Table File Format
Post by Tauwasser on Nov 13th, 2010 at 11:02am

Nightcrawler wrote on Nov 12th, 2010 at 7:22pm:
You've shown the behavior of those implementations when you try to stick a duplicate key in there. Since hash tables do not allow duplicate keys, those implementations replace. That still does not change the fact that hash maps do not allow duplicate keys to be stored.


And I never argued that they would allow duplicate keys. You came up with that. Surely they cannot map identical keys to different values; however, "adding" (or appending, or whatever an item operator would be) will work just as well for duplicate keys as it does for unique keys.


Nightcrawler wrote on Nov 12th, 2010 at 7:22pm:
You just said in your last post "So then the insertion process would be determined by chance:" Anything determined by chance is undefined. :o


Bad wording. What was really meant was that it is dependent on the order in which the table file is read. I don't support using this naive solution, though; I merely argue that it works well for people who just want to get their script inserted and don't care about optimal script size.


Nightcrawler wrote on Nov 12th, 2010 at 7:22pm:
Yes, I understand. I'm not sure I like table file order dictating what a value should be mapped to. Order should serve no purpose in a hash map.


Therefore I propose taking the optimal solution: the shortest hex sequence for any given text sequence should be the one mapped from text to hex. This can be done with a one-line if-statement.


Nightcrawler wrote on Nov 12th, 2010 at 7:22pm:
Secondly, if that last occurring entry is undesirable, you'd still have to alter your table to make it work the way you want. In that case, you still need an altered/second table, which would be no different from my solution.


Why would there be undesirable entries in the table to begin with? Surely, if I don't want 0x3B dumped as "Test", I would not include it?
If "undesirable" is supposed to mean that one hex sequence (optimal or not) must not be chosen for insertion of "Test", then this really boils down to my argument from before:
The two "Test" strings are not identical and therefore the table should not map two different hex sequences to "Test" to begin with.
This is therefore really the responsibility of the guy making the table file.
A hex sequence that doesn't print the exact same data as another hex sequence should not be included in the table file as mapping to the same text sequence. This is exactly the same case as the following:


Code:
00=0
01=0


While 0x01 in the game actually prints "1", not "0". If I now were to insert "0", I could (and indeed should) choose the optimal hex sequence length, which is now either of the two. However, if 0x01 does not mean exactly what 0x00 means, the table is invalid in itself and no further argument needs to be presented IMO.
If a table file faithfully represents what the ROM maps values to and from, everything that is inserted will be valid.


Nightcrawler wrote on Nov 12th, 2010 at 7:22pm:
Lastly, it would be undefined behavior unless it was explicitly put in the table standard that the last occurring entry takes precedence when there is a duplicate key conflict in the table. Is that what you're proposing?


I propose putting in there that for any duplicate text sequence one should try to insert the optimal-length hex sequence. However, really, not putting a guideline in there will not hurt from a practical point of view either, as implementations will still work and produce valid data, presuming the table file is correct.


Nightcrawler wrote on Nov 12th, 2010 at 7:22pm:
In this case, it's only possible if you rely on a somewhat obscure and non-intuitive (in my opinion) rule that the last occurring entry takes precedence. An end user might well wonder what happens when the same text sequence is mapped to different hex sequences. They'd have to go looking for the specification on this case. With the way I had it, it was completely clear. Only the one you want is allowed. No guesswork.


I think minimizing the script size is not guesswork but a good (if not optimal) heuristic for many games on older platforms, since this is a linear problem (minimized length of each constituent of the data results in minimized length of the whole data). Of course, if the user happens to want to insert a longer sequence, he can still edit the table file for insertion. However, this is likely not the case for 95% of all users.


Nightcrawler wrote on Nov 12th, 2010 at 7:22pm:
I do go back and forth with definition and ease of implementation. It's important to me. First, I would never develop a standard I didn't feel comfortable with implementing. If I don't want to program it, I wouldn't bother working on and releasing it. Second, many in our community are struggling self-taught programmers. They're lucky to be able to program anything at all. So, if this has any shot of being adopted at all, it needs to be simple enough that some of those guys might be able to grasp it [...].


First off, I think all programmers are self-taught. I have yet to experience a programming lesson where I actually see people learning stuff. Most of the time at uni, you either are already a programmer and don't get anything out of the hundredth explanation of data types, or you're a newbie and don't get enough experience out of the programming project for it to do something for you...

So I think being self-taught is not the problem here. However, I have often been exasperated at how unwilling, or maybe inept, some people are when it comes to reading the API. I have seen people reinvent the wheel so many times when literally one swift look at the API would have solved their programming misery and saved them days of work by using an API implementation.

I think I know what you're getting at. There are several options with the propositions as they are on the table:
  • Don't specify how this problem is to be solved. A naïve implementation will be correct and work.
  • Specify that implementations must heed the order from top to bottom in which the table is read, with new values for a key updating old values in the insertion hash map. This gives the user control, so he can rearrange his table accordingly.
  • Specify that new values for a key update old values in the insertion hash map iff the new values are shorter than the old values. This is optimal behavior for insertion, as the resulting script will be inserted with the smallest size possible. If a user does not want this to happen, which should be the exception presuming their table files are correct to begin with, they can still create different tables for dumping and inserting and have total control.

I would go with the last one, simply because it seems to be the most reasonable use case of all of them.


Nightcrawler wrote on Nov 12th, 2010 at 7:22pm:
I fear as soon as we added table switching, though, it probably became out of reach for many of those people anyway.


Well, not necessarily. With .NET LINQ queries over arrays and data tables, it should really be a breeze to find the shortest hexadecimal sequence for a given string out of all tables. I can't say for C++, since I haven't programmed in it long enough to have been confronted with querying data tables.
There are several libraries out for Java which add LINQ-like support to it, so collections can be searched.
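
For example, a rough VB.NET sketch (the helper name and the idea that each loaded table exposes its text-to-hex map as a Dictionary are my assumptions, not part of the spec):

Code:
    'Requires Imports System.Linq
    'Return the shortest hex sequence mapped to the given text
    'sequence across all loaded tables, or Nothing if none has it.
    Function ShortestHex(ByVal tables As List(Of Dictionary(Of String, String)), _
                         ByVal text As String) As String
        Return (From tbl In tables _
                Where tbl.ContainsKey(text) _
                Select tbl(text)).OrderBy(Function(hex) hex.Length).FirstOrDefault()
    End Function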


Nightcrawler wrote on Nov 12th, 2010 at 7:22pm:
P.S. I'm off to Hawaii tomorrow.


Happy Holiday :D Chill out and enjoy sweet life while it lasts.

cYa,

Tauwasser

Title: Re: 'Standard' Table File Format
Post by Nightcrawler on Nov 30th, 2010 at 4:06pm
I'm back! I've hiked an active volcano, snorkeled with dolphins and sea turtles, been above the clouds, seen magnificent waterfalls, and swam a secluded lagoon. Very nice trip!

I've re-read the posts of this discussion and given it some more thought. I still don't like the idea of non-injective tables. I'd prefer to see each normal table file line load 1:1 for the dumping or insertion direction, whichever is chosen. However, I recognize the user convenience of a single table file versus needing two in some of the cases discussed. I also recognize your demonstration that it can be accomplished with minimal code. So, I will concede and let the format allow it.

Let's summarize the cases again and make sure we're in agreement:

  • Hex and text sequences are unique:
    'Normal' entry, loaded 1:1 in the dumping direction, including empty strings. Empty strings are skipped in the insertion direction.
  • Hex sequence is unique, text sequence is duplicate:
    Loaded 1:1 in the dumping direction; the empty string is included. In the insertion direction, the shortest hex sequence is chosen to add to the map for that text sequence. If multiple hex sequences of the same size exist (mapping to the same text sequence), it does not matter which is chosen. However, if I do specify, I would choose the last occurring in the table file. If the text sequence is an empty string, it is not loaded into the insertion map.
  • Hex sequence is ambiguous, text sequence is unique or ambiguous:
    An exception/error should be generated on duplicate or unrecognized hex sequences.
  • The hex sequence is the empty hex sequence, text is unique or ambiguous:
    An exception/error should be generated on empty hex sequences.


So, we'd have two changes. One for illegal sequences.

The only illegal sequences will be:
Duplicate Hex Sequences
$BA=one
$BA=two

Blank Hex Sequences
=one

The second change would be to allow for duplicate text sequences. We specify that hex sequences mapping to blank text sequences are ignored for insertion. We specify that the shortest hex sequence should be used when there are duplicate text sequences.
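
Put together, loading one normal entry for the insertion direction under these rules could look like this sketch (hexTextMap/textHexMap and the helper name are placeholders, not from the spec):

Code:
    'Apply the agreed rules to one parsed entry (hexSeq, textSeq)
    Sub AddEntry(ByVal hexSeq As String, ByVal textSeq As String)
        If hexSeq.Length = 0 Then Throw New FormatException("Blank hex sequence")
        'Duplicate hex sequences are illegal; Add() throws here
        hexTextMap.Add(hexSeq, textSeq)
        'Blank text sequences are ignored for insertion
        If textSeq.Length = 0 Then Return
        'Duplicate text sequences: the shortest hex sequence wins
        If Not textHexMap.ContainsKey(textSeq) _
            OrElse hexSeq.Length < textHexMap(textSeq).Length Then
            textHexMap(textSeq) = hexSeq
        End If
    End Sub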

OK?


Quote:
Well, not necessarily. With .NET Linq-Queries over arrays and datatables, it should really be a breeze to find the shortest hexadecimal sequence for a given string out of all tables. Can't say for C++, since I haven't program long enough in it to have been confronted with querying data tables.
There are several libraries out for Java which add Linq-like support to it. So collections can be searched.


I actually haven't used LINQ much. I'm not sure how it would work with all the multiple table examples we've had.

I actually implemented this with the simple .NET Stack() class. Of the options, it seemed to me to most closely match the table switching concept as we logically wrote it out. The dumper has an array of loaded table objects with an index for the active table object to use for decoding. When a new table is requested, the current index is pushed onto the stack and the new active table object is called on. When a table expires for any reason, the last active table object index is popped from the stack and restored. This allows for as many layers of table jumps as necessary and keeps track so it can fall back all the way to the starting table if necessary.
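
In sketch form (variable names assumed, not from my actual code):

Code:
    'Stack-based table switching for the dumper
    Dim tableStack As New Stack(Of Integer)()
    Dim activeIndex As Integer = 0        'start table

    Sub SwitchTable(ByVal newIndex As Integer)
        tableStack.Push(activeIndex)      'remember where we came from
        activeIndex = newIndex
    End Sub

    Sub ExpireTable()
        'Fall back one level, eventually all the way to the start table
        If tableStack.Count > 0 Then activeIndex = tableStack.Pop()
    End Sub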

Title: Re: 'Standard' Table File Format
Post by Tauwasser on Dec 11th, 2010 at 8:36am
Hi, sorry for not answering sooner; I had to take care of some RL stuff.

I think we're pretty much lined up. The only issue I had was the following:


Nightcrawler wrote on Nov 30th, 2010 at 4:06pm:
  • Hex sequence is unique, text sequence is duplicate:
    [...] However, if I do specify, I would choose the last occurring in the table file.


[...]

The second change would be to allow for duplicate text sequences. [...] We specify that the shortest hex sequence should be used when there are duplicate text sequences.


I guess you're now set on specifying the shortest hex sequence for duplicate text sequences and mentioned the other only as another possibility? Either way, I agree with the last statement: go for the shortest hex sequence for a duplicate text sequence.


Nightcrawler wrote on Nov 30th, 2010 at 4:06pm:
I actually haven't used LINQ much. I'm not sure how it would work with all the multiple table examples we've had.


This would only matter for insertion. You would have to query each table individually for a given text sequence; pretty much a for loop over all tables will suffice in most cases.
Of course, there is additional logic involved for table switching, so as to get the shortest hex sequence over the whole text for insertion purposes, so that unnecessary table switching can be found and averted.

cYa,

Tauwasser

Title: Re: 'Standard' Table File Format
Post by Nightcrawler on Dec 14th, 2010 at 12:02pm
I've put a new draft revision up. It's mostly edits and fixes; however, the revision of section "2.2.5 Illegal Sequences" and the addition of section "2.2.6 Ambiguous Situations" were done as a result of the discussions Tau and I have had here. I mentioned all cases in a way I thought would be clear to the target audience and fit within the style set by the document. This led me to sort the results of our discussion into two groups: one for truly illegal sequences, and the other for cases where ambiguity exists. If you have a better idea for the organization or presentation of these cases, I'm open to suggestion.

I believe this will be the last real content change and focus can move to a final edit and format. At this point, I should probably go around one more time and get all interested parties to sign off on the spec.



Title: Re: 'Standard' Table File Format
Post by Tauwasser on Jan 23rd, 2011 at 9:48pm
Hi Nightcrawler,

I finally got to reading through your spec once more from top to bottom.

I uploaded a list of errata as well as suggestions to my Google site. I find there are only minor specification issues left, so you did an excellent job ;)


cYa,

Tauwasser

Title: Re: 'Standard' Table File Format
Post by Nightcrawler on Feb 7th, 2011 at 2:31pm
Thanks for the list. I've been meaning to do another complete edit myself. I will go through it in detail at a later date. From what I see after browsing, I will probably have no issue with most changes. I've shifted gears for a while on this in favor of some progress on other projects. I will get back to it in conjunction with more work on my utility. I've found it gives some good insight to develop a utility using the standard alongside it.

Speaking of which, an issue did come to my attention during my utility development with comments. If you recall, in the beginning, it was proposed to use '\n' and '\r' as newlines with and without comments. Then, after a comment by DaMarsMan, it seemed redundant and unnecessary to do that.

If you look at the example in 2.3, if you want to dump a script with commenting characters of, say, "//", we have this:


Quote:
Table Entries:
     FE=<linebreak>\n//
     /FF=<end>\n\n//


In practice, this seems to cause some small issues. First, the very first line wouldn't have any comment characters; it's only AFTER a line break that you'd see them. Second, you also have an issue with the very last line. Your script file will likely end with an '<end>', which will make for a blank commented line at the end.
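
To illustrate with a hypothetical two-string dump (made-up text), the output would look like:

Code:
String 1, line 1<linebreak>
//String 1, line 2<end>

//String 2, line 1<end>

//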

I've handled this in my utility by having a template, since users are likely to want some sort of heading or footer information present. In there I allow for a $script variable, so the final output will put your script dump there. So I can easily take care of having the first line commented by simply putting '//$script' in it. In fact, that's what I did for Heracles IV (I used the WIP utility for it). The final line of each file is still a commented blank line, but it doesn't affect insertion in any way.

So, it works, but it just seems a little hackish to me. There's also the need for the utility to know what the comment characters are for insertion purposes. Since the table does not currently define them, you need to define them again in the utility. It's made me rethink the old way, and also the option of not having commenting characters present in the table file at all. Instead, that could be pushed entirely to the utility, where you'd specify that you want comments after every line break or end token. I'm not sure that's a good idea either, as we'd be giving up the ability to easily attach any number of line breaks and characters to any table entry.

Nothing I've thought of really feels 100% right to me. There seem to be drawbacks with every idea I've come up with. Any ideas?

Title: Re: 'Standard' Table File Format
Post by Tauwasser on Feb 10th, 2011 at 4:14pm

Nightcrawler wrote on Feb 7th, 2011 at 2:31pm:
Speaking of which, an issue did come to my attention during my utility development with comments. If you recall, in the beginning, it was proposed to use '\n' and '\r' as newlines with and without comments. Then, after a comment by DaMarsMan, it seemed redundant and unnecessary to do that.


I would still find it redundant and unnecessary simply for the fact that \r and \n usually mean other things.


Nightcrawler wrote on Feb 7th, 2011 at 2:31pm:
If you look at the example in 2.3, if you want to dump a script with commenting characters of, say, "//", we have this:


Quote:
Table Entries:
     FE=<linebreak>\n//
     /FF=<end>\n\n//


In practice, this seems to cause some small issues. First, the very first line wouldn't have any comment characters; it's only AFTER a line break that you'd see them. Second, you also have an issue with the very last line. Your script file will likely end with an '<end>', which will make for a blank commented line at the end.


I was actually wondering about that, but since I have never used Atlas before, I don't know if this would be good or not.
From what I see, it's the user's option what to put there and whether he wants to put some newlines in there. You cannot possibly account for every tool out there and for every mix.
For one, it would be desirable to have a code for stuff to put before the dumped text, yet the data in most ROMs just doesn't lend itself to this idea, which is the problem you are experiencing.
Secondly, I think that having to dump in accordance with script formatting guidelines using some tool downstream in the production chain is actually desirable and nothing to put into the table file standard, since this would open you up to having to update your standard in accordance with some new tool X that not only allows commenting characters, but possibly binary data insertion characters (a recent thread on RHDN; it seems Atlas cannot handle this currently).

Therefore I think it's best to leave it up to the user to decide what to put there. Personally, I would nix the formatting codes altogether, for reasons I have stated earlier:
  • You don't define \\ as an escape sequence for the literal backslash, opening the codes up to incompatibilities with user desires (because some people just need their slashes in there).
  • There are many powerful and dedicated text editors out there. If a user wishes, he can put "\n" in the table file, it gets dumped as a literal "\n" to the dump file, and then it's only ever one expansion away from being a real newline.
  • It's doubled effort for the most part and trouble for people to implement. This idea needs scanning the text with a crude algorithm anyway. Incorporating a "$script" keyword in there would only complicate matters further. Also, it might be a novelty to you, but the word "script" is mostly not understood to mean "text" in non-English-speaking parts of the world, so it'd be IMO a bad choice anyway.
    This job can be perfectly handled by the text editors out there, or by special dump utilities that choose to support this option outside of the standard, thus not being incompatible with it and future updates (barring updates that would reintroduce control codes).


I personally think if you want to have your script like this, you could always do the following:


Code:
FE=<commentedline>\n


Then it's a quick regex to get the line break and another one to get the whole line commented with "//" or whatever from the start. Much easier and arguably more versatile. Also, it's pretty much doable with minimal experience in Java and a batch file.
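
As a sketch of those two steps in VB.NET (one possible reading of the idea; 'dumpText' is an assumed name):

Code:
        'Requires Imports System.Text.RegularExpressions
        Dim output As String = dumpText
        'Strip the marker; the \n already produced the real line break
        output = output.Replace("<commentedline>", "")
        'Comment every line from the start with "//"
        output = Regex.Replace(output, "^", "//", RegexOptions.Multiline)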


Nightcrawler wrote on Feb 7th, 2011 at 2:31pm:
So,  since the table does not currently define them, you need to define them again in the utility.


Again, I think incorporating this kind of tool-specific logic into the table file is not worthwhile. I find it makes a great deal of sense to dump a script in a specific way for further processing.


Nightcrawler wrote on Feb 7th, 2011 at 2:31pm:
Nothing I've thought of really feels 100% right to me. There seems to be some drawbacks with every idea I've thought of. Any ideas?


Well, first of all, I noticed that the standard lacks two things, in addition to the other list:

  • There ought to be a sentence explicitly forbidding implementation of more control codes for the sake of compatibility with future updates to the standard.
  • Having an end token once with and once without hex sequence yielding the exact same text sequence results in undefined insertion behavior and should be clarified.


Other than that, no, I don't have any thoughts on that. I used to write my own utilities for text insertion that would also pre-format the text to my liking, so I really don't know what it takes for Atlas or Cartographer. However, I advise against implementing any more logic that, strictly speaking, only the insertion utilities themselves need for inserting text. A table file doesn't need to know that "//" is magical in Atlas for dumping via any utility to work properly. I understand that this part of the spec as well as its shortcomings are a direct consequence of Atlas' popularity. I think it would be reasonable to expect the table file standard to work properly on its own without implementing mechanisms for specific tools' needs, because most situations can be covered by expanding certain user keywords after the dumping via regex or even simple search-and-replace.
So, in closing, I think less is more here.

cYa,

Tauwasser

Title: Re: 'Standard' Table File Format
Post by Nightcrawler on Feb 16th, 2011 at 3:16pm
I agree with you that we don't want to push any more into the table file, nor do we want to do anything tool-specific. I would even almost agree with your idea to remove the formatting entirely. However, I don't think that's a good idea in practice as you describe, because:


  • You require a second step after the initial dumping utility to get any line breaks in the script at all (referring to running regex or programming a rudimentary Java utility). It seems that complicates the process for no good reason.
  • Many end users will not be programmers or understand regex. In my opinion, they shouldn't have to be skilled in those areas in order to be able to dump simple text with a utility using this standard.
  • The vast majority (if not all) of applications of dumping ROM text to a text file require line breaks in some capacity. It simply isn't very readable to a human without them. For such a strong usage case, it makes sense to me that users expect line breaks in their text dump. It either has to be included in the standard or at least be incorporated into the dumping utility somehow.


Also, we're not doing anything Atlas-specific with this whole commenting thing. It's desirable in general terms (for translation work) to be able to dump text from a ROM and precede it with some distinguishing mark so it will not be included in the insertion later. How else could you omit it? Again, probably the vast majority of cases will want to do this. We don't need to know what a commenting delimiter is in the table standard (the insertion utility will, however), but we do need to be mindful to make this possible in some way with our table format. The way we do it now accomplishes that, aside from the side effects of the first- and last-line issues I described. To me, the problem here isn't whether or not to include a line break formatting code, but simply how to better facilitate a predominant usage case.

I understand that a post-process with regex or another utility could easily translate user-defined keywords. I just don't think it should be a requirement in the process just to get simple line breaks in your script dump! As I mentioned earlier, the end users aren't going to be able to do that easily. They want to simply be able to use a single utility to dump or insert text. A one-step process in most cases. If you have need for more advanced scenarios, you are probably writing all your own stuff anyway, as you've already mentioned you do.

Perhaps my vision is blurred by working on both the table standard and a utility that uses it, but the whole thing should simplify and better the process we have in place now, or why bother? I don't really want to push a standard that would require more steps in the process. That doesn't make sense to me.

Title: Re: 'Standard' Table File Format
Post by snarf on Mar 4th, 2011 at 8:12pm
Hi guys. I've been working on a program that will utilize table files. Working to your specification seemed like my best bet. Having gone over the spec a few times and implemented a reader, I have a couple of questions and comments. Sorry if any of it has already been addressed; there is a lot to read through between this thread and the spec.

1 - Hex Casing
I'm pretty sure that it's a safe assumption that either casing is acceptable, but you might want to make this explicit in the spec. All the examples use upper-case, and an implementer might not think to support lower-case hex digits.

2 - White-Space
Likewise, white-space is not addressed in the spec. I've made some assumptions here in my own implementation, but the spec should probably be explicit. I'd say it would be wisest to avoid un-called-for white-space when outputting a table file, but to trim any extra (non-ambiguous) white-space when parsing the table file. For example:

Code:
// Would this line be considered valid?
B1FF = TEXT
// The text part of the entry shouldn't be trimmed; the white-space should
// be preserved in the output. But do we interpret this as having a trailing
// space after the "=" (trimmed) or a leading space included in " TEXT"?

// I've chosen to interpret a line like this as if it were equivalent to
B1FF= TEXT
// where the string includes a leading space.

It seems like the best thing would be for the spec to specify that there should be no white-space except as part of a text string, but recommend parsers be tolerant of erroneous white-space. Alternatively, it could specify that extra white-space is either acceptable or completely invalid.

3 - Endianness
I'm just looking for a little clarification here. You discuss endianness in the spec, but I'm not clear as to how it actually comes into play. When dumping text, aren't we viewing ROM data as a byte stream? In this case there is no endianness, just single bytes, and I would expect the hex values in the table to be specified in actual order.

I understand that platforms use multi-byte words that have endianness. One system might represent the value 0xDEAD as {0xDE, 0xAD} and another as {0xAD, 0xDE}, but it seems that the table file should define an entry using the actual byte order. I would expect the table file to define the hex value in the same manner it would appear in a hex editor. Perhaps I am completely misunderstanding the significance of endianness in the context of the table file.

4 - Reference Implementation
Just a thought, really. I know the spec is still in a draft stage, but it would be a great idea if, once the spec is finalized, a reference implementation were written.

5 - Behavior for dumping/inserting
The specification identifies proper behavior for dumping and inserting. While this information is useful and relevant, it is not part of the file format and thus technically does not belong in the text of the spec. A quick example: section 2.2.1 explains what should happen when an entry is not found, but this information doesn't define the table file format; it prescribes behavior for dumping and inserting.

It's obviously doing more good than harm being there, but should probably be identified as supplemental information.

6 - Lexical Definitions
Again, just a thought. The grammar of the format is pretty simple, and this certainly isn't a must, but I see value in a formal lexical definition of the format, which eliminates ambiguity. E.g.:

Code:
HexDigit :=
    0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F
HexByte :=
    HexDigit HexDigit
HexValue :=
    HexByte+
// Where + would denote 1 or more instances of a value
// Et cetera...


7 - End Tokens
From the spec, section 2.4:

Quote:
There may be an unlimited number of end tokens. Two variations are supported. One with the '=' character for end tokens requiring output to the script, and one for end tokens that will not appear in the script...An end token can be defined having no actual hex sequence associated with it. No actual hex sequence will be inserted in these cases.

It sounds like you are describing three different types of end tokens: normal end tokens (/FF=<END>), those that will not appear in the output script (/FF), and those that have no actual hex sequence associated (/<END>). I think this really needs clarification, especially since the latter two most certainly introduce ambiguity. (Is /abcd meant to represent $ABCD or "abcd"?)

Also, what happens when dumping if there are multiple end tokens without associated hex? I.e., how does the dumper decide which to insert? It would make sense to impose a limit of only one of these no-hex end tokens.



Sorry, this list started out a lot shorter, but I keep coming up with questions as I work on my utility.

Title: Re: 'Standard' Table File Format
Post by Nightcrawler on Mar 7th, 2011 at 2:04pm
Thanks for the feedback. Not too many people are interested in this type of thing. :)


snarf wrote on Mar 4th, 2011 at 8:12pm:
1 - Hex Casing
I'm pretty sure that it's a safe assumption that either casing is acceptable, but you might want to make this explicit in the spec. All the examples use upper-case, and an implementer might not think to support lower-case hex digits.


Correct. It's assumed to be either case. I agree it should be explicit. It seems to make sense to put this into a lexical-definition-type section defining what a hexadecimal sequence is for the purposes of this document. I'm not sure where it best fits into the document. Maybe at the beginning of section 2.0?


Quote:
2 - White-Space
It seems like the best thing would be for the spec to specify that there should be no white-space except as part of a text string, but recommend parsers be tolerant of erroneous white-space. Alternatively, it could specify that extra white-space is either acceptable or completely invalid.


There should be another note added to 2.2 that lines cannot start with white-space. The term 'notes' should probably be changed to 'rules' as well. The whole thing has been designed with the idea that each line's entry type can generally be determined by simply parsing the first character on the line. So, I agree we don't want to allow white-space here, to keep parsing simple. White-space doesn't exist to the right of the '=' character; for our purposes, it's part of the text sequence. So, I think we're in agreement here.


Quote:
3 - Endianness
I'm just looking for a little clarification here. You discuss endianness in the spec, but I'm not clear as to how it actually comes into play. When dumping text, aren't we viewing ROM data as a byte stream? In this case there is no endianness, just single bytes, and I would expect the hex values in the table to be specified in actual order.


It doesn't really come into play. We're always talking big endian for table entries, but that's only clear from examples, not from a direct statement. The table file, by its nature, must be big endian to function in the same manner as it appears in the ROM or byte data stream. We're defining that.


Quote:
4 - Reference Implementation
Just a thought, really. I know the spec is still in a draft stage, but it would be a great idea if, once the spec is finalized, a reference implementation were written.


I have written a WIP one in C# alongside developing this draft for my own utilities. I would anticipate releasing it when/if it eventually reaches release-worthy status. I'm really only an intermediate-level programmer as far as software design, organization, and efficiency go. I'm not sure my code is really worthy of being used as an exemplary model to follow. In any event, this one in C# would be the only one I personally would code. Feel free to code one! Having several is no problem. :)


Quote:
5 - Behavior for dumping/inserting
The specification identifies proper behavior for dumping and inserting. While this information is useful and relevant, it is not part of the file format and thus technically does not belong in the text of the spec. A quick example: section 2.2.1 explains what should happen when an entry is not found, but this information doesn't define the table file format; it prescribes behavior for dumping and inserting.

It's obviously doing more good than harm being there, but should probably be identified as supplemental information.


I understand where you're coming from. I've thought this over before, and after consultation with several others, it was decided the document should encompass the behavior necessary to standardize mapping using this format. The scope includes everything necessary to map hex to text and text to hex using this format. We're already defining sequences in the format and what to do with them, so why can't we make explicit what should happen during ambiguous, duplicate, or not-found type cases? Why not make a standard so that a reference implementation will always match what users get from every tool?

Let's pretend we took out all behavioral information in that document. Now you write a dumper/inserter and I write a dumper/inserter. What's the result? We're going to get output VERY different from each other, defeating the purpose of the whole thing. Cross-utility use would not be possible because all of those situations are treated differently at will between my utilities and yours. If that's the case, who really cares if our table file is the same when we get vastly different incompatible results? You've lost all interoperability and the reason to set out with this standard to begin with.

The way things are, I imagine your implementation and my implementation generate very compatible output. :)

We're not dictating how to dump or insert, merely how to map using our table file format. Yes, there is a bit of overlap, as the table file and its usage are a core part of the dumping and insertion process and task.


Quote:
6 - Lexical Definitions
Again, just a thought. The grammar of the format is pretty simple, and this certainly isn't a must, but I see value in a formal lexical definition of the format, which eliminates ambiguity. E.g.:


This connects to number one. It's a fine idea and I may add some definitions. However, I doubt I will develop it to the extent your example alludes to. I just don't have much interest in writing it. If you would like to write something like that out for the document, I would surely include it.


Quote:
7 - End Tokens
From the spec, section 2.4:
It sounds like you are describing three different types of end tokens: normal end tokens (/FF=<END>), those that will not appear in the output script (/FF), and those that have no actual hex sequence associated (/<END>). I think this really needs clarification, especially since the latter two most certainly introduce ambiguity. (Is /abcd meant to represent $ABCD or "abcd"?)


It needs clearer wording; however, there are indeed only two cases. Basically, you either have or do not have a hex representation. Let me clarify:

Case 1:
/FF=<END>\n\n

In this case, this end token has a hex equivalent of 0xFF, right? So, it will appear in the script when you dump, and when you insert, it would be expected to insert 0xFF into the ROM. Clear?

Case 2:
/<END>\n\n

In this case, this end token has no hex equivalent. This is used in all cases where you need an end token for string start/stop and pointer applications, but neither need nor have any hex representation for dumping or inserting. Think fixed-length strings, Pascal strings, or similar scenarios where an artificial end token may be desired. Take a look at the Atlas documentation for even further usage scenarios. Suffice to say, we need these two variations of end tokens, but together they should cover all possible scenarios.

There is no ambiguity. "/abcd" sets up an end token with text sequence "abcd". You are using the variation with no hex sequence, so it's always a text sequence. "/$abcd" sets up an end token with text sequence "$abcd".


Quote:
Also, what happens when dumping if there are multiple end tokens without associated hex? I.e., how does the dumper decide which to insert? It would make sense to impose a limit of only one of these no-hex end tokens.


That's up to the dumper. In fact, it has to be. Since there is no hex representation, there's nothing to define in the table file about it. It won't be in the dumping data stream; it's in the table for insertion purposes. Secondly, you might be dumping somewhere where you end up outputting different end tokens in the same dump.

If it's not clear and you want to discuss a specific scenario, I'll explain further. I think the end token setup here is pretty solid and pretty much follows Atlas, which has been around for a number of years and has been able to cover all usage scenarios I'm aware of.


Quote:
Sorry, this list started out a lot shorter, but I keep coming up with questions as I work on my utility.


No problem. It's good to have feedback and discuss things further to ensure we end up putting out a decent standard that all of 6 people will probably use. I'm glad you're one of them! :)

Title: Re: 'Standard' Table File Format
Post by Nightcrawler on Mar 18th, 2011 at 3:19pm
OK, I addressed all items in Tau's errata document and the few items in the last several posts in this topic. The most significant change is the rewrite of section 2.4 to better clarify end tokens. The specification did not change on this, but the whole section was not very clear. You know, this stuff all makes perfect sense in my head, but I find it difficult sometimes to write it all out in a clear fashion. This project has certainly shown me there's plenty of room left for improvement in the technical writing department! I'm getting better though. :)

One comment:

Quote:
*557      |You used to write table ids in all-caps in the other examples

I think this helps demonstrate that table ids can be upper- or lower-case rather than it being a document consistency issue. If they were all upper-case, it might lead the reader to think they are required to be that way.

Title: Re: 'Standard' Table File Format
Post by Tauwasser on Apr 4th, 2011 at 2:11pm
Hi,

first of all, I'm sorry for not answering in such a long time. Had to earn a degree and stuff ;)


Nightcrawler wrote on Feb 16th, 2011 at 3:16pm:
I would even almost agree with your idea to remove the formatting entirely. However, I don't think that's a good idea in practice as you describe because:


  • You require a second step after the initial dumping utility to get any line breaks in the script at all. (Referring to running regex or programming a rudimentary java utility).  It seems that complicates the process for no good reason.
  • Many end users will not be programmers or understand regex. In my opinion, they shouldn't have to be skilled in those areas in order to be able to dump simple text with a utility using this standard.
  • The vast majority (if not all) of applications of dumping ROM text to a text file require line breaks in some capacity. It simply isn't very readable to a human without them. For such a strong usage case, it makes sense to me that users expect line breaks in their text dump. It either has to be included in the standard or at least be incorporated into the dumping utility somehow.


Also, we're not doing anything Atlas specific with this whole commenting thing. It's desirable in general terms (for translation work) to be able to dump text from a ROM and precede it by some distinguishing mark so it would not be included in the insertion later.


I admit I did not think about spacing between dumped strings. I personally feel that dumped strings do not need line breaks inside of them. But that's just my opinion.

As for separating the dumped strings, I think the dumping tool should handle that.

As for the usability issues with regex: personally, a regex syntax would be favorable for me. It is just one thing to remember, with no special codes that have to be looked up after half a year of not using a table file.
People will have to learn a lot as is.

Also, most implementers will not need to write a regex engine by hand (and most would not be able to, either). Instead, regex implementations for the major programming languages involved so far are available free of charge under non-restrictive licensing, if any.

Notice that a grouping system for regex already exists.


Code:
/FF=\n//\1


This would precede the whole dumped line (which would be capture group 1) with a line break and "//". Problem for Atlas solved. And yes, I do think this is somewhat specific to Atlas and old-fashioned ways that still find their way into ROM hacking in general.
Some standard regex strings could be given as well.
I advocate separating original and translated text, though. See below.
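
As a hedged illustration of how a dumper might expand such a formatting string (Python; the helper name is hypothetical), treating the entire dumped string as capture group 1:


Code:
import re

# Sketch only: expand a table formatting template such as "\n//\1",
# where \1 stands for the whole dumped string. re.sub expands both
# the \n escape and the \1 group reference in the template.
def format_dumped_string(dumped, template=r"\n//\1"):
    # Capture the whole string as group 1; (?s) lets "." span newlines.
    return re.sub(r"(?s)\A(.*)\Z", template, dumped)

print(repr(format_dumped_string("Hello[Name], how are you?")))
# -> '\n//Hello[Name], how are you?'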

However, for a slightly different (hypothetical) use case,


Code:
/FF=/1<END>\n//


might be a simple regex to meet the insertion utilities' format.


Nightcrawler wrote on Feb 16th, 2011 at 3:16pm:
How else could you omit it?


You would be able to omit it by not having it mangled in between the translated text to begin with. Quite a few, if not all, translation memory programs do this using XML or other generic file types. In that way, one text "dump" can be translated to multiple languages and later recombined at will.

However, I realize the Rom Hacking Community is generally ages behind that and still uses ANSI text files for most things. Still, if I had to choose, I would advocate using separate files.

A simple layout using Notepad++ (and recreatable in jEdit, which is cross-platform) could look something like this:

http://img218.imageshack.us/img218/5720/translationworkbench.th.jpg

Click to view in FullHD glory.

Please also notice that no line breaks inside dumped strings are used. Ignore the XML-like tags, they're used for my custom dumper/inserter.

I'd imagine text scrolling could be an issue, but luckily I don't use line breaks in text, so disabling word wrap would solve alignment issues.


Nightcrawler wrote on Feb 16th, 2011 at 3:16pm:
To me, the problem here isn't whether or not to include a line break formatting code, but simply how to better facilitate a predominant usage case scenario.


First off, a little regex never hurt anybody and I think many people can benefit from using it from time to time.

Secondly, as seen above, it would not require an all-out in-depth knowledge of regex to get it working the way some people want.

You could also use some predefined strings to mean the whole line, etc.
However, experienced programmers will likely implement it using regex to determine what type of entry and what line end code behavior is wanted anyway.
Inexperienced programmers are in a sea of pain because they have to do string comparisons -- which will likely turn out not to be in compliance with Unicode -- and have to spend much more time on implementation. Also, their implementation would then most likely break on specific non-ASCII cases when end tokens use accents, Japanese, etc.


snarf wrote on Mar 4th, 2011 at 8:12pm:
3 - Endianness
I'm just looking for a little clarification here. You discuss endianness in the spec, but I'm not clear as to how it actually comes into play. When dumping text, aren't we viewing ROM data as a byte stream? In this case there is no endianness, just single bytes, and I would expect the hex values in the table to be specified in actual order.

I understand that platforms use multi-byte words that have endianness. One system might represent the value 0xDEAD as {0xDE, 0xAD} and another as {0xAD, 0xDE}, but it seems that the table file should define an entry using the actual byte order. I would expect the table file to define the hex value in the same manner it would appear in a hex editor. Perhaps I am completely misunderstanding the significance of endianness in the context of the table file.


I'm a little confused. The very thing you describe, i.e. "in the same manner it would appear in a hex editor", depends on endianness to begin with.

However, Nightcrawler's clarification is correct. We're talking big-endian values when talking about specific values in the table file, because the bytes are read sequentially and interpreted in that order.
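
A small illustration of what that means in practice (Python; the "<foo>" label is made up):


Code:
# A table entry "DEAD=<foo>" matches the bytes DE AD in the order they
# appear in the ROM stream, i.e. the hex in the table reads big-endian
# with respect to the byte stream.
rom = bytes([0xDE, 0xAD, 0x01])
assert rom[0:2] == bytes.fromhex("DEAD")            # sequential byte match
assert int.from_bytes(rom[0:2], "big") == 0xDEAD    # same value, read big-endian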


snarf wrote on Mar 4th, 2011 at 8:12pm:
5 - Behavior for dumping/inserting
The specification identifies proper behavior for dumping and inserting. While this information is useful and relevant, it is not part of the file format and thus technically does not belong in the text of the spec.


I beg to differ. While the text aims at unifying "table files", what is actually being unified is the textual representation of certain dumping and insertion processes, and the matching behavior between the textual and hexadecimal representations of the game script. As such, these definitions are naturally a part of the spec.

However, this might need to be explicitly stated somewhere.


snarf wrote on Mar 4th, 2011 at 8:12pm:
6 - Lexical Definitions
Again, just a thought. The grammar of the format is pretty simple, and this certainly isn't a must, but I see value in a formal lexical definition of the format, which eliminates ambiguity.


I concur and would also offer to provide EBNF grammars and regex strings for matching entries.
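
For instance, a hedged sketch of such a matching regex for a plain "hex=text" entry might look like this (Python; the exact grammar would of course come from the spec itself):


Code:
import re

# Sketch: an even number of hex digits, "=", then a non-empty text sequence.
NORMAL_ENTRY = re.compile(r"^(?P<hex>(?:[0-9A-Fa-f]{2})+)=(?P<text>.+)$")

m = NORMAL_ENTRY.match("1A=abc")
assert m and m.group("hex") == "1A" and m.group("text") == "abc"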


Nightcrawler wrote on Mar 18th, 2011 at 3:19pm:
One comment:

Quote:
*557      |You used to write table ids in all-caps in the other examples

I think this helps demonstrate table ids can be upper or lower-case rather than it being a document consistency issue.


I just reread that particular part of the document. It might be worthwhile to explicitly state the casing for table ids and labels in general just like for hexadecimal values.

I will get to reread the document as a whole soon, so I guess this is it for now.

cYa,

Tauwasser

Title: Re: 'Standard' Table File Format
Post by Nightcrawler on May 20th, 2011 at 9:33am
Well, several more items are being kicked around over at RHDN:
http://www.romhacking.net/forum/index.php/topic,12644.0.html

It's already been a year. I'm not sure I want to allow it to start to run away, unravel, and run off the tracks. Discussion over there is going in the direction of significant added complexity. I sent a message to Klarth, and I e-mailed Tau. I'm interested to hear what some of you think about some of the new discussion opened up in the link above, to help me determine whether I want to keep kicking this around or start to tighten up, finish, and move on. It's starting to become a bit draining with no end in sight if it continues like this.

This is really a reminder of why we never had a standard before, and why our community in general has so much trouble making any successful ones (patching format standards are a case in point).

Title: Re: 'Standard' Table File Format
Post by Nightcrawler on Jun 10th, 2011 at 3:06pm
After lengthy discussions over at RHDN, a number of items were changed. This is a rough compilation of pages and pages of discussion thus far. I have not yet had time to proofread or spell check the current draft.

General Changes:


  • Created section 2.2.1 "Raw Hex Representation". Shifted all other subsections down one number and changed section references accordingly.

  • 2.2 Forbid the "<$XX>" pattern in text sequences. It is still possible for table entries to combine to create the raw hex sequence and be inserted improperly as a result. We agree to live with this possibility.

  • Revised Section 2.3. Renamed, and reworded to reflect single formatting sequence. Removed all references to comments or '//' in the example.

  • Revised Section 2.4. Removed artificial end tokens (end tokens with no hex representation) and reworded references to the "\n" formatting sequence. Added a note that duplicate end token text sequences follow 2.2.7 rules.

  • 2.2.6 + 2.6 Added clarification on single logical table duplication and unique table names across all tables. Added clarification that the "\n" sequence is ignored when checking for duplication and provided an example.

  • Edited 2.6 to specify one table per table file.

  • Amended 2.2.5 to provide a recommended longest-hex algorithm for text collision resolution and to make note of other more intelligent algorithms.

  • 2.2.3 Amended no-entry-found behavior for inserting to require generation of an error rather than ignoring and continuing.

  • Section 5. Lexical Definitions - Started this section. It was a good idea. I don't have much motivation to expand and write it though. I'm hoping someone else would take over this task.




OUTSTANDING BUSINESS:


  • As pointed out by abw's first post, we have an issue with leaving in the formatting '\n' and trying to add comments. We can push comments out to the utility realm, but it is still difficult to be able to comment each line and, say, have no comments between strings, as in this example.

  • Along with the above, consider the "/FF=/1<END>\n//" regex-like syntax that could be used as an alternative. Possible reconsideration for regex in table entries? See Tau's post.

  • Raw hex causes several issues. First, a combination of normal entries may inadvertently output a raw hex sequence during dumping and thus not insert. Secondly, inserting raw hex can cause issues for table switching behavior. Lastly, it can cause a subsequent token to be interpreted as a different token upon insertion. One solution to several of these issues, suggested by abw, was a general game control code and disallowing <> type characters in normal entries.

  • There are some insertion issues that can arise where less intelligent insertion (such as longest prefix) could result in text being interpreted as different tokens upon insertion. See the example at the bottom of this post.

  • Allow multiple tables per file? It may be useful to have all kanji/hira/kana in a single file, even if they are different logical tables. -Current ruling is to leave a single table per file.

  • Linked Entry formatting strings. "$XX=<token %2X %2X>" Cleaner and easier to validate, but increases complexity. (See the sketch after this list.)

  • Insertion details for table switching. Outside of the table file itself, but still needed. Thus far, no suitable solution has been determined by anyone. Working on reducing the supported features of table switching to something more manageable.
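
Regarding the linked entry formatting idea above, here is a hedged sketch of how a dumper might interpret such a format string (Python; the semantics of "%2X" are assumed for illustration, not taken from any draft):


Code:
# Sketch only: "$XX=<token %2X %2X>" could mean "byte XX starts a linked
# entry followed by two parameter bytes, each printed as 2-digit hex".
def dump_linked(rom, pos, fmt="<token %2X %2X>"):
    nparams = fmt.count("%2X")
    params = rom[pos + 1 : pos + 1 + nparams]
    out = fmt
    for p in params:
        out = out.replace("%2X", format(p, "02X"), 1)
    return out, pos + 1 + nparams

print(dump_linked(bytes([0x5A, 0x01, 0xFF]), 0))
# -> ('<token 01 FF>', 3)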

Title: Re: Project - Table File Standard
Post by Nightcrawler on Aug 18th, 2011 at 9:13am
We're looking to close up the outstanding business over at the RHDN topic. We're getting through most of it. It shouldn't be too much longer. This has been holding up development of TextAngel, so I'm really serious this time that the next draft that incorporates the remaining business will in fact be the final draft and undergo final review by all interested parties. It will be feature complete, so at that point only edits and clarifications would be considered unless there was some glaring oversight.

After a year on this thing, this is probably the last standard I will head up. I've learned my lesson! :P

Title: Re: Project - Table File Standard
Post by Nightcrawler on Mar 28th, 2012 at 1:33pm
Rejoice! A new draft has been released that covers all outstanding business. It should be very close to final and now is officially entering the final review stage! :)

At this point, it should be safe for interested parties to start early development based on the current draft. No new features will be added and only minor changes and editing would be made. I hope to have all major involved parties review soon and target a release by May, which will be the 2-year mark! :D

Many changes were made to the document, including a full reorganization of section 2. If you have previously read the document, you should read through it again, as this was a major edit with feature and rule changes.

Generalized List of Changes:


  • Total reorganization of Section 2 based around Normal and Non-Normal entry grouping mentality.
  • Added Declaration of Conformance section.
  • Added information on level of compliance and allowed insertion algorithms.
  • Added additional information to the overview.
  • Standardized a common format and rules for all Non-Normal entries.
  • Control Codes (formerly linked entries), end tokens, and table switching became non-normal entries.
  • Limited "\n" line break use to controlled circumstances in Non-Normal entries only.
  • Control Codes were reworked with formatting and inclusion of parameters.
  • Table Switching section was redone to be fleshed out with details discussed at RHDN.
  • Rules were added or changed in all sections to handle many edge cases presented in RHDN discussion.



Title: Re: Project - Table File Standard
Post by Nightcrawler on Nov 21st, 2012 at 9:08am
In case anyone was wondering what the status of this was. None of the consortium of peers I sent the draft to for final review had anything further to say. Before parading it around as official, I thought a few things needed doing first.

First, it needs an implementation. Releasing a spec without also releasing something that uses it would doom it to obscurity or my personal use. So, I need to get a full implementation in TextAngel and then release that alongside the standard. Obviously I haven't yet finished TextAngel enough for a release, nor do I have a complete implementation of all features of the latest draft in the table engine. There are some complex features and rules there which are time-consuming to implement and test. It's all about time, and time is always short.

Second, I think it could also use another small edit, as I noticed a few typos remaining after not having looked at it in a while. It might be useful for another set of eyes to do that as well. Lastly, I considered doing a fancier formatted PDF version.

Anyway, the spec itself is still final in content unless I come across some show stopping problem during my implementation in TextAngel. If anybody else is working on an implementation using this specification, by all means let me know how it is going. :)


Title: Re: Project - Table File Standard
Post by RyogaMasaki on Jan 14th, 2013 at 9:04pm
Hello, all. I'm in the process of (re)writing a ROM text dumping application and I plan to implement this table specification.

First of all, I'm not an advanced programmer. This project is actually my way of learning C# and is based on earlier versions I did with VB6 and C++. I was trying to dump the text from the Chrono Trigger proto back then, and I could see a need for an updated and standardized table format. My dumper used some custom codes in the table for formatting output and such, but the table spec you guys have worked out is a much better implementation, imo.

So, having worked with this for the last few days in C#, I have a few questions and comments. I haven't read through every bit of this thread or the one at the main romhacking.net forum, so please forgive me if some of these questions are already answered.

1. The BOM - According to Wikipedia (which I can't link to as I'm new to the message board, but they have a BOM entry), the BOM for UTF-8 encoding is allowed but is not recommended, since UTF-8 always has the same byte order. The table spec recommends its use, however. Is there a specific reason for this?

2. Something I've always used in my tables is the # symbol for single-line comments. I understand these aren't really very important, but comments seem like something easy to deal with when parsing the string from the stream. I humbly suggest a way to add comments to tables.

3. End tokens - While it seems obvious that there would only be one end token per table (or per logical table?), the table spec doesn't mention this in particular. Can I assume that there will only be one end token, or should I allow for multiple per table? Also, with end tokens, does section 2.5.2 'inherit' the General Rules of section 2.5? Namely, can end tokens be multibyte? Sorry if this question seems a little stupid, heh.

4. Sect. 2.5.3 - "All tables to be used with switching functionality must include a unique ID line identifying the logical table"
Does this mean that if there are any logical (labeled) tables, ALL tables must have a label? Currently, my code iterates through the text file until it comes to a normal entry. If there is no current logical table and it didn't find a label, it assumes this is the 'default' table and makes a dictionary for it. It stays in that dictionary until it comes upon a new @ label and creates a new one, etc.

Per the table spec, should an unlabeled default/main table with additional labeled logical tables be allowed, or should ALL tables be required to have a label?

Thanks for all the work everyone did in publishing this document. It makes things both easier for me and more challenging: half my work has been done for me with a usable tbl format, and now I have to make everything compliant! :)

Title: Re: Project - Table File Standard
Post by Nightcrawler on Jan 15th, 2013 at 10:40am
Great to know a few people out there have found this useful. I think it's the best text based table spec we will have until we eventually move on to XML tools, spreadsheets, or whatever else becomes the next generation of translation tools.


RyogaMasaki wrote on Jan 14th, 2013 at 9:04pm:
1. The BOM - According to Wikipedia (which I can't link to as I'm new to the message board, but they have a BOM entry), the BOM for UTF-8 encoding is allowed but is not recommended, since UTF-8 always has the same byte order. The table spec recommends its use, however. Is there a specific reason for this?


I think that passage got its roots from early specs that did not require UTF-8 only. Thus, if multiple encodings were involved, the BOM signature was preferred. ASCII or other encodings can be difficult to distinguish from UTF-8 without a BOM. We wanted to avoid guessing file encodings or needing user specification. I actually still do support a few other common encodings in TextAngel. I never took them out. I thought accepting them would aid in conversion and/or transition to UTF-8 of older tables for users. There are people who have no idea what UTF-8 is or why they should use it. There are also lazy people who won't try anything new if it's too much work.

You're right, technically that recommendation should not be necessary if we're only dealing with UTF-8.
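
For what it's worth, an implementation can stay agnostic about the BOM fairly cheaply. A minimal sketch (Python; the "utf-8-sig" codec strips a BOM if present and is a no-op otherwise):


Code:
# Sketch: accept a UTF-8 table file with or without a BOM.
def read_table_text(path):
    with open(path, encoding="utf-8-sig") as f:
        return f.read()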


Quote:
2. Something I've always used in my tables is the # symbol for single-line comments. I understand these aren't really very important, but comments seem like something easy to deal with when parsing the string from the stream. I humbly suggest a way to add comments to tables.


What do you use comments in your table for? I don't think this has been brought up before.


Quote:
3. End tokens - While it seems obvious that there would only be one end token per table (or per logical table?), the table spec doesn't mention this in particular. Can I assume that there will only be one end token, or should I allow for multiple per table? Also, with end tokens, does section 2.5.2 'inherit' the General Rules of section 2.5? Namely, can end tokens be multibyte? Sorry if this question seems a little stupid, heh.


End tokens are like any other entry here. Multiple are allowed. Yes, they inherit the rules of 2.5 for their hex sequences and labels.
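
So, for example, entries along these lines would both be legal (labels hypothetical):


Code:
/FF=<END>
/FEFD=<STOP>


The second shows a multibyte hex sequence on an end token.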


Quote:
4. Sect. 2.5.3 - "All tables to be used with switching functionality must include a unique ID line identifying the logical table"
Does this mean that if there are any logical (labeled) tables, ALL tables must have a label? Currently, my code iterates through the text file until it comes to a normal entry. If there is no current logical table and it didn't find a label, it assumes this is the 'default' table and makes a dictionary for it. It stays in that dictionary until it comes upon a new @ label and creates a new one, etc.

Per the table spec, should an unlabeled default/main table with additional labeled logical tables be allowed, or should ALL tables be required to have a label?


You only need to label a table with a TableIDString line if there is an explicit table switch entry that uses it. Thus, it would be allowed to have an unlabeled main table and labeled additional tables, assuming none of the additional tables had any table switch entries that called the main table (which usually wouldn't happen in most games). It's probably best practice to label all your tables though. Additionally, I and the few others I know who have implemented this table switching stack use labels internally anyway. So, even if you didn't explicitly define one, some kind of default label is given to tables that do not have it.
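
A hedged sketch of that internal-default idea (Python; the "{main}" ID is an invention for illustration, not part of the spec):


Code:
# Sketch: tables without an explicit ID line get a synthetic ID so
# switch entries can reference every table uniformly.
tables = {}

def register_table(table_id, entries):
    tables[table_id if table_id else "{main}"] = entries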


Quote:
Thanks for all the work everyone did in publishing this document. It makes things both easier for me and more challenging: half my work has been done for me with a usable tbl format, and now I have to make everything compliant! :)


Yes, a table implementation meeting this standard is more challenging and more work because so much additional behavior is now defined. There are more rules and features. I guess that's the nature of standardization though. Without it, everybody rolls their own different table spec to suit their needs. Much behavior is undefined, extended, or outright ignored.

Title: Re: Project - Table File Standard
Post by RyogaMasaki on Jan 15th, 2013 at 10:46pm
As for comments, I've used them for basic metadata, like a source URL or a log of 'version' updates to the file. It could also be used to declare that the file is Table File Spec compliant, so end users are aware of the specifics of the file (i.e. so they don't open a new spec file in Translhexation or something).

Also, I'm working on implementing Table Switching tonight, and I have a question. I initially misread the document, and assumed that all logical tables exist in one table file, in a format like this:


Code:
(start of file)
xx=entry
xx=entry
xx=entry
@NewID1
xx=entry
xx=entry
xx=entry
@NewID2
xx=entry
xx=entry
(End of file)


Meaning all logical tables exist in one text file, with an unnamed (but named internally) 'main' table, and then named logical tables to branch into.

However, looking closer at the spec ("Support of multiple table files...", "Only a single logical table per table file is allowed"), are you saying that each table file can only hold one logical table? As in, for the above example, there would need to be three files (main and the two branches)? If that is the case, it sort of makes things needlessly complicated when everything can be neatly sorted in one file. Maybe I'm not experienced enough, but I can't imagine a scenario where all the logical tables couldn't be stored in one text file.

Title: Re: Project - Table File Standard
Post by Nightcrawler on Jan 16th, 2013 at 9:53am

RyogaMasaki wrote on Jan 15th, 2013 at 10:46pm:
As for comments, I've used them for basic metadata, like a source URL or a log of 'version' updates to the file. It could also be used to declare that the file is Table File Spec compliant, so end users are aware of the specifics of the file (i.e. so they don't open a new spec file in Translhexation or something).


It's probably too late to add something like this now unless it was comments on their own line only. Allowing commenting on the same line as an entry would have many implications in parsing, rules, and text sequence restrictions. That's probably out. However, allowing comments on their own line would have little impact.
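
A sketch of why the impact would be small (Python, assuming '#' marks an own-line comment):


Code:
# Sketch: own-line comments can be filtered out in one pass before
# any entry parsing happens.
def strip_comment_lines(lines):
    return [l for l in lines if not l.lstrip().startswith("#")]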


Quote:
However, looking closer at the spec ("Support of multiple table files...", "Only a single logical table per table file is allowed"), are you saying that each table file can only hold one logical table?


Correct. The majority vote was for allowing only one logical table per table file on the grounds of added complexity with almost no gain. In addition to that, your example totally breaks the fact that the table standard does not rely on manual ordering of entries in any way. Your example relies exclusively on manual ordering. Move one of those TableID lines and the encoding changes entirely. Another thing that went into it was ease of upgrading existing software such as Atlas and TableLib. You are probably already familiar with one or the other. Klarth (creator of Atlas) had much input on the standard, and modifying Atlas to accept the new spec is critical to any real adoption of the standard. A number of things like those I've described were considered when that decision was voted on.

Title: Re: Project - Table File Standard
Post by RyogaMasaki on Jan 16th, 2013 at 8:53pm

Nightcrawler wrote on Jan 16th, 2013 at 9:53am:
However, allowing comments on their own line would have little impact.


Yes, sorry, I should have been more clear: the comments I use are on their own line entirely. They cannot be added to the same line as an entry or table ID.

Now, in reply to multi-file tables, I don't want to seem argumentative. I just want to have a firm understanding before I go further with my table code. Take this obviously broken example table, modified from section 2.5.3:


Code:
@HIRA        Found an @, check if HIRA already exists in our collection of LogicalTable objects, if not create a new one with that ID, switch Current Table to HIRA
00=あ                Add these entries to the dictionary in the HIRA object...
01=い
02=う
03=[PlayerName]     (This example entry is in the spec in sect 2.5.3, but as far as I can tell it's an illegal entry due to the [ and ] in the text of a normal entry)
!F8=[KATA],0        Add F8 to Switch list in HIRA object
!F9=[KANJI],0       same for F9
!FA=[SUBSTRINGS],0  same for FA

@KATA               Found an @, check for KATA object, create if none, switch Current Table to KATA
00=ア                 Add to KATA...
08=セ
02=ウ               Order of entry keys doesn't matter,
01=イ                it will check if they exist before adding them,
05=サ                and throw an error if they exist already in this table
09=ソ
04=オ
06=シ
0A=カ
03=エ
07=ス
!F8=[HIRA],0      Add F8 to the Switch list in KATA object
!F9=[KANJI],0     Same for F9

@KANJI            Create KANJI logical table if it doesn't exist, switch Current Table to KANJI
00=亜             Add entries to Current Table...
01=意

@HIRA             Create HIRA if it doesn't exist, switch Current Table to HIRA
03=え              Add to Current Table...
04=お
05=さ
06=し
00=あ              ERROR: entry 00 is already in dictionary of object HIRA, abort reading the file
01=い
07=す
(EOF)            End of file, check through Switch lists of all log. tables and make sure any table references have associated objects, if not, prompt for other table file



This table file is problematic in several ways, but I want to use it to exemplify how I'm parsing the table. The parser finds @HIRA and creates a Logical Table object with its own dictionary and lists keeping track of which entries are control codes, end codes, and table switches. The dictionary is populated with the entries, including F8-FA, which are also added to the TableSwitch list. After reading the entire file, it will check the TableSwitch list and make sure any IDs referenced have a matching Logical Table object. If not, that Table ID was not encountered in the table file, so it prompts the user for another file with the proper Table ID. In that way it supports the spec's multiple files as well as multiple logical tables in one text file. In the case above, it finds KATA, KANJI and HIRA objects but not SUBSTRINGS. It would then prompt the user for a table with an @SUBSTRINGS identifier.
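
A condensed sketch of that parsing approach (Python; all names hypothetical, simplified to '@' labels, '!' switch entries, and normal entries):


Code:
def parse_tables(lines):
    tables = {}           # table ID -> dict of hex key -> entry
    switch_refs = []      # table IDs referenced by switch entries
    current = None
    for line in (l.strip() for l in lines):
        if not line:
            continue
        if line.startswith("@"):              # new/continued logical table
            current = tables.setdefault(line[1:], {})
            continue
        if current is None:                   # entries before any @ label
            current = tables.setdefault("{main}", {})
        if line.startswith("!"):              # table switch entry
            hexkey, rhs = line[1:].split("=", 1)
            target = rhs.split(",")[0].strip("[]")
            switch_refs.append(target)
            current[hexkey] = ("switch", target)
        else:                                 # normal entry
            hexkey, text = line.split("=", 1)
            if hexkey in current:
                raise ValueError("duplicate entry " + hexkey)
            current[hexkey] = text
    missing = [t for t in switch_refs if t not in tables]
    return tables, missing   # caller prompts for files covering `missing`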

You wrote that my example in the last post "relies exclusively on manual ordering." I'm not certain I understand what you mean by that. The bytes can be in any order inside a logical table, since they are stored as a collection. They must only be unique within that logical table. There is a certain manual order in that the text file is being read sequentially. This is key to how my method parses the file, as it works on the premise of "Work on current table -> find @ -> change current table to new Table ID -> continue work on current table." Basically, could you explain more about what you mean by relying on manual ordering?

I don't mind supporting multiple files for a table, especially if it's to support Atlas/TableLib (neither of which, honestly, I'm familiar with; I only did text hacking on a small number of games, with things like Translhexation and Thingy, several years ago). But I still don't understand how having it in one physical file is 'added complexity.' Then again, I've only worked with dumping text. Is there some aspect of reinserting text that plays a part in this complexity that I'm not realizing? Would it be possible to support both multi-file and single-file instead of one or the other exclusively?

Thanks for taking the time to answer my questions. I have a request though. Would you mind sharing any spec compliant tables you have, especially one with table switching? Just to run through my own code and see what happens. I understand if you don't want to share something like that though. :)

Title: Re: Project - Table File Standard
Post by Nightcrawler on Jan 17th, 2013 at 8:37am
Yes, I understand how you were parsing your single file containing multiple logical tables.


RyogaMasaki wrote on Jan 16th, 2013 at 8:53pm:
There is a certain manual order in that the text file is being read sequentially. This is key to how my method parses the file, as it works on the premise of "Work on current table -> find @ -> change current table to new Table ID -> continue work on current table." Basically, could you explain more about what you mean by relying on manual ordering?

You wrote that my example in the last post "relies exclusively on manual ordering." I'm not certain I understand what you mean by that. The bytes can be in any order inside a logical table, since they are stored as a collection. They must only be unique within that logical table.


That's exactly the ordering I was talking about. Move '@KATA' and you change the map entirely. There is no manual ordering of any kind required in the current table spec. That was my point.


Quote:
But I still don't understand how having it in one physical file is 'added complexity.' Then again, I've only worked with dumping text. Is there some aspect of reinserting text that plays a part in this complexity that I'm not realizing? Would it be possible to support both multi-file and single-file instead of one or the other exclusively?


This is a tall-order question. I would suggest you browse through the topic here and over at RHDN for discussions on some of the intricacies of table switching. It is a complex subject, especially when it comes to insertion. In addition, I'd advise taking a look at Atlas to understand what it does and browsing the source a bit. This standard was developed over the course of a year or two. I can only tell you what the majority voted for and some of the reasons cited that I noted or recall. The rest is in the details spread out over time in the topics, or lost where discussion happened in PMs.

Atlas
Table Lib


Quote:
Thanks for taking the time to answer my questions. I have a request though. Would you mind sharing any spec compliant tables you have, especially one with table switching? Just to run through my own code and see what happens. I understand if you don't want to share something like that though. :)


I haven't needed table switching for the games I've been working on since this was developed. I think you can find some real brain-twisting examples for table switching in the topic over at RHDN though. I think some very evil cases were crafted to test with there. Beyond that, since TextAngel does not reflect the latest draft, my current practical tables and my test table do not either. They're hybrid. It's been a messy affair to simultaneously develop a standard, a utility, and work on games that utilize both. Constant WIP state until things start to get finished! :)

Title: Re: Project - Table File Standard
Post by RyogaMasaki on Aug 24th, 2013 at 2:53pm
Hi again! I posted several months ago about adopting the table spec for a utility I was writing. Well, I've put the first public version up for download, with partial support for Nightcrawler's spec. I'll be continuing to work on the program, with the goal of eventually having 100% support for the table file spec, but I'm releasing a public version for now to get some feedback and bug reports. If anyone gives it a go, I'd appreciate any comments!

You can get it here: sudden-desu.net/dumpster/

Title: Re: Project - Table File Standard
Post by Nightcrawler on Sep 1st, 2013 at 12:42pm
That's great to hear! Are you planning on releasing the source to ROMlib?

Unfortunately Dumpster doesn't even start on my computer. An unhandled win32 exception occurred in Dumpster.exe. (4668).  :(

What .NET version does it target?

Title: Re: Project - Table File Standard
Post by RyogaMasaki on Sep 3rd, 2013 at 9:58am
Huh, that sucks... It targets 4.5; I don't think I can make it any lower (easily...) as I'm using some 4.5 stuff. Anyway, it's my first 'real' program, so I'm sure I just missed something. What OS are you running?

Yes, I would like to open the source to both ROMlib and Dumpster, and certainly will, but for right now, as a personal project, there's more I want to add and clean up before presenting to the world.
I wrote the code for importing your table pretty early on, back in January and February, then got sidetracked with the graphics and such. Kind of had to just give myself a little refresher, haha. It currently supports control codes, end tokens and standard entries. It parses out named logical tables, but doesn't support the table switching tag (!), and it allows multiple logical tables per file. It could probably be a bit more strict, and of course I want to have a mode for 100% spec compliance; I've just done it 'my way' for the development. Finishing the proper support is up there on the to-do list.

Title: Re: Project - Table File Standard
Post by Nightcrawler on Sep 4th, 2013 at 6:24pm
That's probably what it is. I only run 4.0. You can probably check for that and/or catch the exception rather than crashing on start-up.

What on Earth are you using that you require from 4.5 for a ROM hacking utility? That's virtually a Windows 8 exclusive release.

Title: Re: Project - Table File Standard
Post by RyogaMasaki on Sep 6th, 2013 at 12:39am

Nightcrawler wrote on Sep 4th, 2013 at 6:24pm:
What on Earth are you using that you require from 4.5 for a ROM hacking utility? That's virtually a Windows 8 exclusive release.

http://blogs.msdn.com/b/dotnet/archive/2012/04/03/async-in-4-5-worth-the-await.aspx

The async/await functions. Honestly, they're not 'necessary' on reasonably sized files, so I can make a version targeted at 4 without async, or using the old method.

Title: Re: Project - Table File Standard
Post by Nightcrawler on Sep 10th, 2013 at 5:46pm
Out of all the ROM hacking applications I've made in .NET, none of them ever required features beyond .NET 2.0 :P

Title: Re: Project - Table File Standard
Post by Kef Schecter on Dec 22nd, 2013 at 11:19pm
I'm writing a Python library to implement this standard (and probably support legacy TBL formats too), but I think the table switching feature is, well, a bit nutty. Yes, I see the value of it and I understand why it's there. But it introduces lots of complexity, especially when inserting. It confounds writing an "optimal path" algorithm, and you have to jump through hoops just to recognize when you need to switch tables. While I don't think the feature is a mistake per se -- it's definitely nice to have a standard way to handle the problems this feature is designed to tackle -- I think requiring it for compliance is. Rather, I think it should either be deferred to a 1.1 standard or be made a separate extension to the standard.

As it stands, I currently have no plans to implement this feature in my library.

I also strongly disagree with the idea of forcing each logical TBL to be in its own .tbl file. In my opinion, this is exactly backwards -- each .tbl file should be a complete unit that doesn't require any additional files to work. That way you can load just one .tbl file and be done with it.

I think we should have a place where we can hammer out the standard, including a complete record of arguments for and against everything, and get it to everybody's satisfaction and, perhaps even more importantly, get it beyond a draft that nobody really uses. Maybe it should be on some kind of wiki or something.


Finally, there are a few minor errors in the standard.

In section 2.5:


Quote:
Labels should consist only of digits [0-9A-Za-z].


A-Z and a-z are not digits. It should say "characters".

In section 3.2:


Quote:
!7F=[Dakuten,]-1


Surely the comma should be after the bracket, not before it?

In the same section:

Quote:
When "7F" is encountered in Table 2, so fallback to Table 1 will occur.


This is not grammatical; the "so" should be removed.


4.2 and 4.3 mention "Linked Entries", but this term occurs nowhere else in the specification and it is not clear what it means.

Title: Re: Project - Table File Standard
Post by Nightcrawler on Jan 5th, 2014 at 7:15pm
I agree utilizing table switching for insertion can be a complex subject. It's a big part of the additional functionality this standard brings and standardizes. I highly advise reading through this topic where this is discussed in great detail, and how it developed into what it is:

http://www.romhacking.net/forum/index.php/topic,12644.0.html

You should not need optimal path. The rules and limitations put in place toward the end greatly reduced many of the complexities in table path finding. Take a look through the topic linked above. Back-pedaling on that now would be defying the majority and cutting a leg out from under it. Even if you did need optimal path, as I was told when I mentioned complexity exactly as you have: why should everyone be held back simply because one guy can't grasp an algorithm? A little harsh, I know, but I guess it makes some sense.


Quote:
Correct. The majority vote was for allowing only one logical table per table file on the grounds of added complexity with almost no gain. In addition to that, your example totally breaks the fact that the table standard does not rely on manual ordering of entries in any way. Your example relies exclusively on manual ordering. Move one of those TableID lines and the encoding changes entirely. Another thing that went into it was ease of upgrading existing software such as Atlas and TableLib. You are probably already familiar with one or the other. Klarth (creator of Atlas) had much input on the standard, and modifying Atlas to accept the new spec is critical to any real adoption of the standard. A number of things like those I've described were considered when that decision was voted on.


See this passage from a few posts up on why table files were restricted to one logical table per file. There are a number of reasons. It's another majority vote that I can't really back out on now. However, it is certainly something to be considered (and that I'd probably support) for a 1.1 version later on.

There's nothing left to really hammer out. The content was deemed finalized and already a satisfying compromise amongst the consortium after 2 years of discussion. There's no way a wiki would ever work. You'd never reach a large-scale public consensus. That's why hardly anything has ever been standardized in ROM hacking. Just as you come and disagree with pieces of this standard, another would always come and disagree with you. You have to draw a line somewhere or discussion and disagreement goes on indefinitely. At least with a small group, compromises truly can be made and a satisfying conclusion reached. You've already declared you don't want to follow the standard and offer no table switching compromise, so you are really opposing the effort as much as anything.

It's been stuck at draft mostly because I never finished TextAngel and Klarth never updated Atlas. I never wanted to officially release the standard without utilities to use it, or it would probably flop. If TextAngel were released and Atlas were modified, it would most likely succeed. Neither has happened yet, so we're in limbo.

Most of the discussion can still be found in this topic and the topic at RHDN. There was other discussion via PMs (that I no longer have), but I think that was much less, and usually with Klarth regarding Atlas. So, much of it is available to delve into. There's so little interest in something like this that I don't think it's worthwhile to do anything special with it. One or two new people per year are interested enough to discuss it.

Thanks for pointing out the errors. I will work on correcting them. Linked Entries are a residual concept from a previous version that no longer exists. It was further developed into what is now Control Codes.
