Project - Table File Standard (Read 49257 times)
Nightcrawler
YaBB Administrator
*****
Offline


The Dark Angel of Romhacking

Posts: 3245
USA
Gender: male
Re: 'Standard' Table File Format
Reply #45 - Nov 30th, 2010 at 4:06pm
 
I'm back! I've hiked an active volcano, snorkeled with dolphins and sea turtles, been above the clouds, seen magnificent waterfalls, and swum in a secluded lagoon. Very nice trip!

I've re-read the posts of this discussion and given it some more thought. I still don't like the idea of non-injective tables. I'd prefer to see each normal table file line load 1:1 for the dumping or insertion direction, whichever is chosen. However, I recognize the user convenience of a single table file versus needing two in some of the cases discussed. I also recognize your demonstration that it can be accomplished with minimal code. So, I will concede that the format allow it.

Let's summarize the cases again and make sure we're in agreement:

  • Hex and text sequences are unique:
    'Normal' entry, loaded 1:1 in the dumping direction including empty strings. Empty strings are skipped in the insertion direction. 
  • Hex sequence is unique, text sequence is duplicate:
    Loaded 1:1 in dumping direction. The empty string is included. In the insertion direction, the shortest hex sequence is chosen to add to the map for that text sequence. If multiple hex sequences of the same size exist (mapping to the same text sequence), it does not matter which is chosen. However, if I do specify, I would choose last occurring in the table file. If the text sequence is an empty string, it is not loaded into the insertion map.
  • Hex sequence is ambiguous, text sequence is unique or ambiguous:
    An exception/error should be generated on duplicate or unrecognized hex sequences.
  • The hex sequence is the empty hex sequence, text is unique or ambiguous:
    An exception/error should be generated on empty hex sequences.


So, we'd have two changes. One for illegal sequences.

The only illegal sequences will be:
Duplicate Hex Sequences
$BA=one
$BA=two

Blank Hex Sequences
=one

The second change would be to allow for duplicate text sequences. We specify that hex sequences mapping to blank text sequences are ignored for insertion. We specify that the shortest hex sequence should be used when there are duplicate text sequences.
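The cases summarized above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the spec's or either utility's implementation: `load_maps` and `TableError` are hypothetical names, the optional leading `$` is stripped because the examples in this post are written as `$BA=one`, and ties between equally short hex sequences go to the last occurring entry, as suggested above.

```python
class TableError(ValueError):
    """Raised for the two illegal cases: duplicate or blank hex sequences."""


def load_maps(lines):
    """Build (dump_map, insert_map) from 'HEX=text' lines (hypothetical helper)."""
    dump_map = {}    # hex bytes -> text, loaded 1:1 (empty strings included)
    insert_map = {}  # text -> hex bytes, shortest hex sequence wins
    for line in lines:
        hex_part, _, text = line.partition("=")
        hex_part = hex_part.lstrip("$")  # assumption: tolerate the "$BA" notation
        if not hex_part:
            raise TableError(f"blank hex sequence: {line!r}")
        key = bytes.fromhex(hex_part)
        if key in dump_map:
            raise TableError(f"duplicate hex sequence: {line!r}")
        dump_map[key] = text
        if text == "":
            continue  # empty strings are skipped in the insertion direction
        current = insert_map.get(text)
        # '<=' lets the last occurring entry win when lengths are equal.
        if current is None or len(key) <= len(current):
            insert_map[text] = key
    return dump_map, insert_map
```

With the entries `00=A`, `0100=A`, `02=B`, `03=B`, the insertion map would hold `00` for "A" (shortest) and `03` for "B" (last occurring among equal lengths), while the dumping map keeps all four entries.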

OK?

Quote:
Well, not necessarily. With .NET Linq-Queries over arrays and datatables, it should really be a breeze to find the shortest hexadecimal sequence for a given string out of all tables. Can't say for C++, since I haven't programmed in it long enough to have been confronted with querying data tables.
There are several libraries out for Java which add Linq-like support to it. So collections can be searched.


I actually haven't used LINQ much. I'm not sure how it would work with all the multiple table examples we've had.

I actually implemented this with a simple .NET Stack() class. It seemed to me to most closely match the table switching concept as we logically wrote it out. The dumper has an array of loaded table objects with an index for the active table object to use for decoding. When a new table is requested, the current index is pushed to the stack and the new active table object is called on. When a table expires for any reason, the last active table object index is restored from the stack. This allows for as many layers of table jumps as necessary and keeps track so it can fall back all the way to the starting table if necessary.
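The stack idea can be sketched as follows. This is a Python stand-in for the .NET Stack() approach described above, not the actual utility code; `TableSwitcher`, `switch_to`, and `expire` are hypothetical names, and the table objects are represented by plain strings for brevity.

```python
class TableSwitcher:
    """Tracks the active table index with a stack of previous indices."""

    def __init__(self, tables, start=0):
        self.tables = tables   # loaded table objects
        self.active = start    # index of the table currently used for decoding
        self._stack = []       # indices to fall back to, most recent last

    def switch_to(self, index):
        # Remember where we came from, then activate the requested table.
        self._stack.append(self.active)
        self.active = index

    def expire(self):
        # Restore the last active table; stay put if already at the start.
        if self._stack:
            self.active = self._stack.pop()

    @property
    def current(self):
        return self.tables[self.active]
```

Switching from table 0 to 1 and then to 2, followed by two calls to `expire()`, walks back through table 1 to the starting table.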
« Last Edit: Nov 30th, 2010 at 6:39pm by Nightcrawler »  

ROMhacking.net - The central hub of the ROM hacking community.
 
Tauwasser
Peasant
*
Offline


Evil Impersonator

Posts: 14
Re: 'Standard' Table File Format
Reply #46 - Dec 11th, 2010 at 8:36am
 
Hi, sorry for not answering sooner, had to take care of some RL stuff.

I think we're pretty much lined up. The only issue I had was the following:

Nightcrawler wrote on Nov 30th, 2010 at 4:06pm:
  • Hex sequence is unique, text sequence is duplicate:
    [...]However, if I do specify, I would choose last occurring in the table file.


[...]

The second change would be to allow for duplicate text sequences. [...] We specify the shortest hex sequence should be used when there are duplicate text sequences.


I guess you're set now for specifying shortest hex sequence for duplicate text sequence and mentioned the other only as another possibility? Either way, I agree with the last statement: go for the shortest hex sequence for a duplicate text sequence.

Nightcrawler wrote on Nov 30th, 2010 at 4:06pm:
I actually haven't used LINQ much. I'm not sure how it would work with all the multiple table examples we've had.


This would only matter for insertion. You would have to query each table individually for a given text sequence. Pretty much a for loop over all tables will suffice in most cases.
Of course, there is additional logic involved for table switching, so as to get the shortest hex sequence for insertion purposes over the whole text, so unnecessary table switching can be found and averted.
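The "for loop over all tables" could look like the sketch below. It is written in Python rather than LINQ, and `shortest_hex` plus the dictionary layout (table name mapped to a text-to-hex insertion map) are assumptions made for illustration. It deliberately ignores the cost of table switching mentioned above, which a real inserter would have to weigh.

```python
def shortest_hex(tables, text):
    """Return (table_name, hex_bytes) for the shortest encoding of text, or None."""
    best = None
    for name, insert_map in tables.items():
        candidate = insert_map.get(text)
        if candidate is None:
            continue  # this table cannot encode the text sequence
        if best is None or len(candidate) < len(best[1]):
            best = (name, candidate)
    return best


tables = {
    "main": {"the": b"\x10\x11\x12"},
    "dict": {"the": b"\x80"},
}
result = shortest_hex(tables, "the")
```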

cYa,

Tauwasser
 
Nightcrawler
Re: 'Standard' Table File Format
Reply #47 - Dec 14th, 2010 at 12:02pm
 
I've put a new draft revision up. It's mostly edits and fixes, however revision of section "2.2.5 Illegal Sequences" and addition of section "2.2.6 Ambiguous Situations" were done as a result of the discussions Tau and I have had here. I mentioned all cases in a way I thought would be clear to the target audience and fit within the style set by the document. This led me to group the results of our discussion into two groups. One being truly illegal sequences, and the other for cases where ambiguity exists. If you have a better idea for organization or presentation of these cases, I'm open to suggestion.

I believe this will be the last real content change and focus can move to a final edit and format. At this point, I should probably go around one more time and get all interested parties to sign off on the spec.


 
Tauwasser
Re: 'Standard' Table File Format
Reply #48 - Jan 23rd, 2011 at 9:48pm
 
Hi Nightcrawler,

I finally got to reading through your spec once more from top to bottom.

I uploaded a list of errata as well as suggestions to my Google site. I find there are only minor specification issues left, so you did an excellent job ;)


cYa,

Tauwasser
 
Nightcrawler
Re: 'Standard' Table File Format
Reply #49 - Feb 7th, 2011 at 2:31pm
 
Thanks for the list. I've been meaning to do another complete edit myself. I will go through it in detail at a later date. From what I see after browsing, I will probably have no issue with most changes. I've shifted gears for a while on this in favor of some progress on other projects. I will get back to it in conjunction with more work on my utility. I've found that developing a utility alongside the standard gives some good insight.

Speaking of which, an issue did come to my attention during my utility development with comments. If you recall, in the beginning, it was proposed to use '/n' and '/r' as newline with and without comments. Then after comment by DaMarsMan, it seemed redundant, and unnecessary to do that.

If you look at the example in 2.3, if you want to dump a script with commenting characters of, say, "//", we have this:

Quote:
Table Entries:
     FE=<linebreak>\n//
     /FF=<end>\n\n//


In practice, this seems to cause some small issues. First, the very first line wouldn't have any comment characters. It's only AFTER a line break that you'd see the comment characters. Second, you also have an issue with the very last line. Your script file will likely end with an '<end>', which will make for a blank commented line at the end.

I've handled this in my utility by having a template, since users are likely to want some sort of heading or footer information present. In there I allow for a $script variable, so the final output will put your script dump there. So I can easily take care of having the first line commented that way by simply putting '//$script' in it. In fact, that's what I did do for Heracles IV (I used the WIP utility for it). The final line of each file is still a commented blank line, but it doesn't affect insertion in any way.
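The template idea can be illustrated with Python's `string.Template`, which happens to use the same `$name` placeholder style. This is only a stand-in for the utility described above; `render_dump` and the sample template and script text are made up for the example.

```python
from string import Template


def render_dump(template_text, script):
    """Place the dumped script where the template says $script."""
    return Template(template_text).substitute(script=script)


# '//$script' comments the first dumped line; the table entries
# (FE=<linebreak>\n// and /FF=<end>\n\n//) take care of later lines.
template_text = "// Dumped script\n//$script\n"
script = "Hello<linebreak>\n//World<end>\n\n//"
output = render_dump(template_text, script)
```

The first script line now starts with "//", and the output still ends with the blank commented line mentioned above.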

So, it works, but it just seems a little hackish to me. There's also the need for the utility to know what the comment characters are for insertion purposes. So, since the table does not currently define them, you need to define them again in the utility. It's made me rethink the old way, and also the option of not having commenting characters present in the table file at all. Instead, that can be pushed entirely to the utility, where you'd define that you want comments after every line break or end token. I'm not sure if that's a good idea either, as we'd be giving up the ability for any number of line breaks coupled with characters to easily appear for any table entry.

Nothing I've thought of really feels 100% right to me. There seem to be some drawbacks with every idea I've thought of. Any ideas?
 
Tauwasser
Re: 'Standard' Table File Format
Reply #50 - Feb 10th, 2011 at 4:14pm
 
Nightcrawler wrote on Feb 7th, 2011 at 2:31pm:
Speaking of which, an issue did come to my attention during my utility development with comments. If you recall, in the beginning, it was proposed to use '/n' and '/r' as newline with and without comments. Then after comment by DaMarsMan, it seemed redundant, and unnecessary to do that.


I would still find it redundant and unnecessary simply for the fact that \r and \n usually mean other things.

Nightcrawler wrote on Feb 7th, 2011 at 2:31pm:
If you look at the example in 2.3, if you want to dump a script with commenting characters of, say, "//", we have this:

Quote:
Table Entries:
     FE=<linebreak>\n//
     /FF=<end>\n\n//


In practice, this seems to cause some small issues. First, the very first line wouldn't have any comment characters. It's only AFTER a line break that you'd see the comment characters. Second, you also have an issue with the very last line. Your script file will likely end with an '<end>', which will make for a blank commented line at the end.


I was actually wondering about that, but since I have never used Atlas before, I don't know if this would be good or not.
From what I see, it's the user's option what to put there and whether he wants to put some newlines in there. You cannot possibly account for every tool out there and for every mix.
For one, it would be desirable to have a code for stuff to put before the dumped text. Yet, the data in most ROMs just doesn't lend itself to this idea. Which is the problem you are experiencing.
Secondly, I think that having to dump in accordance with script formatting guidelines for some tool downstream in the production chain is actually desirable and nothing to put into the table file standard, since this would open you up to having to update your standard in accordance with some new tool X that not only allows commenting characters, but possibly binary data insertion characters (a recent thread on RHDN; it seems Atlas cannot handle it currently).

Therefore I think it's best to leave it up to the user to decide what to put there. Personally, I would nix the formatting codes altogether, for facts I have stated earlier:
  • You don't define \\ as the escape sequence for the literal backslash, opening the codes up for incompatibilities with user desires (because some people just need their slashes in there)
  • There are many powerful and dedicated text editors out there. If a user wishes, he can put "\n" in the table file, it gets dumped as a literal "\n" to the dump file and then it's only ever one expansion away from being a real newline.
  • It's doubled effort for the most part and trouble for people to implement. This idea needs scanning the text with a crude algorithm anyway. Incorporating a "$script" keyword in there would only complicate matters further. Also, it might be a novelty to you, but the word "script" is mostly not understood to mean "text" in non English-speaking parts of the world, so it'd be IMO a bad choice anyway.
    This job can be perfectly handled by text editors out there or in special dump utilities that would choose to support this option outside of the standard, thus not being incompatible with it and future updates (barring updates that would reintroduce control codes)


I personally think if you want to have your script like this, you could always do the following:

Code:
FE=<commentedline>\n 



Then it's a quick regex to get the linebreak and another one to get the whole line commented with "//" or whatever from the start. Much easier and arguably more versatile. Also, pretty much doable with minimal experience in Java and a batch file.
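The two post-processing steps could be sketched like this. It is a Python illustration rather than the Java and batch file mentioned above, and it assumes the dumper wrote the table entry's text verbatim, so the dump contains the literal two characters backslash and "n" after the keyword. `postprocess` is a hypothetical name, and the second step uses plain string handling where a second regex would serve equally well.

```python
import re


def postprocess(dump, comment="//"):
    # Step 1: turn the literal "\n" (backslash + n) after the keyword
    # into a real newline, keeping the keyword itself.
    expanded = re.sub(r"<commentedline>\\n", "<commentedline>\n", dump)
    # Step 2: comment every line from the start, first line included.
    return "".join(comment + line for line in expanded.splitlines(keepends=True))


dump = "Hello<commentedline>\\nWorld<commentedline>\\n"
commented = postprocess(dump)
```

Because the comment marker is added per line after the fact, the first line is commented too and no stray commented line is left at the end.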

Nightcrawler wrote on Feb 7th, 2011 at 2:31pm:
So, since the table does not currently define them, you need to define them again in the utility.


Again, I think incorporating this kind of tool-specific logic into the table file is not worthwhile. I find it makes a great deal of sense to dump a script in a specific way for further processing.

Nightcrawler wrote on Feb 7th, 2011 at 2:31pm:
Nothing I've thought of really feels 100% right to me. There seem to be some drawbacks with every idea I've thought of. Any ideas?


Well, first of all, I noticed that the standard lacks two things, in addition to the other list:

  • There ought to be a sentence explicitly forbidding implementation of more control codes for the sake of compatibility with future updates to the standard.
  • Having an end token once with and once without hex sequence yielding the exact same text sequence results in undefined insertion behavior and should be clarified.


Other than that, no, I don't have any thoughts on that. I used to write my own utilities for text insertion that would also pre-format the text to my liking, so I really don't know what it takes for Atlas or Cartographer. However, I advise against implementing any more logic that, strictly speaking, only the insertion utilities themselves need for inserting text. A table file doesn't need to know that "//" is magical in Atlas for dumping via any utility to work properly. I understand that this part of the spec as well as its shortcomings are a direct consequence of Atlas' popularity. I think it would be reasonable to expect the table file standard to work properly on its own without implementing mechanisms for specific tools' needs, because most situations can be covered by expanding certain user keywords after dumping, via regex or even simple search and replace.
So, in closing, I think less is more here.

cYa,

Tauwasser
« Last Edit: Feb 11th, 2011 at 2:09am by Tauwasser »  
 
 
Nightcrawler
Re: 'Standard' Table File Format
Reply #51 - Feb 16th, 2011 at 3:16pm
 
I agree with you that we don't want to push any more into the table file, nor do we want to do anything tool specific. I would even almost agree with your idea to remove the formatting entirely. However, I don't think that's a good idea in practice as you describe because:

  • You require a second step after the initial dumping utility to get any line breaks in the script at all. (Referring to running regex or programming a rudimentary Java utility). It seems that complicates the process for no good reason.
  • Many end users will not be programmers or understand regex. In my opinion, they shouldn't have to be skilled in those areas in order to be able to dump simple text with a utility using this standard.
  • The vast majority (if not all) of applications of dumping ROM text to a text file require line breaks in some capacity. It simply isn't very readable to a human without them. For such strong usage case, it makes sense to me that users expect line breaks in their text dump. It either has to be included in the standard or at least be incorporated into the dumping utility somehow.


Also, we're not doing anything Atlas specific with this whole commenting thing. It's desirable in general terms (for translation work) to be able to dump text from a ROM and precede it by some distinguishing mark so it would not be included in the insertion later. How else could you omit it? Again, probably the vast majority of cases will want to do this. We don't need to know what a commenting delimiter is in the table standard (The insertion utility will however), but we do need to be mindful to make this possible in some way with our table format. The way we do it now accomplishes that, aside from the side effects of the first and last line issue I described. To me, the problem here isn't whether or not to include a line break formatting code, but simply how to better facilitate a predominant usage case scenario.

I understand that a post process with regex or other utility could easily translate user defined keywords. I just don't think it should be a requirement in the process just to get simple line breaks in your script dump!! As I mentioned earlier, the end users aren't going to be able to do that easily. They want to simply be able to use a single utility to dump or insert text. One step process in most cases. If you have need for more advanced scenarios, you are probably writing all your own stuff anyway as you've already mentioned you do.

Perhaps my vision is blurred by working on both the table standard and utility that uses it, but the whole thing should simplify and better the process we have in place now, or why bother? I don't really want to push a standard that would require more or additional steps in the process. That doesn't make sense to me.
 
snarf
Peasant
*
Offline


I Love TransCorp!

Posts: 1
Re: 'Standard' Table File Format
Reply #52 - Mar 4th, 2011 at 8:12pm
 
Hi guys. I've been working on a program that will utilize table files. Working to your specification seemed like my best bet. Having gone over the spec a few times and implementing a reader, I have a couple of questions and comments. Sorry if any of it has already been addressed, there is a lot to read through between this thread and the spec.

1 - Hex Casing

I'm pretty sure that it's a safe assumption that either casing is acceptable, but you might want to make this explicit in the spec. All the examples use upper-case, and an implementer might not think to support lower-case hex digits.

2 - White-Space

Likewise, white-space is not addressed in the spec. I've made some assumptions here in my own implementation, but the spec should probably be explicit. I'd say it would be wisest to avoid un-called-for white-space when outputting a table file, but to trim any extra (non-ambiguous) white-space when parsing the table file. For example:
Code:
// Would this line be considered valid?
 B1FF = TEXT
// The text part of the entry shouldn't be trimmed; the white-space should
// be preserved in the output. But do we interpret this as having a trailing
// space after the "=" (trimmed) or a leading space included in " TEXT"?

// I've chosen to interpret a line like this as if it were equivalent to
B1FF= Text
// where the string includes a leading space.
 


It seems like the best thing would be for the spec to specify that there should be no white-space except as part of a text string, but recommend parsers be tolerant of erroneous white-space. Alternatively, it could be specified that extra white-space is either acceptable or completely invalid.

3 - Endianness

I'm just looking for a little clarification here. You discuss endianness in the spec, but I'm not clear as to how it actually comes into play. When dumping text, aren't we viewing ROM data as a byte stream? In this case there is no endianness, just single bytes, and I would expect the hex values in the table to be specified in actual order.

I understand that platforms use multi-byte words that have endianness. One system might represent the value 0xDEAD as {0xDE, 0xAD} and another as {0xAD, 0xDE}, but it seems that the table file should define an entry using the actual byte order. I would expect the table file to define the hex value in the same manner it would appear in a hex editor. Perhaps I am completely misunderstanding the significance of endianness in the context of the table file.
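The distinction can be made concrete with a small Python example. It only illustrates the point being asked about, under the assumption that a table entry's hex digits are read left to right as a byte stream; the variable names are made up.

```python
entry_hex = "DEAD"                    # hex sequence as written in a table entry
as_stream = bytes.fromhex(entry_hex)  # bytes in the order a hex editor shows them

value = 0xDEAD                        # the same digits treated as a 16-bit number
big = value.to_bytes(2, "big")        # {0xDE, 0xAD}
little = value.to_bytes(2, "little")  # {0xAD, 0xDE}
```

Reading the entry as a byte stream gives the same bytes as the big-endian encoding of the number, so a table entry written in hex-editor order needs no endianness handling.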

4 - Reference Implementation

Just a thought, really. I know the spec is still in a draft stage, but it would be a great idea if, once the spec is finalized, a reference implementation were written.

5 - Behavior for dumping/inserting

The specification identifies proper behavior for dumping and inserting. While this information is useful and relevant, it is not part of the file format and thus technically does not belong in the text of the spec. A quick example, section 2.2.1 explains what should happen when an entry is not found, but this information doesn't define the table file format; it prescribes behavior for dumping and inserting.

It's obviously doing more good than harm being there, but should probably be identified as supplemental information.

6 - Lexical Definitions

Again, just a thought. The grammar of the format is pretty simple, and this certainly isn't a must, but I see value in a formal lexical definition of the format, which eliminates ambiguity. E.g.:
Code:
HexDigit :=
    0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F
HexByte :=
    HexDigit HexDigit
HexValue :=
    HexByte+
// Where + would denote 1 or more instances of a value
// Et cetera...
 



7 - End Tokens

From the spec, section 2.4:
Quote:
There may be an unlimited number of end tokens. Two variations are supported.
One with the '=' character
for end tokens requiring output to the script, and
one for end tokens that will not appear in the script
...An end token can be defined
having no actual hex sequence associated
with it. No actual hex sequence will be inserted in these cases.

It sounds like you are describing three different types of end tokens. Normal end tokens (/FF=<END>), those that will not appear in the output script (/FF), and then those that have no actual hex sequence associated (/<END>). I think this really needs clarification, especially since the latter two most certainly introduce ambiguity. (Is /abcd meant to represent $ABCD or "abcd"?)

Also, what happens when dumping if there are multiple end tokens without associated hex. I.e. how does the dumper decide which to insert? It would make sense to impose a limit of only one of these no-hex end tokens.



Sorry, this list was a lot shorter, but I keep coming up with questions as I work on my utility.
« Last Edit: Mar 6th, 2011 at 1:34pm by snarf »  
 
 
Nightcrawler
Re: 'Standard' Table File Format
Reply #53 - Mar 7th, 2011 at 2:04pm
 
Thanks for the feedback. Not too many people are interested in this type of thing. :)

snarf wrote on Mar 4th, 2011 at 8:12pm:
1 - Hex Casing

I'm pretty sure that it's a safe assumption that either casing is acceptable, but you might want to make this explicit in the spec. All the examples use upper-case, and an implementer might not think to support lower-case hex digits.


Correct. It's assumed to be either case. I agree it should be explicit. It seems to make sense to put this into a lexical definition type section defining what a hexadecimal sequence is for the purpose of this document. Not sure where it best fits into the document. Maybe at the beginning of section 2.0?

Quote:
2 - White-Space

It seems like the best thing would be for the spec to specify that there should be no white-space except as part of a text string, but recommend parsers be tolerant of erroneous white-space. Alternatively, it could be specified that extra white-space is either acceptable or completely invalid.


There should be another note added to 2.2 that lines cannot start with white-space. The term 'notes' should probably be changed to 'rules' as well. The whole thing has been designed with the idea that each line's entry type can generally be determined by simply parsing the first character on the line. So, I agree we don't want to allow white-space here, to keep parsing simple. White-space doesn't exist as a separate thing to the right of the '=' character. For our purposes, it's part of the text sequence. So, I think we're in agreement here.
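A parser following these rules might look like the Python sketch below. `parse_normal_entry` and `HEX_RE` are hypothetical names, and rejecting white-space before the '=' is an assumption: the discussion above only settles leading white-space and the text side.

```python
import re

# One or more hex bytes, upper or lower case.
HEX_RE = re.compile(r"(?:[0-9A-Fa-f]{2})+")


def parse_normal_entry(line):
    """Split 'HEX=text' into (bytes, text); the text keeps all of its white-space."""
    if line[:1].isspace():
        raise ValueError("line may not start with white-space")
    hex_part, sep, text = line.partition("=")  # split on the first '=' only
    if not sep or not HEX_RE.fullmatch(hex_part):
        raise ValueError(f"not a normal entry: {line!r}")
    return bytes.fromhex(hex_part), text
```

Under this sketch, `B1FF= Text` yields the text sequence " Text" with its leading space, while ` B1FF = TEXT` is rejected rather than silently trimmed.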

Quote:
3 - Endianness

I'm just looking for a little clarification here. You discuss endianness in the spec, but I'm not clear as to how it actually comes into play. When dumping text, aren't we viewing ROM data as a byte stream? In this case there is no endianness, just single bytes, and I would expect the hex values in the table to be specified in actual order.


It doesn't really come into play. We're always talking Big Endian for table entries, but that's only clear from examples, not from a direct statement. The table file, by its nature, must be big endian to function in the same manner as it appears in the ROM or byte data stream. We're defining that.

Quote:
4 - Reference Implementation

Just a thought, really. I know the spec is still in a draft stage, but it would be a great idea if, once the spec is finalized, a reference implementation were written.


I have written a WIP one in C# alongside developing this draft for my own utilities. I would anticipate releasing it when/if it eventually reaches release-worthy status. I'm really only an intermediate-level programmer as far as software design, organization, and efficiency go. I'm not sure my code is really worthy of being used as an exemplary model to follow. In any event, this one in C# would be the only one I personally would code. Feel free to code one! Having several is no problem. :)

Quote:
5 - Behavior for dumping/inserting

The specification identifies proper behavior for dumping and inserting. While this information is useful and relevant, it is not part of the file format and thus technically does not belong in the text of the spec. A quick example, section 2.2.1 explains what should happen when an entry is not found, but this information doesn't define the table file format; it prescribes behavior for dumping and inserting.

It's obviously doing more good than harm being there, but should probably be identified as supplemental information.


I understand where you're coming from. I've thought this over before, and after consultation with several others, it was decided the document should encompass the behavior necessary to standardize mapping using this format. The scope includes everything necessary to map hex to text and text to hex using this format. We're already defining sequences in the format and what to do with them, so why can't we make explicit what should happen in ambiguous, duplicate, or not-found cases? Why not make a standard so that a reference implementation will always be what users get from every tool?

Let's pretend we took out all behavioral information in that document. Now you write a dumper/inserter and I write a dumper/inserter. What's the result? We're going to get output VERY different from each other, defeating the purpose of the whole thing. Cross utility use would not be possible because all of those situations are treated differently at will between my utilities and yours. If that's the case, who really cares if our table file is the same when we get vastly different incompatible results? You've lost all interoperability and the reason to set out with this standard to begin with.

The way things are, I imagine your implementation and my implementation generate very compatible output. :)

We're not dictating how to dump or insert, merely how to map using our table file format. Yes, there is a bit of overlap, as the table file and its usage are a core part of the dumping and insertion process and task.

Quote:
6 - Lexical Definitions

Again, just a thought. The grammar of the format is pretty simple, and this certainly isn't a must, but I see value in a formal lexical definition of the format, which eliminates ambiguity. E.g.:


This connects to number one. It's a fine idea and I may add some definitions. However, I doubt I will develop it to the extent your example alludes to. I just don't have much interest in writing it. If you would like to write something like that out for the document, I would surely include it.

Quote:
7 - End Tokens

From the spec, section 2.4:
It sounds like you are describing three different types of end tokens. Normal end tokens (/FF=<END>), those that will not appear in the output script (/FF), and then those that have no actual hex sequence associated (/<END>). I think this really needs clarification, especially since the latter two most certainly introduce ambiguity. (Is /abcd meant to represent $ABCD or "abcd"?)


It needs more clear wording, however there are indeed only two cases. Basically, it's either you have or do not have hex representation. Let me clarify:

Case 1:
/FF=<END>\n\n

In this case, this end token has hex equivalent of 0xFF, right? So, it will appear in the script when you dump and when you insert, it would be expected to insert 0xFF to the ROM. Clear?

Case 2:
/<END>\n\n

In this case, this end token has no hex equivalent. This is used in all cases where you need an end token for string start/stop and pointer applications, but neither need nor have any hex representation for dumping or inserting. Think fixed-length strings, Pascal strings, or similar scenarios where an artificial end token may be desired. Take a look at the Atlas documentation for even further usage scenarios. Suffice it to say, we need these two variations of end tokens, and together they should cover all possible scenarios.

There is no ambiguity. "/abcd" sets up an end token with text sequence "abcd". You are using the variation with no hex sequence, so it's always a text sequence. "/$abcd" sets up an end token with text sequence "$abcd".
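The two cases can be told apart mechanically, as the Python sketch below shows. `parse_end_token` is a hypothetical name, formatting codes such as \n are left unexpanded, and the sketch assumes that a no-hex end token never contains '=' in its text, since such a line would be read as Case 1.

```python
def parse_end_token(line):
    """Return (hex_bytes_or_None, text) for a table line starting with '/'."""
    if not line.startswith("/"):
        raise ValueError("end tokens start with '/'")
    body = line[1:]
    hex_part, sep, text = body.partition("=")
    if sep:
        # Case 1: /FF=<END> has a hex equivalent that is dumped and inserted.
        return bytes.fromhex(hex_part), text
    # Case 2: /<END>, /abcd, /$abcd are text only, with no hex representation.
    return None, body
```

So `/abcd` comes back as the text sequence "abcd" with no hex bytes, matching the reading given above.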

Quote:
Also, what happens when dumping if there are multiple end tokens without associated hex. I.e. how does the dumper decide which to insert? It would make sense to impose a limit of only one of these no-hex end tokens.


That's up to the dumper. In fact it has to be. Since there is no hex representation, there's nothing to define in the table file about it. It won't be in the dumping data stream. It's in the table for insertion purposes. Secondly, you might be dumping somewhere where you end up outputting different end tokens in the same dump.

If it's not clear and you want to discuss a specific scenario, I'll explain further. I think the end token setup here is pretty solid and pretty much follows Atlas, which has been around for a number of years and been able to cover all usage scenarios I'm aware of.

Quote:
Sorry, this list was a lot shorter, but I keep coming up with questions as I work on my utility.


No problem. It's good to have feedback and discuss things further to ensure we end up putting out a decent standard that all of 6 people will probably use. I'm glad you're one of them! :)

ROMhacking.net - The central hub of the ROM hacking community.
 
Nightcrawler
YaBB Administrator
*****
Offline


The Dark Angel of Romhacking

Posts: 3245
USA
Gender: male
Re: 'Standard' Table File Format
Reply #54 - Mar 18th, 2011 at 3:19pm
 
OK, I addressed all items in Tau's errata document and the few items in the last several posts in this topic. The most significant change is a rewrite of section 2.4 to better clarify end tokens. The specification did not change on this, but the whole section was not very clear. You know, this stuff all makes perfect sense in my head, but I sometimes find it difficult to write it all out in a clear fashion. This project has certainly shown me there's plenty of room left for improvement in the technical writing department! I'm getting better though. Smiley

One comment:
Quote:
*557      |You used to write table ids in all-caps in the other examples

I think this helps demonstrate table ids can be upper or lower-case rather than it being a document consistency issue. If they were all upper case, it may lead the reader to think they are required to be that way.
 
Tauwasser
Peasant
*
Offline


Evil Impersonator

Posts: 14
Re: 'Standard' Table File Format
Reply #55 - Apr 4th, 2011 at 2:11pm
 
Hi,

first of all, I'm sorry for not answering in such a long time. Had to earn a degree and stuff Wink

Nightcrawler wrote on Feb 16th, 2011 at 3:16pm:
I would even almost agree with your idea to remove the formatting entirely. However, I don't think that's a good idea in practice as you describe because:

  • You require a second step after the initial dumping utility to get any line breaks in the script at all. (Referring to running regex or programming a rudimentary Java utility.) It seems that complicates the process for no good reason.
  • Many end users will not be programmers or understand regex. In my opinion, they shouldn't have to be skilled in those areas in order to be able to dump simple text with a utility using this standard.
  • The vast majority (if not all) of applications of dumping ROM text to a text file require line breaks in some capacity. It simply isn't very readable to a human without them. For such strong usage case, it makes sense to me that users expect line breaks in their text dump. It either has to be included in the standard or at least be incorporated into the dumping utility somehow.


Also, we're not doing anything Atlas specific with this whole commenting thing. It's desirable in general terms (for translation work) to be able to dump text from a ROM and precede it by some distinguishing mark so it would not be included in the insertion later.


I admit I did not think about spacing between dumped strings. I personally feel that dumped strings do not need line breaks inside of them. But that's just my opinion.

As for separating the dumped strings, I think the dumping tool should handle that.

As for the usability issues with regex: personally, I would favor a regex syntax. It is just one thing to remember, with no special codes that have to be looked up after half a year of not using a table file.
People will have to learn a lot as is.

Also, most implementers will not need to write a regex engine by hand (and most would not be able to, either). Instead, regex implementations for the major programming languages involved so far are available free of charge under non-restrictive licensing if any.

Notice that a grouping-system for regex already exists.

Code:
/FF=\n//\1 



Would precede the whole dumped line (which would be capture group 1) with a line break and "//". Problem for Atlas solved. And yes, I do think this is somewhat specific for Atlas and old-fashioned ways that still find their ways in Rom Hacking in general.
Some standard regex strings could be given as well.
I advocate separating original and translated text, though. See below.

However, for a slightly different (hypothetical) use case,

Code:
/FF=/1<END>\n// 



might be a simple regex to meet the insertion utilities' format.
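For the first example above, a rough Python sketch of how a dumper could expand such a template follows. The helper name `apply_line_template` is made up for illustration, and treating the whole dumped line as capture group 1 simply mirrors the description above; it is not an agreed syntax.

```python
import re


def apply_line_template(dumped_line, template):
    """Expand a regex-style template in which group 1 is the whole dumped line."""
    # "(.*)" with DOTALL matches any string, so the match always succeeds.
    match = re.fullmatch(r"(.*)", dumped_line, flags=re.DOTALL)
    # Match.expand() turns "\n" into a newline and "\1" into group 1.
    return match.expand(template)


# Template taken from the "/FF=\n//\1" entry: line break, "//", then the line.
print(repr(apply_line_template("Hello world", r"\n//\1")))  # '\n//Hello world'
```

The point is only that an existing regex library does the heavy lifting; the utility just has to hand it the template from the table entry.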

Nightcrawler wrote on Feb 16th, 2011 at 3:16pm:
How else could you omit it?


You would be able to omit it by not having it mangled in-between the translated text to begin with. Quite a few translation memory tools, if not all, do this using XML or other generic file types. That way, one text "dump" can be translated to multiple languages and later recombined at will.

However, I realize the Rom Hacking Community is generally ages behind that and still uses ANSI text files for most things. Still, if I had to choose, I would advocate using separate files.

A simple layout using Notepad++ (and recreatable in jEdit, which is cross-platform) could look something like this:

...

Please also notice that no line breaks inside dumped strings are used. Ignore the XML-like tags, they're used for my custom dumper/inserter.

I'd imagine text scrolling could be an issue, but luckily I don't use line breaks in text, so disabling word wrap would solve alignment issues.

Nightcrawler wrote on Feb 16th, 2011 at 3:16pm:
To me, the problem here isn't whether or not to include a line break formatting code, but simply how to better facilitate a predominant usage case scenario.


First off, a little regex never hurt anybody and I think many people can benefit from using it from time to time.

Secondly, as seen above, it would not require an all-out in-depth knowledge of regex to get it working the way some people want.

You also use some predefined strings to mean the whole line etc.
However, experienced programmers will likely implement it using regex to determine what type of entry and what line end code behavior is wanted anyway.
Inexperienced programmers are in a sea of pain because they have to do string comparisons -- which likely will turn out not to be in compliance with Unicode -- and have to spend much more time on implementation. Also, their implementation would then most likely break on specific non-ASCII cases when end tokens use accents, Japanese, etc.

snarf wrote on Mar 4th, 2011 at 8:12pm:
3 - Endianness

I'm just looking for a little clarification here. You discuss endianness in the spec, but I'm not clear as to how it actually comes into play. When dumping text, aren't we viewing ROM data as a byte stream? In this case there is no endianness, just single bytes, and I would expect the hex values in the table to be specified in actual order.

I understand that platforms use multi-byte words that have endianness. One system might represent the value 0xDEAD as {0xDE, 0xAD} and another as {0xAD, 0xDE}, but it seems that the table file should define an entry using the actual byte order. I would expect the table file to define the hex value in the same manner it would appear in a hex editor. Perhaps I am completely misunderstanding the significance of endianness in the context of the table file.


I'm a little confused. The very thing you describe, i.e. "in the same manner it would appear in a hex editor", depends on endianness to begin with.

However, Nightcrawler's clarification is correct. We're talking big-endian values when talking about specific values in the table file, because they are read sequentially and interpreted as such.
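A small Python illustration of that reading, using snarf's 0xDEAD value. The helper name `table_hex_to_bytes` is hypothetical; the point is only that the hex string in the table is consumed left to right, two digits per byte.

```python
def table_hex_to_bytes(hex_sequence):
    """Read a table hex sequence sequentially, two digits per byte."""
    return bytes.fromhex(hex_sequence)


entry = table_hex_to_bytes("DEAD")
print(entry == bytes([0xDE, 0xAD]))                  # True: same order as in a hex editor
print(int.from_bytes(entry, "big") == 0xDEAD)        # True: matches the big-endian value
print(int.from_bytes(entry, "little") == 0xADDE)     # True: a little-endian word read differs

# Matching against ROM data is then a plain byte comparison.
rom = bytes([0x00, 0xDE, 0xAD, 0xFF])
print(rom[1:1 + len(entry)] == entry)                # True
```

So if a game stores the word 0xDEAD in little-endian order, the table entry would be written as "ADDE", because that is how the bytes sit in the ROM.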

snarf wrote on Mar 4th, 2011 at 8:12pm:
5 - Behavior for dumping/inserting

The specification identifies proper behavior for dumping and inserting. While this information is useful and relevant, it is not part of the file format and thus technically does not belong in the text of the spec.


I beg to differ. While the text aims at unifying "table files", what is being unified is the textual representation of certain dumping and insertion processes and their matching behavior between textual representation and hexadecimal representation of the game script. As such, these definitions are naturally a part of the spec.

However, this might need to be explicitly stated somewhere.

snarf wrote on Mar 4th, 2011 at 8:12pm:
6 - Lexical Definitions

Again, just a thought. The grammar of the format is pretty simple, and this certainly isn't a must, but I see value in a formal lexical definition of the format, which eliminates ambiguity.


I concur and would also offer to provide EBNF grammars and regex strings for matching entries.
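As a taste of what such a regex could look like, here is a hedged Python sketch for a plain "hex=text" entry. The pattern and the `parse_normal_entry` name are assumptions made for illustration (an even number of hex digits in either case, then "=", then any text, possibly empty); this is not the grammar from the spec.

```python
import re

# Assumed shape of a normal entry: one or more hex byte pairs, "=", then text.
NORMAL_ENTRY = re.compile(r"^(?P<hex>(?:[0-9A-Fa-f]{2})+)=(?P<text>.*)$")


def parse_normal_entry(line):
    """Return (hex_bytes, text) for a normal entry, or None if it does not match."""
    match = NORMAL_ENTRY.match(line)
    if match is None:
        return None
    return bytes.fromhex(match.group("hex")), match.group("text")


print(parse_normal_entry("0A0B=text"))  # (b'\n\x0b', 'text')
print(parse_normal_entry("41=A=B"))     # (b'A', 'A=B')
print(parse_normal_entry("4=A"))        # None: odd number of hex digits
```

Only the first "=" separates hex from text, so the text sequence may itself contain "=".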

Nightcrawler wrote on Mar 18th, 2011 at 3:19pm:
One comment:
Quote:
*557      |You used to write table ids in all-caps in the other examples

I think this helps demonstrate table ids can be upper or lower-case rather than it being a document consistency issue.


I just reread that particular part of the document. It might be worthwhile to explicitly state the casing for table ids and labels in general just like for hexadecimal values.

I will get to reread the document in whole soon, so I guess this is it for now.

cYa,

Tauwasser
 
Nightcrawler
YaBB Administrator
*****
Offline


The Dark Angel of Romhacking

Posts: 3245
USA
Gender: male
Re: 'Standard' Table File Format
Reply #56 - May 20th, 2011 at 9:33am
 
Well, several more items are being kicked around over at RHDN:
http://www.romhacking.net/forum/index.php/topic,12644.0.html

It's already been a year. I'm not sure I want to allow it to start to run away, unravel, and go off the tracks. Discussion over there is going in the direction of significant added complexity. I sent a message to Klarth, and I e-mailed Tau. I'm interested to hear what some of you think about the new discussion opened up in the link above, to help me determine whether I want to keep kicking this around or start to tighten up, finish, and move on. It's starting to become a bit draining, with no end in sight if it continues like this.

This is really a reminder of why we never had a standard before, and why our community in general has so much trouble making any successful ones (patching format standards are a case in point).
 
Nightcrawler
YaBB Administrator
*****
Offline


The Dark Angel of Romhacking

Posts: 3245
USA
Gender: male
Re: 'Standard' Table File Format
Reply #57 - Jun 10th, 2011 at 3:06pm
 
After lengthy discussions over at RHDN, a number of items were changed. This is a rough compilation of pages and pages of discussion thus far. I have not yet had time to proofread or spell check the current draft.

General Changes:

  • Created section 2.2.1 "Raw Hex Representation". Shifted all other subsections down one number and changed section references accordingly.
  • 2.2 Forbade the "<$XX>" pattern in text sequences. It is still possible for table entries to combine to create the raw hex sequence and be inserted improperly as a result. We agree to live with this possibility.
  • Revised Section 2.3. Renamed and reworded to reflect a single formatting sequence. Removed all references to comments or '//' in the example.
  • Revised Section 2.4. Removed artificial end tokens (end tokens with no hex representation) and reworded references to the "\n" formatting sequence. Added a note that duplicate end token text sequences follow 2.2.7 rules.
  • 2.2.6 + 2.6 Added clarification on single logical table duplication and unique table names across all tables. Added clarification that the "\n" sequence is ignored when checking for duplication and provided an example.
  • Edited 2.6 to specify one table per table file.
  • Amended 2.2.5 to provide a recommended longest hex algorithm for text collision resolution and to make note of other, more intelligent algorithms.
  • 2.2.3 Amended the no-entry-found behavior for inserting to require generating an error rather than ignoring and continuing.
  • Section 5. Lexical Definitions - Started this section. It was a good idea. I don't have much motivation to expand and write it though. I'm hoping someone else will take over this task.



OUTSTANDING BUSINESS:

  • As pointed out in abw's first post, we have an issue with leaving in the formatting '\n' and trying to add comments. We can push comments out to the utility realm, but it is still difficult to comment each line and, say, have no comments between strings as in this example.
  • Along with the above, consider the "/FF=/1<END>\n//" regex-like syntax that could be used as an alternative. Possible reconsideration of regex in table entries? See Tau's post.
  • Raw hex causes several issues. First, a combination of normal entries may inadvertently output a raw hex sequence during dumping and thus not insert correctly. Secondly, inserting raw hex can cause issues for table switching behavior. Lastly, it can cause a subsequent token to be interpreted as a different token upon insertion. One solution to several of these issues, proposed by abw, was a general game control code and disallowing <> type characters in normal entries.
  • There are some insertion issues that can arise where less intelligent insertion (such as longest prefix) could result in text being interpreted as different tokens upon insertion. See the example at the bottom of this post.
  • Allow multiple tables per file? It may be useful to have all kanji/hira/kana in a single file, even if different logical files. -Current ruling is to leave single table per file.
  • Linked Entry formatting strings. "$XX=<token %2X %2X>" Cleaner, easier to validate, but increases complexity.
  • Insertion for table switching details. Outside of the table file itself, but still needed. Thus far, no suitable solution has been determined by anyone. Working on reducing the supported features of table switching to something more manageable.
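
To illustrate the raw hex and longest prefix items above, here is a rough Python sketch of a naive inserter. Everything in it is an assumption for illustration: the function name, the sample table, and the choice to check for a "<$XX>" raw hex pattern before consulting the table. It raises an error when no entry matches, in line with the 2.2.3 change.

```python
import re

RAW_HEX = re.compile(r"<\$([0-9A-Fa-f]{2})>")


def insert_longest_prefix(text, table):
    """Encode text greedily, always taking the longest matching table entry."""
    out = bytearray()
    pos = 0
    max_len = max((len(t) for t in table), default=0)
    while pos < len(text):
        raw = RAW_HEX.match(text, pos)
        if raw:  # assumed rule: "<$XX>" is always treated as a raw byte
            out.append(int(raw.group(1), 16))
            pos = raw.end()
            continue
        for length in range(min(max_len, len(text) - pos), 0, -1):
            chunk = text[pos:pos + length]
            if chunk in table:
                out += table[chunk]
                pos += length
                break
        else:
            raise ValueError(f"no table entry matches text at position {pos}")
    return bytes(out)


# Five normal entries whose dumped text happens to spell a raw hex sequence.
table = {"<": b"\x10", "$": b"\x11", "0": b"\x12", "1": b"\x13", ">": b"\x14"}
print(insert_longest_prefix("<$01>", table))  # b'\x01'
```

The ROM bytes 10 11 12 13 14 would dump as "<$01>", yet inserting that text yields the single byte 0x01, which is the round-trip problem described in the raw hex item.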
« Last Edit: Jul 6th, 2011 at 1:08pm by Nightcrawler »  

 
Nightcrawler
YaBB Administrator
*****
Offline


The Dark Angel of Romhacking

Posts: 3245
USA
Gender: male
Re: Project - Table File Standard
Reply #58 - Aug 18th, 2011 at 9:13am
 
We're looking to close up the outstanding business over at the RHDN topic. We're getting through most of it. It shouldn't be too much longer. This has been holding up development of TextAngel, so I'm really serious this time that the next draft that incorporates the remaining business will in fact be the final draft and undergo final review by all interested parties. It will be feature complete, so at that point only edits and clarifications would be considered unless there was some glaring oversight.

After a year on this thing, this is probably the last standard I will head up. I've learned my lesson! Tongue
 
Nightcrawler
YaBB Administrator
*****
Offline


The Dark Angel of Romhacking

Posts: 3245
USA
Gender: male
Re: Project - Table File Standard
Reply #59 - Mar 28th, 2012 at 1:33pm
 
Rejoice! A new draft has been released that covers all outstanding business. It should be very close to final and now is officially entering the final review stage! Smiley

At this point, it should be safe for interested parties to start early development based on the current draft. No new features will be added, and only minor changes and edits will be made. I hope to have all major involved parties review it soon and am targeting a release by May, which will be the two-year mark! Cheesy

Many changes were made to the document, including a full reorganization of section 2. If you have previously read the document, you should read through it again, as this was a major edit with feature and rule changes.

Generalized List of Changes:

  • Total reorganization of Section 2 based around Normal and Non-Normal entry grouping mentality.
  • Added Declaration of Conformance section.
  • Added information on level of compliance and allowed insertion algorithms.
  • Added additional information to the overview.
  • Standardized a common format and rules for all Non-Normal entries.
  • Control Codes (formerly linked entries), end tokens, and table switching became Non-Normal entries.
  • Limited "\n" line break use to controlled circumstances in Non-Normal entries only.
  • Control Codes were reworked with formatting and inclusion of parameters.
  • Table Switching section was redone to be fleshed out with details discussed at RHDN.
  • Rules were added or changed in all sections to handle many edge cases presented in RHDN discussion.


(Moderator: Nightcrawler)