yDecode for Windows v4.0

MANUAL PAGE 2

Homepage - www.i-asm.com/yDecode/.
The latest version of yDecode can be downloaded as a zip file
from this link: yDecode.zip. Please note the terms of use.

Author: IanB - ian(at)i-asm.com

Detecting yEnc and UUE encodes

Before any decoding can take place, two global path settings must be confirmed: the input folder (where yDecode looks for all encodes) and the default output folder (where all unfiltered files are decoded to).

Previous versions of yDecode relied on a preprocessor feature of NewsPlex that ran an executable file (yCheck.exe) on every raw .OK article file NewsPlex downloaded, before the file was saved in the NewsPlex \async directory, so that it could be renamed automatically. This is no longer required.

yDecode creates its own queue of files to process from the input folder when the current queue is empty. It checks files for yEnc or UUE signatures and if it finds one will rename it with a .YNC or .UUE extension to flag it for further processing. For safety it will NOT, however, look in the following types of files for data (some of these are NewsPlex/yDecode working file types):

.YNC .YBAD .UUE .UBAD .DONE .LOG .DEL .BAD .DYC .DUE .SPLT .TMP
.ZIP .RAR .MP3 .RAM .RM .RA .MPG .MPEG .AVI .WMV .JPG .JPEG .BMP .GIF .PNG
.HTM .HTML .DOC .CRC .SFV .CSV .PAR .PAR2 .BAT .EXE .COM .DLL .INI .REG

In this detection phase yDecode only renames the files; it does not move or decode them. Only the first 4096 bytes of a file are tested, which is enough to reach beyond the Usenet headers of most raw saved Usenet articles into the data.

Although strictly speaking the Usenet headers are not necessary for raw decoding, many of yDecode's features rely on two lines at least being present - the "Subject: " and the "Newsgroup: " header lines. UUE decoding in particular will fail without certain required multipart information present in the subject line, and file filtering cannot be performed without the newsgroup source information, nor will automatic repair requesting be possible.

The yEnc check is for the text "=ybegin" immediately after a carriage return/linefeed (CR-LF) pair, as per the yEnc specifications. This allows both yEnc version 1 and putative yEnc version 2 files to be flagged. The UUE check is for the text "begin " or "end" immediately after a CR-LF pair, or for a valid UUE line 300 bytes (approximately 4 or 5 valid lines) from the end of the test segment. A valid UUE line is one whose length matches the length given by its decoded first character.
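The detection checks described above can be sketched roughly as follows. This is only an illustration in Python, not yDecode's actual code. The names `classify` and `valid_uue_line` are hypothetical. It also assumes that lines end in CR-LF, and that "300 bytes from the end of the test segment" means the first complete line starting at or after that offset.

```python
from typing import Optional

TEST_BYTES = 4096        # only the start of each file is examined
UUE_PROBE_OFFSET = 300   # distance from the end of the test segment


def valid_uue_line(line: bytes) -> bool:
    """True if the line length agrees with its decoded first character."""
    if not line:
        return False
    n = (line[0] - 32) & 0x3F            # number of data bytes on the line
    if n == 0:
        return False
    expected = 1 + 4 * ((n + 2) // 3)    # length char + 4 chars per 3 bytes
    return len(line) == expected and all(32 <= c <= 96 for c in line)


def classify(segment: bytes) -> Optional[str]:
    """Return 'YNC', 'UUE' or None for the first bytes of a raw article."""
    segment = segment[:TEST_BYTES]
    if b"\r\n=ybegin" in segment:
        return "YNC"
    if b"\r\nbegin " in segment or b"\r\nend" in segment:
        return "UUE"
    # Probe one data line near the end of the segment (assumed reading).
    start = segment.find(b"\r\n", max(0, len(segment) - UUE_PROBE_OFFSET))
    if start != -1:
        stop = segment.find(b"\r\n", start + 2)
        if stop != -1 and valid_uue_line(segment[start + 2:stop]):
            return "UUE"
    return None
```

The return value corresponds to the extension the file would be renamed with; a result of `None` means the file is left alone.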

Because the extra yCheck program is no longer required to detect and queue files, you must ensure that if you have previously been using it, all references are removed from the NewsPlex startup environment. If you are using NewsPlex 3.9 or earlier, make sure this text is REMOVED from the NewsPlex startup command line:

-X yCheck30.exe

For NewsPlex 4.0 or later make sure the following line is REMOVED from the [async] section of the etc\newsplex.ini file:

executeCmd=yCheck30.exe

and that if no other preprocessor is being used the following option is set:

executeYes=0

File filtering and routing

When a file is decoded, it can be routed to any desired output folder by adding and using filters. These examine the newsgroup source of the message and the file extension of the binary. It is therefore easy to create a set of filters that will route one set of files (say pictures) to one folder and any other types (say videos or music) to any number of others, or have more complex combinations depending on where the binary was originally posted.

Filters are listed in the INI file in an easily editable format, or they can be edited from within the program. They are listed in a strict checking order, which should have the MOST specific first, ie. "alt.binaries.multimedia" before "alt.binaries" or just "multimedia".

For simplicity, all text matches are treated as wildcarded, ie. "alt" is equivalent to "*alt*", so it is advised to use dot delimiters to ensure the desired results (ie. if you really want to match "alt." not "waltons"!). To avoid spurious matches, each filter must be at least 3 characters long. There are, however, no actual wildcard characters supported.

The text match string is immediately followed (no space) by all the file extensions to check for in brackets, a space between each one listed. The extensions are NOT implicitly wildcarded and must be matched exactly for intended results. Extensions can have minimum 2 characters, maximum 4, eg.

alt.binaries(mpg mpeg avi rm)=<path>

To route all files from that matched group rather than particular types, the extension string should be just (*) ie.

multimedia(*)=<path>

To route different types of files from the same source groups, each set will need to be listed explicitly, or you can use the (*) all files option after routing other files first from the match group, eg.

alt.binaries(jpg jpeg gif bmp)=<image path>
alt.binaries(ra mp3)=<music path>
alt.binaries(*)=<video and everything else path>

Any files not matching filter sources or file extensions will go to the default output directory, given in the main [yDecode Options] section of the INI file. All verification files (CRC, SFV, CSV, PAR, Pnn and PAR2) are routed deliberately to a separate \verification folder for processing before user rules are checked, so these file extensions are ignored/disallowed.
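The routing rules above can be illustrated with a short Python sketch. It is an assumption-laden model rather than the program's own logic: the names `parse_filter` and `route` are hypothetical, matching is assumed to be case-insensitive, the verification folder is represented by the plain string "verification", and the 2 to 4 character limit on extensions is not enforced.

```python
import re
from typing import List, Tuple

# Extensions always sent to the verification folder ("Pnn" = .P01, .P02 ...).
VERIFICATION_EXTS = {"crc", "sfv", "csv", "par", "par2"}

# <match text of 3+ chars>(<ext ext ...>)=<path>
FILTER_RE = re.compile(r"^(?P<match>[^()=]{3,})\((?P<exts>[^)]*)\)=(?P<path>.*)$")


def parse_filter(line: str) -> Tuple[str, List[str], str]:
    """Split one INI filter line into (match text, extensions, path)."""
    m = FILTER_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a filter line: {line!r}")
    return m["match"].lower(), m["exts"].lower().split(), m["path"]


def route(newsgroups: str, filename: str, filters, default_dir: str) -> str:
    """Pick the output folder for a decoded file; the first matching filter wins."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext in VERIFICATION_EXTS or re.fullmatch(r"p\d\d", ext):
        return "verification"            # handled before any user rule
    for match, exts, path in filters:
        # The group text is an implicit substring match; extensions are exact.
        if match in newsgroups.lower() and (exts == ["*"] or ext in exts):
            return path
    return default_dir
```

Because the first matching rule wins, the order of the list matters in the same way as the order of the filters in the INI file.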

It is important to understand that when you have selected the use of filters, not only will incoming files be routed to the desired folders but also any verification sets will expect to find the files they match there as well!

If a file you expect should be recognised by a set is not being found and matched when the verification set is parsed, it is more than likely because it is not in the correct folder according to the given filter rules, perhaps because the rules settings are particularly complex. Importing the file manually through the User Actions dialog should relocate it if necessary.

TO AVOID UNEXPECTED RESULTS, IT IS ADVISABLE TO CHANGE THE FILTER RULES FROM WITHIN THE PROGRAM ONLY WHEN THERE ARE NO ACTIVE VERIFICATION SETS, AFTER INITIAL STARTUP HAS COMPLETED AND WHEN THE INPUT QUEUE IS EMPTY!

yEnc decoding specifications

Source yEnc files can contain multiple concatenated encodes. If each encode does not have its own Usenet header section, the details from the first (message subject and newsgroup source) will be used for all in that encode file.

Multipart yEnc data is inserted correctly into a .TMP file the size of the expected final file, and the part progress is noted in a small .DYC file made for each multipart binary which is deleted when all parts are finished. (This is an extended version of Jürgen's original .DEC file format for yDec). THESE FILES SHOULD NOT BE ALTERED OR THE MULTIPARTS WILL NOT BE CORRECTLY JOINED.

Like a .DEC file, the first part of a .DYC contains a list of data still missing that needs to be inserted. In this extended version, there then follows a list of ranges that were not fully added, ie. contained errors. This means that yDecode can decide when all the files for a multipart have been decoded, even if not all the ranges could be marked off successfully. There is also a message subject confirmation line to assist with verification.

A sample .DYC file might appear like this:

Subject=<message subject title minus bracketed part numbers>
100001,200000
Bad=
100001,200000
Completed

In this example, all the data except for the listed byte range above the "Bad=" line has been added correctly. That range may actually have been partially or even fully inserted into the file, depending on how much was decoded before the data error was detected in the segment. There is no "slack" in the output file, however, as its size, the size of the segment and its location in the final file are all known exactly from the yEnc wrapper information. Any missing data will be "junk", but whatever is good can possibly be parsed and repaired by PAR2.

The missing range is also listed below the "Bad=" line because it was detected as bad or incomplete when decoded, and the "Completed" line is appended if the byte ranges of all bad segments added together would complete the file. Bad ranges listed after the "Bad=" line, unlike those above it, are NOT joined together - each bad range line represents the data from a single incomplete decode.
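For readers who want to inspect a .DYC file (read-only!), the layout described above could be parsed along these lines. This Python sketch is based only on the sample shown; the names `parse_dyc` and `would_complete` are hypothetical, and the coverage test is one plausible reading of "all bad segments added together would complete the file".

```python
from typing import Dict, List, Tuple

Range = Tuple[int, int]


def parse_dyc(text: str) -> Dict[str, object]:
    """Read the subject, missing ranges, bad ranges and Completed marker."""
    state = {"subject": None, "missing": [], "bad": [], "completed": False}
    target = state["missing"]            # ranges above "Bad=" are still missing
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Subject="):
            state["subject"] = line[len("Subject="):]
        elif line == "Bad=":
            target = state["bad"]        # ranges below "Bad=" had decode errors
        elif line == "Completed":
            state["completed"] = True
        else:
            first, last = (int(v) for v in line.split(","))
            target.append((first, last))
    return state


def would_complete(missing: List[Range], bad: List[Range]) -> bool:
    """True if every missing range is covered by the combined bad ranges."""
    merged: List[Range] = []
    for first, last in sorted(bad):
        if merged and first <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], last))
        else:
            merged.append((first, last))
    return all(any(b0 <= m0 and m1 <= b1 for b0, b1 in merged)
               for m0, m1 in missing)
```

Applied to the sample file above, the single missing range is fully covered by the single bad range, which is why that file carries the "Completed" line.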

The "Completed" marker doesn't prevent good data from replacement segments coming in subsequently and overwriting any bad data, but it does allow PAR2 verification sets to guess that no more data is likely to be added. The file can then be closed and parsed for good data blocks without worrying about the results changing later.

If a file is "Completed", new PAR2 sets on being built will find it and check the data if there is a name match, removing the .DYC and closing the multipart. Conversely, if the yEnc file is aware that there is a PAR2 set waiting to verify it, it closes itself when all parts have been decoded, bad or not, and offers what is available for verification. Obviously, there may be a period between a file being "Completed" and the relevant PAR2 set being fully decoded when the multipart will be marked "Completed" in the .DYC but is not claimed by any set.

If yDecode detects a bad yEnc file, as advised by Jürgen in the yEnc specs it will rename the source (with a .YBAD extension) and add the error type into the filename. These bad files are left in the input folder for user reference and there is a counter of all bad files found in the main window. If the file contained multiple encodes, the number within the file is also added to the filename.

Depending on the error the file may be partially or even fully decoded:

a bad yEnc or UUE header (missing or non-compliant information) in which case it will not attempt to decode the data at all;
a data error during the decode (either the presence of a yEnc escape character at the end of a line, or a data over-run or under-run compared with the part or binarysize info in the yEnc header), when it will finish after writing the data it has already decoded, in case that part was OK and can be used for PAR2 recovery;
a bad yEnc trailer (either information is missing or not consistent with what is in the header, or it is otherwise non-compliant) in which case it finishes having written all the decoded data to the output file, as it may actually be OK or can be recovered by PAR2;
a CRC error (the CRC of the decoded file or part is checked against any CRC info provided in the yEnc header or footer) in which case it also finishes having written all the decoded data to the output file, in case it's OK or can be recovered by PAR2, as the error may actually be in a bad calculation done by some buggy posting software.
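As an illustration of the data error in the second case above, here is a minimal Python sketch of decoding a single yEnc data line. It follows the published yEnc encoding rule (each byte is offset by 42, and the "=" escape adds a further 64) but it is not taken from yDecode itself; the function name `decode_yenc_line` is hypothetical.

```python
def decode_yenc_line(line: bytes) -> bytes:
    """Decode one yEnc data line (without its CR-LF terminator)."""
    out = bytearray()
    i = 0
    while i < len(line):
        c = line[i]
        if c == 0x3D:                    # '=' escape character
            i += 1
            if i >= len(line):
                # An escape with nothing after it is a data error.
                raise ValueError("escape character at end of line")
            c = (line[i] - 64) & 0xFF
        out.append((c - 42) & 0xFF)
        i += 1
    return bytes(out)
```

A decoder built this way can keep whatever it produced before the exception was raised, which matches the behaviour described above of writing the data already decoded.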

I believe this program is compliant with the draft yEnc version 2 specs. However, some otherwise good yEnc files proved not to be fully compliant even with the version 1 specs. To ensure reliable decoding, I had to completely remove the error checking routines that tested the length of message lines. In addition, not all CRC values, even when valid, were expressed in the full 8-character hex format including leading zeroes.
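One way to tolerate CRC values written without leading zeroes is to compare them numerically rather than as text. The Python sketch below shows the idea using the standard `zlib.crc32` function; the name `crc_matches` is hypothetical and this is not necessarily how yDecode performs the comparison.

```python
import zlib


def crc_matches(data: bytes, declared: str) -> bool:
    """Compare the CRC32 of data with a hex value from a yEnc header or trailer."""
    try:
        want = int(declared.strip(), 16)   # accepts "1a2b" as well as "00001a2b"
    except ValueError:
        return False                       # not a hex number at all
    return (zlib.crc32(data) & 0xFFFFFFFF) == want
```

Parsing the declared value as a number makes "0" and "00000000" equivalent, and also ignores upper or lower case in the hex digits.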

UUE decoding specifications

yDecode recognises and decodes UUEncoded binary messages conforming to usual Usenet specifications. However, UUE is an inherently unreliable Usenet transfer format and IMHO should be avoided in favour of the newer yEnc format at all costs!

In particular, there is no information provided with a UUE message about the correct size of the datastream within the message, so it is impossible to verify if the stream in any message is complete or has been truncated. If it is a multipart, the byte position of a datastream within the final message is also completely unknown.

Because of this complete lack of confidence in the validity of any UUE datastream, only an external verification check (with filesize, CRC or preferably MD5 calculation via PAR/PAR2) can guarantee that the final joined datafile is the right length and perhaps correctly up/downloaded.

Any data up to the first decoding error will be written. Because of the lack of error-checking data in the format, errors can only be raised by:

a bad/missing header line or no "Subject: " line in the source message, at least one of which is required to extract the output binary name and any multipart information;
a linelength over- or under-run in a UUE datastream - the datalength is specified by the decoded value of the first character on the line, usually "M" for a value of 45 (ie. 61 characters in the line total);
any dataline in the message (except the very last in the very last part) being a different length from the first - trailing space truncation (for UUE types not using the ASCII 96 backtick character (`) to encode zero) is however automatically handled.
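The line-length checks in the list above can be sketched in Python as follows. This is an illustration under stated assumptions, not yDecode's code: the function name `decode_uue_line` and the `pads_with_space` switch (standing in for "UUE types not using the backtick to encode zero") are hypothetical.

```python
def decode_uue_line(line: bytes, pads_with_space: bool = True) -> bytes:
    """Validate and decode one UUE data line (without its line terminator)."""
    if not line:
        raise ValueError("empty line")
    n = (line[0] - 32) & 0x3F              # data bytes on this line, "M" -> 45
    expected = 1 + 4 * ((n + 2) // 3)      # "M" -> 61 characters in total
    if len(line) > expected:
        raise ValueError("line over-run")
    if len(line) < expected:
        if not pads_with_space:
            raise ValueError("line under-run")
        # Restore trailing spaces (each encodes zero) stripped in transit.
        line = line.ljust(expected, b" ")
    out = bytearray()
    for i in range(1, expected, 4):
        a, b, c, d = ((ch - 32) & 0x3F for ch in line[i:i + 4])
        out += bytes(((a << 2) | (b >> 4),
                      ((b << 4) | (c >> 2)) & 0xFF,
                      ((c << 6) | d) & 0xFF))
    return bytes(out[:n])
```

Masking with `0x3F` makes both the space (ASCII 32) and the backtick (ASCII 96) decode to zero, so the same routine handles either style of encoder.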

Non-compliant or broken UUE files are renamed with a .UBAD extension including the error type in the filename, and left in the input folder. There is a counter of all bad input files found in the main program window. Multiple UUE encodes in a single file are supported, as long as each has its own full set of Usenet headers or the datastream is only interrupted by completely blank lines (zero length except for a CR-LF pair).

As UUE files are rare in the groups I have been testing with, I would be grateful if any encodes that are not recognised properly by yDecode are sent to me so that I can improve the detection and parsing algorithm. If you know that you will be decoding UUEs regularly, you are strongly advised to turn OFF the option to delete UUE source files until you are sure that yDecode is handling the files you normally receive reliably.

The sourcefile deletion option for UUEs works slightly differently to yEnc. In order to help preserve UUE data and make it easier to store and find, if the files are NOT set to be deleted, they are renamed with a .DONE extension and moved from the NewsPlex \async folder into yDecode's \UUE subfolder. In the event of unsuccessful processing by yDecode, these source files will be intact for other UUE processors to operate on.

Multipart UUE messages are recognised by yDecode and decoded into separate split files (standard .001 .002 etc. format, joinable by any common Windows split-joining utility) in that separate \UUE subfolder. Until all parts have been downloaded and decoded successfully, these split sets are named with a root filename made by an MD5 hash of the subject string (without bracketed part numbers) they match to. This ensures that each set is uniquely named.
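The naming scheme above might look something like this in Python. The details are assumptions made for illustration only: the pattern used to recognise bracketed part numbers, the UTF-8 encoding of the subject, the use of the hex digest as the filename root, and the names `split_root_name` and `split_name` are all hypothetical, so the hashes produced will not necessarily match the names yDecode itself generates.

```python
import hashlib
import re

# Bracketed part counters such as "(03/55)" or "[3/55]" (assumed pattern).
PART_RE = re.compile(r"[\(\[]\s*\d+\s*/\s*\d+\s*[\)\]]")


def split_root_name(subject: str) -> str:
    """Hash the subject with its part numbers removed, so all parts agree."""
    stripped = PART_RE.sub("", subject).strip()
    return hashlib.md5(stripped.encode("utf-8")).hexdigest()


def split_name(subject: str, part: int) -> str:
    """Temporary filename for one decoded part, in .001 .002 split format."""
    return f"{split_root_name(subject)}.{part:03d}"
```

Because the part counter is removed before hashing, every part of the same post maps to the same root name, while different subjects give different roots.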

This is necessary because, owing to the lack of embedded information, the parts cannot be successfully concatenated into the final file until all of them are available in sequence. The output binary name is only given in the first part of a multipart message, so filter rules and output file naming cannot be applied until this has been queued and decoded.

yDecode tracks completion of the UUE multiparts by means of a .DUE file for each set in the \UUE subfolder. This is in much the same format as the .SPLT file for split sets described later. Although superficially similar to the .DYC file for yEnc multiparts described earlier, it is fundamentally different, as it shows completed rather than missing ranges. In addition, there are the following lines to aid processing at the top:

Subject=<message subject title minus bracketed part numbers>
BinaryName=<final binary name>
Source=<message source groups>

The "Subject=" line allows matching of the incoming multiparts to the set. The "BinaryName=" and "Source=" lines will be completed when the first part (which includes the "begin ###" header) has been processed. After these three lines the completed part number ranges are listed with the last number info. Any detected bad files for the set are at the end (may not be in order), eg.:

1,12
14,44
46,53
55,
Last=55
Bad=45,13,54
Completed

Note that single-part entries (rather than ranges) MUST be followed by a comma. Note also that the single bad entries are listed on the same line as the "Bad=" marker, separated by commas. They are NOT joined into ranges. Unlike the .SPLT file described later, the "Last=" line is NOT optional, and MUST show the total number of parts in the set. (This info is extracted from the bracketed number in the message subject line when decoding.)

If a good replacement file is downloaded and decoded by yDecode, it will be added to the set properly in the good range entries above the "Bad=" line and the bad entry for that part is removed.

Like yEnc multiparts, multipart UUEs will also show as "Completed" if the number of detected bad parts plus the number of good decoded parts equals the total. New PAR2 files will detect matches with "Completed" UUE splits in exactly the same way, and will automatically join whatever parts are available to extract any good data for repair, deleting the source splits and the .DUE file. UUEs also track whether they will be PAR2-verified, and will close themselves for verification if they are "Completed" and there is a PAR2 set to match with.
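The "Completed" rule for a .DUE file can be checked with a small read-only parser such as the Python sketch below. It is based solely on the sample layout shown earlier; the names `parse_due` and `all_parts_accounted_for` are hypothetical and the real program may well handle cases this sketch does not.

```python
from typing import Dict


def parse_due(text: str) -> Dict[str, object]:
    """Read header lines, good part ranges, Last=, Bad= and Completed."""
    info = {"good": [], "bad": [], "last": None, "completed": False}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if sep and key in ("Subject", "BinaryName", "Source"):
            info[key] = value
        elif sep and key == "Last":
            info["last"] = int(value)
        elif sep and key == "Bad":
            info["bad"] = [int(v) for v in value.split(",") if v]
        elif line == "Completed":
            info["completed"] = True
        else:
            # "1,12" is a range; "55," is a single part with a trailing comma.
            first, _, last = line.partition(",")
            info["good"].append((int(first), int(last or first)))
    return info


def all_parts_accounted_for(info: Dict[str, object]) -> bool:
    """True when good parts plus bad parts equal the total given by Last=."""
    good = sum(last - first + 1 for first, last in info["good"])
    return info["last"] is not None and good + len(info["bad"]) == info["last"]
```

For the sample listing shown earlier, the good ranges cover 52 parts and there are 3 bad parts, giving the 55 stated in the "Last=" line.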

The only difference in this behaviour is that if the first part in a multipart UUE has not been decoded, there is no output binary name to name the set with! In this case, the file is concatenated from the available splits in the \UUE folder using a filename of the MD5 hash root filename plus a .UUED (UUE Data) suffix. If good data is found it will be renamed and deleted on repair in any case.

When the .DUE file shows all parts have been successfully decoded, yDecode will join the multipart set and save it in the correct output folder automatically for verification (if available), but in the event of a problem the decoded parts can still be easily joined manually by a standard joiner utility if they are all present.

All parts (including the .DUE file) will be deleted when the final file is fully joined. So that incomplete multipart UUEs can be tracked between yDecode sessions, any remaining .DUE files are parsed on startup.

IT IS CRITICAL THAT THE .DUE FILES NOT BE ALTERED OR THE UUE WILL NOT BE JOINED CORRECTLY. The only possible useful alteration might be to add a known good multipart segment that has been separately downloaded and manually decoded then renamed (with the MD5 hash root name plus a split number suffix) to match the correct position in the set.