Support moved compressed rows in SAS data files #365

Open
hpoettker wants to merge 1 commit into WizardMac:dev from
hpoettker:moved-rows

@hpoettker hpoettker commented Apr 16, 2026

Introduction

This PR adds support for "moved" compressed rows in SAS data files.

I'm not aware of a public description of this feature, but I've investigated the hex dumps of SAS files that ReadStat currently cannot read, reverse-engineered the logic, and then validated the data as read against exports of the same data files produced by SAS.

I don't know the exact conditions that trigger the "moving" of rows, but the following seem to be necessary:

  • the data files are written with compression enabled
  • the data files are both the input and the output of a SAS procedure such as a data step

The technical term "moved row" is something that I've made up on the basis of what I've observed. The naming is up for discussion, but I'll drop the quotation marks in the text below.

Compression Types

The most widely known compression types that can be read in subheader pointers of SAS data files are

  • 0x00, indicating no compression of the linked content
  • 0x01, indicating that the linked content can be skipped
  • 0x04, indicating that the linked content contains a compressed row

With the feature of moved rows, there are three additional compression types:

  • 0x03, indicating the logical position of a row that is actually on a different page of the data file
  • 0x06, indicating the physical position of a row that is referred to by a 0x03 compression type subheader pointer
  • 0x0d, indicating subheader pointers that can be skipped similarly to 0x01 for currently unknown reasons

A speculative interpretation of this list of compression types is that the compression type in subheader pointers is actually a bitmap with the following meanings of the bits:

  • 0x01 - any data that the subheader pointer may directly refer to shall be ignored
  • 0x02 - the subheader pointer is related to rows that have been moved
  • 0x04 - the subheader pointer directly refers to a compressed row
  • 0x08 - unknown (but I've only observed it as part of the type 0x0d, which also matches the bit 0x01 and can thus be ignored)
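As an illustration of the speculative bitmap reading above, here is a minimal Python sketch (not ReadStat's C code) that splits each observed compression type into its bits. The names in `BIT_MEANINGS` and the function `decode_bits` are hypothetical labels chosen for this example only.

```python
# Hypothetical labels for the speculated bits; these are not official SAS terms.
BIT_MEANINGS = {
    0x01: "ignore",          # data directly referred to shall be ignored
    0x02: "moved",           # pointer is related to moved rows
    0x04: "compressed_row",  # pointer directly refers to a compressed row
    0x08: "unknown",         # only observed as part of 0x0d
}


def decode_bits(compression_type):
    """Return the labels of all bits set in a compression type value."""
    return [name for bit, name in BIT_MEANINGS.items() if compression_type & bit]


if __name__ == "__main__":
    for value in (0x00, 0x01, 0x03, 0x04, 0x06, 0x0D):
        print(f"0x{value:02x}: {decode_bits(value)}")
```

Under this reading, 0x03 decomposes into "ignore" plus "moved", and 0x06 into "moved" plus "compressed_row", which matches the roles described for these two types.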

Compression type 0x03

The typical subheader pointer in a SAS data file contains the following pieces of information:

  • an offset that refers to a position within the same page
  • a length that in combination with the offset delineates an area of the same page
  • the compression type
  • an additional flag

With compression type 0x03, the byte positions and lengths within the subheader pointer are the same but the meaning of the values is different:

  • the first value represents a page number (with the numbers starting at 1)
  • the second value represents a subheader pointer number (with numbers starting at 1)
  • the compression type field contains the value 0x03
  • the additional flag, whose meaning is not relevant here
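The reinterpretation of the two leading values can be sketched as follows. This is a Python illustration, not the PR's C code. The byte layout is an assumption that is not stated in this description: two integers followed by two single bytes and padding, 24 bytes in 64-bit files and 12 bytes in 32-bit files, as commonly reported in reverse-engineering notes on the format. The names `SubheaderPointer`, `parse_pointer` and `interpret` are hypothetical.

```python
import struct
from dataclasses import dataclass


@dataclass
class SubheaderPointer:
    first: int        # offset, or page number for type 0x03
    second: int       # length, or subheader pointer number for type 0x03
    compression: int  # compression type
    flag: int         # additional flag


def parse_pointer(raw, is_64bit=True, little_endian=True):
    """Unpack one subheader pointer (assumed layout, see lead-in)."""
    order = "<" if little_endian else ">"
    layout = "QQBB6x" if is_64bit else "IIBB2x"
    return SubheaderPointer(*struct.unpack(order + layout, raw))


def interpret(ptr):
    """Give the two leading values the meaning implied by the compression type."""
    if ptr.compression == 0x03:
        # Both numbers are 1-based according to the description above.
        return {
            "kind": "moved_row_reference",
            "page_number": ptr.first,
            "pointer_number": ptr.second,
        }
    return {"kind": "in_page_area", "offset": ptr.first, "length": ptr.second}


if __name__ == "__main__":
    raw = struct.pack("<QQBB6x", 7, 2, 0x03, 0)
    print(interpret(parse_pointer(raw)))
```

The same bytes are read in both cases; only the interpretation of the first two values depends on the compression type.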

The order of rows in a SAS data file is normally defined by the order in which a pass of the file encounters them. But a subheader pointer with compression type 0x03 defines only the logical position of the row in that order; the actual data is on a different page (as far as I can tell, one that is encountered later).

For subheader pointers of compression type 0x03, the previous and the next pointer will usually refer to neighboring areas within the same page.

Compression type 0x06

The reference from a subheader pointer with compression type 0x03 is always to a subheader pointer on another page that has the compression type 0x06.

A subheader pointer with compression type 0x06 does not represent any logical position of a row in the usual order of encounter. But the compressed row at the physical position that the pointer refers to can be read exactly like rows of compression type 0x04.
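The interplay of types 0x03 and 0x06 can be sketched on a toy in-memory model. This Python example is an assumption-laden illustration, not the PR's implementation: each page is modelled as a list of dicts, `row` stands in for an already decompressed row, and the two-pass approach is chosen only for simplicity. The function name `rows_in_logical_order` is hypothetical.

```python
def rows_in_logical_order(pages):
    """Return rows in logical order for a toy model of pages and pointers.

    Each pointer is a dict with a 'type' key, plus 'row' for types 0x04 and
    0x06, or 'page' and 'pointer' (both 1-based) for type 0x03.
    """
    # First pass: index the physical location of every moved row (type 0x06).
    moved = {}
    for page_no, page in enumerate(pages, start=1):
        for ptr_no, ptr in enumerate(page, start=1):
            if ptr["type"] == 0x06:
                moved[(page_no, ptr_no)] = ptr["row"]

    # Second pass: emit rows in the order in which pointers are encountered.
    rows = []
    for page in pages:
        for ptr in page:
            if ptr["type"] == 0x04:
                rows.append(ptr["row"])
            elif ptr["type"] == 0x03:
                key = (ptr["page"], ptr["pointer"])
                if key not in moved:
                    raise ValueError(f"0x03 pointer does not refer to a 0x06 pointer: {key}")
                rows.append(moved[key])
            # Types 0x01, 0x06 and 0x0d contribute no row at their own position.
    return rows


if __name__ == "__main__":
    pages = [
        [{"type": 0x04, "row": "a"}, {"type": 0x03, "page": 2, "pointer": 2}, {"type": 0x04, "row": "c"}],
        [{"type": 0x01}, {"type": 0x06, "row": "b"}, {"type": 0x04, "row": "d"}],
    ]
    print(rows_in_logical_order(pages))
```

In the toy data, row "b" is physically stored on page 2 but logically sits between "a" and "c" on page 1.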

Compression type 0x0d

I don't have a good explanation for this compression type. But just skipping subheaders with this type as one does with compression type 0x01 leads to the correct result when comparing the exports of ReadStat with those of SAS itself.

Implementation alternative

The implementation proposed in this PR respects the difference between the logical and the physical order of rows in a SAS data file, and replicates the order in which SAS itself presents the rows of a data file.

An alternative implementation that would be more efficient but would lose the faithfulness to the logical order of rows would be to

  • treat compression type 0x06 the same as 0x04
  • ignore compression types 0x03 and 0x0d just like 0x01

For many use cases, this would be good enough.
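A minimal Python sketch of this simpler alternative, on a toy model in which each page is a list of dicts and `row` stands in for an already decompressed row. The model and the name `rows_in_physical_order` are assumptions for illustration, not ReadStat code.

```python
def rows_in_physical_order(pages):
    """Single pass: read 0x04 and 0x06 rows where they are physically stored."""
    rows = []
    for page in pages:
        for ptr in page:
            if ptr["type"] in (0x04, 0x06):
                rows.append(ptr["row"])
            # Types 0x01, 0x03 and 0x0d are skipped.
    return rows


if __name__ == "__main__":
    # Row "b" logically belongs between "a" and "c" but is stored on page 2.
    pages = [
        [{"type": 0x04, "row": "a"}, {"type": 0x03, "page": 2, "pointer": 2}, {"type": 0x04, "row": "c"}],
        [{"type": 0x01}, {"type": 0x06, "row": "b"}, {"type": 0x04, "row": "d"}],
    ]
    print(rows_in_physical_order(pages))
```

All rows are still read exactly once, but the moved row "b" appears after "c" instead of before it.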

Validation

The change proposed in this PR works as expected on the SAS data files that I have access to. But I can share neither the data files themselves nor the SAS code that produces them.

As stated in the introduction, I don't know what precisely triggers SAS to move rows to another page. So I'm also unable to provide generic SAS code that would create a toy example of a SAS data file with moved rows.

I've implemented the proposed change with the goal of introducing minimal risk. It should not affect any SAS data file that ReadStat currently reads successfully, and it contains validations against all constraints that I've observed.

I'm opening this PR in the hope that either someone from the community steps forward with supporting information on moved rows in SAS data files or the maintainer takes a leap of faith.

If there is any kind of follow-up question on this proposed change, I'd be happy to engage.
