SAS® Program Efficiency, Large Datasets and System Options

Training Videos:

Basic simple program efficiency steps as well as more complex strategies and techniques are included in this section. Before implementing an upgrade, make sure to benchmark measurements. In addition, SAS® system option papers are also included. See also SAS Business Intelligence and ETL, SAS Version 9.2 for hashing, SAS Certification exam and Proc SQL. See Compare and Conquer SAS Programming Techniques.

Before deciding to use SAS Indexes, it makes sense to evaluate the advantages and disadvantages. In general, over 80% of SAS users do not use SAS indexes. Keep in mind that SAS Indexes are for very specialized purpose and come with some conditions and at a price. You must first know your data very well and the types and quantity of daily data transactions to make a decision.

USE SAS Indexes if .. (See Basic paper, FAQ paper)	DO NOT Use SAS Indexes if ...
1. You want to reduce processing time by accessing one dataset using the direct access method instead of the normal sequential access method. Indexes save time by allowing SAS to read starting from the middle of the dataset instead of starting from the first record and reading the entire dataset when returning matching conditions.	1. You do not want to maintain SAS indexes since using indexes creates a separate file which gets deleted if the dataset is recreated without recreating the index.
2. You plan to index on few key variables in one dataset which are used for subsetting records, for example. Simple index is indexing one variable, which is the most popular, and complex index is two or more variables, where more than two variables is rare. Datasets can have multiple simple and complex indexes. Simple index files are the same name as the key variable and complex index file names need to be specified.	2. You do not have at least 10% of input dataset size as extra storage to save the separate index file which is the dataset name with .SAS7BNDX extension for simple indexes.
3. You expect your data, or indexed variables, to be stable and not change often since this requires recreating the index file.	3. You expect to select more than 25% of all records when subsetting since this makes indexes very inefficient. Ideally, indexed values will be less than 10% of the records. GENDER, for example, is a poor index variable.
4. You are processing VERY LARGE datasets, such as with over 10,000 records, on repetitive or transactional type processing, grouping, sorting for merging or subsetting datasets. Generally indexed variables are using these statements: BY, WHERE, SET/MODIFY and KEY option in DATA steps or SAS Procedures.	4. Your dataset is small since without index will be faster. Using indexes on small datasets is overkill.
5. You continue to use PROC SORT to run up to 50% faster. Using SAS Index does not prevent you from using PROC SORT.	5. If you are using Hash objects since using Hash objects is about 10% faster than indexes.
6. You are fine with the required ascending sort order of the indexed variables. Using DESCENDING and NOTSORTED options in PROC SORT will not use indexes.	6. You plan to use selected SAS Procedures such as PROC APPEND since using indexes is not compatible with these procedures.

SAS Procedure examples for creating SAS Indexes	Description
data sales.bighits (index=(cdnumber / unique)); set olddata.oldhits; < additional SAS statements > run;	1. Using the INDEX DATA set option in a DATA step to create a simple index of CDNUMBER variable with same name file on the SALES.BIGHITS dataset. One or more options for creating index include: UNIQUE and NOMISS. Note that the INDEX DATA set option can also be applied in PROC SORT, for example. This makes it easier for recreating indexes whenever the dataset sets updated.
data sales.bighits (index=(numname=(cdnumber artistname) / nomiss)); set olddata.oldhits; < additional SAS statements > run;	2. Using the INDEX DATA set option in a DATA step to create a complex index named NUMNAME which is based two variables CDNUMBER ARTISTNAME on the SALES.BIGHITS dataset. One or more options for creating index include: UNIQUE and NOMISS. Note that the INDEX DATA set option can also be applied in PROC SORT, for example. This makes it easier for recreating indexes whenever the dataset sets updated.
proc sql; create unique index cdnumber on sales.bighits; quit;	3. Using PROC SQL to create a simple index with the same name file CDNUMBER on the SALES.BIGHITS dataset. One or more options for creating index include: UNIQUE and NOMISS.
proc sql; create index numname on sales.bighits(cdnumber artistname); quit;	4. Using PROC SQL to create a complex index named NUMNAME which is two variables CDNUMBER ARTISTNAME on the SALES.BIGHITS dataset. One or more options for creating index include: UNIQUE and NOMISS.
proc datasets library=sales; modify bighits; index create cdnumber / unique; run;	5. Using PROC DATASETS to create a simple index with the same name file CDNUMBER on the SALES.BIGHITS dataset. One or more options for creating index include: UNIQUE and NOMISS.
proc datasets library=sales; modify bighits; index create numname=(cdnumber artistname) / nomiss; run;	6. Using PROC DATASETS to create a complex index named NUMNAME which is two variables CDNUMBER ARTISTNAME on the SALES.BIGHITS dataset. One or more options for creating index include: UNIQUE and NOMISS.
data work.direct; do age=13,14; do until (eof); set class key=age end=eof; if _IORC_=0 then do; /* 0 indicates a match was found / put _all_; output; end; else _ERROR_=0; / if no match, reset the error flag and continue / end; end; stop; / stops future searching on ages 13 and 14 have been found */ run;	7. Using the DATA Step to use a simple index with the KEY= option for direct access method of the AGE key variable. SAS automatically knows an simple index exits on AGE. For composite index, such as COUNTRY_STATE, SAS can select records based on two or more variables. This is similar to using the POINT= option with record #.
data work.driver; age=13; output; age=14; output; run; data work.direct; set work.driver; /* <- sequential access & implicit loop / do until (eof); / <- explicit loop / set work.class key=age end=eof; / <- direct access / if _IORC_=0 then do; / 0 indicates a match was found / put _all_; output; end; else _ERROR_=0; / if no match, reset the error flag and continue */ end; run;	8. Using DATA Step to use a simple index with the KEY= option with values in another dataset using direct access method of the AGE key variable.
data illinois; set indexlib.prodindx (idxname=state); where state eq 'Illinois' and product in ('BED' 'DESK'); run;	9. User has option to override SAS in using the default index using the IDXNAME= DATA set option. In WHERE statements, users also have the option to force or turn off using SAS index with the IDXWHERE= option.

Balance of Efficiency Resources
Reducing CPU often results in reducing Data Storage and Input/Output but may increase Memory usage
80% improvements can be expected with simple programming techniques 20% improvements can be expected with fine tuning, optimal combination of resources, index, system options, hardware upgrades, datawarehouse strategies such as summary-level datasets

Programming Techniques - using Proc SQL for multi-tasking or using PROC DATASETS to modify data set structure
CPU Time - using KEEP or DROP to limit number of variables or PROC SORT options such as SORTEDBY= and PRESORTED options
Data Storage - using LENGTH to reduce variable size or SAS View instead of a SAS Dataset
Input/Output - using WHERE data set option to limit number of records
Memory - reading only the records, variables or datasets required

DATA Step vs. Proc SQL Differences

In general, simple merges are faster with DATA Steps and complex merges are faster with Proc SQL. This is because DATA Step uses sequential reads while Proc SQL first creates a Cartesian Product and then subsets the dataset.

Dataset Factors - Size, Static, Relationships between datasets, Density of Matches, and Outside SAS Access.

Operating System - Windows and Unix (generally similar), MVS and VMS (DATA step is generally faster).

One-Stop Reporting - Proc SQL is faster than 2 Proc Sorts, DATA merge, 1 Proc Sort and 1 Proc Print.

Improve Proc SQL by up to 2X - When joining tables, process pre-sorted tables as compared to processing unsorted tables.

Animated Guide Speed Merges: resource use by common procedures, Russ Lavery

Proc SQL Advantages

Multi-tasking and sorting with one pass instead of multiple passes

SELECT name, sex                 (Memory) FROM sashelp.class           (CPU with SORTEDBY= option when joining tables)
WHERE sex = 'F'                    (Input/Output)
GROUP BY sex                      (CPU)
HAVING weight > avg(weight) (Input/Output)
ORDER BY name;                    (CPU)

SQL Pass-Through for native database processing

Validate syntax before executing

Better programming style with concise and self-documenting code

(SORTEDBY = ) dataset option to take prevent re-sorting dataset

Utilize dictionary tables to access summary level data

SAS® Paper Dear Miss SASAnswers: A Guide to Efficient PROC SQL Coding

SAS® Savvy Resource SAS Programming Efficiency Factors Summary Sheet

Compare Program Efficiency with DATA Step

SAS System Options

Working the System: Our Best SAS Options, Patrick Thornton, Luliana Barbalau

Save/Retrieve SAS Systems Options Are Your Friends, Edward Heaton