DATA Step and MERGE     



Training Videos:

Best of Both Worlds: Data Step and Proc SQL Joins (Proc SQL Blog

WHERE vs. IF Statements: Knowing the Difference in How and When to Apply 

Which comes first PUT() or the INPUT()? (SAS Functions Blog Summary Sheet)

Reading and Writing to Excel files (Excel Blog)

Which Direction is your SAS Arrays? 

Lost without SAS Indexes?  (SAS Options Blog)

Five Habits of Highly Effective SAS Programmers (Compare and Conquer SAS Programming Blog)

What's In a Variable Name? (Data Step/Merge Blog)


Along with converting text files to SAS® datasets, the DATA step has the power to create and manage variables.  For example in the pharmaceutical industry, the DATA step is used to post-process RTF files. For customized reports, there is a section on DATA _NULL_ reporting.

Below is a collection of SAS® papers categorized by GeneralData Management, Arrays, and Merging Datasets.  See also Access to External filesSAS Certification, SAS Functions/Informats and Formats, SAS Dates and Proc SQL.  Download Merge Datasets Worksheet.  See also DATA Step SAS examples, Common FAQ and Training Videos on selected SAS topics.  See also on-line reference one and two.


Assigning Special Character Values to Character Variables

* List Input Format Modifier;

data work.address;

* assigns missing to all variables;

input name: $10. address & $30. birthdt: date9.;

format birthdt mmddyy10.;

* assigns raw values to variables;

*        1         2         3;

*23456789012345678901234567890123456789;

cards;

Tim      123 My Street       12MAR1970

Michelle 123 Other Street    22JAN1973

;

run;

Data Management Techniques (DATA Step Programming)

1. SET options (SAS Syntax), Dataset options (SAS Syntax)

2. Merging Datasets / Lookup Techniques (1:1, 1:M)

3. FIRST. and LAST. values with multiple BY variables (SAS Syntax)

4. Calculations based on differences between one or more variables (MAX(), RETAIN)

5. Calculations based on differences between one or more variables across rows

______________________________________________

1. Types of Variables

 
Variable Types

Value 
 
Variables from SET <Dataset_name>, ex. SET SASHELP.CLASS


 Retained
 
New <Variable_name>*, ex. BMI=WEIGHT/HEIGHT;

 
Calculate or '.' unless RETAIN 
 
Automatic, ex. _N_, _ERROR_

 
System Assigned
 
BY <Variable_name>*, ex. BY SEX;

 
FIRST., LAST. = 1 or 0

 SUM statement*, ex. RETAIN COUNT2; COUNT2=SUM(COUNT2 + 1) or COUNT2 + 1;
 
i.e. RETAIN COUNT2; COUNT2=SUM(COUNT2, 1);
 
IF condition to assign variable*, ex. IF FIRST.SEX THEN SEXGRP = 1;

 
'.' unless assigned
 
IF/WHERE condition to subset dataset, ex. IF FIRST.SEX; WHERE SEX = 'M';

 
Subset dataset
 * Applicable to both Do-Loops and Arrays  

______________________________________________

 Append/Set/Contatenate                Interleave                        Merge w/, w/o By




2. Merging Datasets/Lookup Techniques (See also SAS 9.2)

 (Generally horizontal as 1:1 or 1:M, Use PROC APPEND for vertical joins)

 Name DEMOG  VITALS 
 Structure

1 : 1 

1 : M 

 # of OBS

10 

20 

 Key Variables

 PATNO

PATNO, VISITDT, TIME 

* Instead of PROC SQL, DATA step is a better option to standardize variable length as a pre step before merging common structure datasets;

data class;

 length name $18.;

 set class (rename=(name=org_name));
 name = trim(left(org_name));
 drop org_name;
run;

______________________________________________


3. FIRST. and LAST. values with multiple BY variables (SAS Syntax, SAS Blog)

 REGION BRANCH  FIRST.REGION  LAST.REGION  FIRST.BRANCH  LAST.BRANCH 

  north 

  100 

 1 

  0 

  1 

 0 

 north

 100

 0

 0

 0

 1

 north

  105 

 0

 0

 1

 1

 north

 110

 0

1

 1

 1

 south

 110

 1

 1

 1 

 1

* Create by patient processing and summarize to one record per patient;

proc sort data=dada; by patient; run;

* Summarize to one record per patient;

data dada1;

 length dada_startdt dada_stopdt 8.;

 format dada_startdt dada_stopdt yymmdd10.;

 * Retain holding variables;

 retain hdada_startdt hdada_stopdt ;

 set dada;

 by patient;

 * Store first record for patient;

 if first.patient then do;

 hdada_startdt = start_date;

 hdada_stoptdt = stop_date;

 end;

 * Update holding values if larger or smaller values;

 if start_date < hdada_startdt then hdada_startdt = start_date;

 if stop_date > hdada_stoptdt then hdada_stoptdt = stop_date;

 * Save to final variables and output record;

 if last.patient then do;

 dada_startdt = hdada_startdt;

 dada_stoptdt = hdada_stoptdt;

 output;

 end;

 keep patient dada_startdt dada_stoptdt;

run;

proc sort data=dada1; by patient; run;

* QC using proc means for numeric vars and proc freq (obs=20) for char vars;

* Another Example;

proc sort data=sashelp.class out=class;
 by sex name;
run;

data a;
 set class;

 * By variables - first level is sex, second level is name;
 * Sex indicator flags first and last sex records before it changes;
 * Name is unique variable so will equal first and last will always equal 1;
 * Combination of sex value and sex indicator can be used to identify records;
  by sex name;

 * RETAIN and SUM in one statement;
 count + 1;

 * By variable indicators;
 fsex=first.sex;
 lsex=last.sex;
 fname=first.name;
 lname=last.name;

 * Resets count to 1 for each new sex;
 * last.sex will contain the total count of sex;
 if first.sex then count=1;
run;

* To create line counter within each dosegrp, use the code below;

proc sort data=comb; by descending pg doseord dosegrp siteid subjid visitdt visit unsseq ;run;

data comb;
  retain page2 page3;
  set comb;
  by descending pg doseord dosegrp siteid subjid visitdt visit unsseq ;

  * create page2 variable for each new page.  Increase by one for each
new dosegrp or mod of page3 - decrease the number to prevent extra
rows in next page;
  * create page3 variable as the line counter within dosegrp.  It gets
reset for each new dosegrp;

 * make sure NOT to include statements to assign page2 or page3 to missing;
  if _n_ = 1 then do;
   page2=1; page3=1;
  end;
  if first.dosegrp then page3=1;

  page3 + 1;
  if first.dosegrp or mod(page3, 22) = 0 then page2 + 1;
run;

______________________________________________

4. Calculations based on differences between one or more variables (MAX(), RETAIN)

PATNO BWEIGHT WEIGHT MAX(BWEIGHT, WEIGHT) Comments

101

100

100

100 Same weight as baseline.

101

100

120

120 Maximum of baseline or current weight.

______________________________________________

5. Calculations based on differences between one or more variables across rows (Apply caution when combining LAG() or DIF() functions with conditions with FIRST. or LAST. variables)

PATNO WEIGHT LAG(WEIGHT) DIF(WEIGHT) Comments - See SAS Paper, Syntax

101

100

.

. Missing

101

110

100

10 Previous record's value, useful to calculate change from previous value.  DIF() allows difference from previous value without using the LAG() function.

data temp.lab2;
 retain auc2;
 set temp.lab2;

 by subjid lbtestid visitdtn visitlbl timept;

 if visitlbl in ('Pre-Dosing DEX1' 'Week 7 Day 1') and result > . and
 rpredex1 > . then abchgdex=result - rpredex1; else abchgdex=.;

 presult=lag(result);
 ptimept=lag(timept);

 if first.visitlbl then do;
  presult=.; ptimept=.; auc2=.;
 end;

 if visitlbl in ('Pre-Dosing OGTT' 'Week 7 Day 3') and result > . and
 presult > . then auc=(result + presult)*(timept - ptimept)/2;
 auc2=sum(auc2, auc);

 if abchgdex > . and rpredex1 > . then pchgdex=(abchgdex/rpredex1)*100;
run;

* Create macro variable containing quoted characters;

%let outfilename=t_demog;
data xxx;
 tf_txt='title'; tf='t'; tfnum=1;
 alltext=left(trim(tf_txt)) || '3   ' || " %nrbquote(")" || ' ' || '
Page ~{pageof}' || "%nrbquote(";)" ;

 call symput("&outfilename" || '_' || tf || strip(put(tfnum,2.)), alltext);
run;

%put &t_demog_t1;
&t_demog_t1
proc print;run;

* Two Methods to Delete External Files;

%let report=c:\this is ;
filename remove pipe "del ""&report.myfile.emf""" ;

data _null_;
 file remove ;
 stop ;
run;

filename remove clear ;

FILENAME MyFile "C:\OutputFiles\File01.xls"  ;

  DATA _NULL_ ;
    rc = FDELETE('MyFile') ;
  RUN ;

FILENAME MyFile CLEAR ;

RTF file - Data Step post-processing;
* Code adds a double-line frame border (where frame=hsides);

filename in_rtf "t_proc_report.rtf";
filename post_rtf "t_proc_report_post.rtf";

data _null_;
  infile in_rtf lrecl=4096 recfm=v length=ln;
  file post_rtf lrecl=4096 recfm=v;

  input;

  if ln gt 23 then do;
    if substr(_infile_,1,22)='\clbrdrt\brdrs\brdrw15' then do;
      _infile_='\clbrdrt\brdrdb\brdrw15'||substr(_infile_,23);
      ln+1;
    end;
    if substr(_infile_,1,22)='\clbrdrb\brdrs\brdrw15' then do;
      _infile_='\clbrdrb\brdrdb\brdrw15'||substr(_infile_,23);
      ln+1;
    end;
  end;

  put _infile_;
run;

Beginner Paper The Secret Life of a Data Step, Thomas Miron

HOW Paper A Hands-On Introduction to SAS® DATA Step Programming, Debbie Buck

 

______________________________________________

General Programming

1. The Program Data Vector As an Aid to DATA step Reasoning, Marianne Whitlock

2. How to Think Through the SAS DATA Step, Ian Whitlock

3. DATA Step Essentials, Neil Howard

4. More Tales from the Help Desk: Solutions for Simple SAS Mistakes, Bruce Gilsen

5. Get it together: Combining data with SAS® MERGE, UPDATE, and SET, Mel Widawski

6. The Essence of DATA Step Programming, Arthur Li

7. Best Practices: Use Coding Shortcuts to Avoid Typsos and Add Flexibility, Mary F. O. Rosenbloom, Kirk Paul Lafler

8. Programming Idioms Using the SET Statement, Jack E. Fuller

9. The SET Statement and Beyond:Uses and Abuses of the SET Statement, S. David Riba [END= NOBS= POINT=, IF O THEN]

10. Difference between UPDATE and MODIFY.

11. Combining Summary Level Data with Individual Records, Frank Ivis

12. Essentials of PDV: Directing the Aim to Understanding the DATA Step!, Arthur Xuejun Li

13. The Use and Abuse of the Program Data Vector, Jim Johnson

14. THE FUNDAMENTALS OF DATA STEP PROGRAMMING I: THE ESSENCE OF DATA STEP PROGRAMMING, Arthur Li

15. SAS® Data Views Simply Stated, James C. Stokes

16. Introduction to DATA Step Views, Jeffrey Polzin

17. The Importance of the IN:, Anders Skollermo [In Dataset option]

18. Overcome Programming Problems with SAS® Automatic Variable N  [_N_]

19. If _n_=1 then SET, Paul McDonald

20. Understanding the SAS® DATA Step and the Program Data Vector, Steven First

21. What You’re Missing About Missing Values, Christopher J. Bost [Call Missing()]

22. Missing Values in SAS, Magnus Mengelbier [Call Missing]

23. Using SAS® Variable Lists Effectively, Howard Schreier

24. A Quick View of SAS® Views, Elizabeth Axelrod

25. Programming with the KEEP, RENAME, and DROP Data Set Options, Stephen Philp

26. Renaming SAS® Variables, Imelda C. Go

27. If You Need These OBS and These VARS, Then Drop IF, and Keep WHERE, Jay Iyengar

28. Data Step Internals: Compile and Execute, Neil Howard

29. SAS DATA Step Merge – A Powerful Tool, Dalia Kahane

______________________________________________

Obstacles to Appending Datasets


Data Management Techniques

1. Manipulating SAS Datasets: An Introduction, Richard Alonso

2. An Introduction to Perl Regular Expressions in SAS 9, Ron Cody

3. How to Use the WHERE Statement, Juliana Meimei Ma

4. MISSOVER, TRUNCOVER, and PAD, OH MY!! or Making Sense of the INFILE and INPUT Statements, Randall Cates

5. IFC and IFN Functions: Alternatives to Simple DATA Step IF-THEN-ELSE, SELECT-END Code and
PROC SQL CASE Statements, Thomas E. Billings

6. Tips for Manipulating Data, Marge Scerbo

7. Matching Data Using Sounds-Like Operators and SAS® Compare Functions, Amanda Roesch

8. The Bylaws of By Group Processing, Tracee Vinson-Sorrentino

9. Analyzing Multiple Records per Subject With BY Processing and RETAIN.EasierThanYouThink…., Carter Sevick

10. READY, SET, RETAIN, AND THEN MAYBE RESET, Lisa Fine

11. Exploring RETAIN statement, Mingxia Chen [Proc SQL]

12. Slicing and Dicing the SAS® Data Set, William Murphy [POINT=]

13. Performing Multiple Statements for Each Record in a SAS® Data Set, Edward Moore

14. The Data Set statement: What are my options?, Mark Kirkham

15. Snapshots from Hell: Practical Issues in Data Warehousing, Tyson Lee

16. %whatChanged: A Tool for the Well-Behaved Macro, Frank DiIorio

17. Snapshot SNAFU: Preventative Measures to Safeguard Deliveries, Spencer Childress

18. Colon(:)izing My Programs, Jinson Erinjeri [Shorthand]

19. Multiple Set Statements in a Data Step: A Powerful Technique for Combining and Aggregating Complex Data, Renu Gehring

20. SAS® DATA Step Merge – A Powerful Tool, Dalia Kahane [Common non-by variables]

21. Beyond IF THEN ELSE: Techniques for Conditional Execution of SAS® Code, Joshua Horstman [Mutually Exclusive]

22. That Mysterious Colon (:), Haiping Luo [IN:]

23. The Power of RETAIN Statement in Clinical Trial Programming by Examples, Baoxian Lan, Daniel Tsui, Shifang Liu

24. Finding Duplicates Using SAS’ FIRST. And LAST. Expressions, Clarence Jackson

25. A Closer Look at FIRST.var and LAST.var, Marge Scerbo

26. Counting the Ways to Count in SAS, Imelda Go

27. How Many Observations Are In My Data Set? Jack Hamilton

28. Preventing Data Loss when Combining SAS Datasets John Cantrell

29. Tips, Traps, and Techniques in BASE SAS for vertically combining SAS data sets, Jay Iyengar

30. Effects of the KEEP, DROP and RENAME Statements and Corresponding Data Set Options, John Whittington

31. Data Set Options: Beyond DROP, KEEP, RENAME, and WHERE, Sarah Woodruff, Ed Heaton

32. How and When to Use Where, Meimei Ma, Sandra Schlotzhauer


WHERE syntax and operators

WHERE vs. IF Statements: Knowing the Difference in How and When to Apply - PaperPresentation, Summary Sheet, SAS Institute Tip

______________________________________________


Do Loops and Arrays for grouping related variables (See SAS Knowledgebase)

 
Types of SAS Arrays
(3 x 2 x 2 x 4 x 2 x 2 = 192 arrays)
 
SAS Example - Great for referencing list of variables or constant values or both for standard process as a concise alternative to hard coding.  Very often used with do-loops.

Example of usage include updating existing variables, renaming variables or creating new variables. Can use PROC SQL to create macro variable containing a list of variable names or values. 

 Common Errors

1) Subscript out of range

2) Array has mixed variable types

3) Invalid expression

 Common SAS Solutions (See SAS Debugging)

1) Confirm array size matches number of elements and do-loop ending condition

2) Confirm all elements are numeric or character variables

3) Confirm expression within do-loop is valid

Array or New (useful for internal processing)

Purpose 1 = Dataset Variable Name

Purpose 2 = Constant value references 

Array or new name: array test(4); * test1 test2 test3 test4;

* () or [] or {} can be used;

Dataset variable names: array sales(4) dept month year amount;


Constant values: array scr(4) s1 s2 s3 s4 (20 40 50 70);
One (most popular)

Multi-Dimensional (Row, Column)

One Dimensional: array scr(4) s1 s2 s3 s4 (20 40 50 70);

Multi-Dimensional: array resp(2,5) r1c1-r1c5 r2c1-r2c5;
*r1c1 r1c2 r1c3 r1c4 r1c5 r2c1 r2c2 r2c3 r2c4 r2c5;
Numeric (default)

Character

Numeric: array scr(4) s1 s2 s3 s4 (20 40 50 70);

Character: array chr(3) $ c1 c2 c3 ('x','y','z');
Variable List notation (sequence with starting and ending #,

variable position stored in dataset,

all character or

all numeric

array mis q1-q10;

 

array atod a -- d;

 

array days (*) _CHARACTER_;

 

array sales(*) _NUMERIC_,;

Implicit (name, all elements)
  
Explicit

Implicit: array mis q1 q2 q3 q4;

Explicit: array sales(4) dept month year amount;             

            array sales(*) dept month year amount;

Permanent


Temporary

Permanent: array scr(4) s1 s2 s3 s4 (20 40 50 70);
* Remember to drop array variables;

Temporary: array scr(4) _TEMPORARY_ s1 s2 s3 s4 (20 40 50 70);


Dynamic reference through do-loop
 and index value
array sales(*) s1-s3;

do J = 1 to dim(sales);

 if sales(J) > 200 then output;

 * Short hand for the following three statements;
 * if s1 > 200 then output;
 * if s2 > 200 then output;
 * if s3 > 200 then output;
end;

This DO OVER loop can be
 used with an array when indexing of the array is not needed.

data test;

 set temp;

 array nvars _numeric_;  * References all numeric variables;

 do over nvars;

  if nvars > 3 then nvars = .; 

 * Repeat statement for each numeric variable to assign as missing if value is greater than 3;

 end;

run;

Of Operator

array gnp (*) x y z;

sumgnp = sum(of gnp(*)); is the same as below

sumgnp = sum(x,y,z);

<<<< Example of Do Loop with Arrays

Webinar: Doing More with Loops and Arrays, Art Carpenter, SAS Paper, HOW, SET Example

SAS Tip Collapsing observations within a BY-Group into a single observation (when data set has 3 or more variables)

See SAS Macro Programming for processing list of macro variables as arrays. 

1. Arrays by example, Diana Suhr

2. Try Your Hand at Arrays, Diana Suhr

3. Arrays Made Easy: An Introduction to Arrays and Array Processing, Steve First and Teresa Schudrowitz

4. The Fundamentals of DATA STEP Programming II: Looping, Arthur Li

5. Get a Grip on Arrays in just 50 Minutes!, Arthur Li

6. Get the Scoop on the Loop: How Best to Write a Loop in the DATA Step, Arthur X. Li

7. Arrays – Data Step Efficiency, Harry Droogendyk

8. How to Use ARRAYs and DO Loops: Do I DO OVER or Do I DO i?, Jennifer L Waller

9. Intro to Arrays: A Grad Student “How-To” Paper, Elisa L. Priest1, Jennifer Harper

10. Advanced Array Applications in Clinical Data Manipulation, Zaizai Lu, David Shen [Duration]

11. Using SAS® Arrays to Manipulate Data, Ben Cochran [Multi-dimensional array]

12. Fun with Fancy Arrays, Mary McDonald [Multi-dimensional array]

13. THE MANY WAYS TO EFFECTIVELY UTILIZE ARRAY PROCESSING, Arthur Li

14. SAS ® ARRAYS: A BASIC TUTORIAL, Bruce Gilsen

15. An Introduction to SAS Arrays, Andrew Kuligowski, Lisa Mendez [Beginner]

16. %DO Loop – a Simple Dynamic Programming Technique, Yunchao

17. Using Do Statements, Links, and Arrays Jimmy DeFoor

18. Do Which? Loop, Until or While? A Review Of Data Step And Macro Algorithms, Ronald Fehd

19. Understanding and Controlling Your Do-Loops, Sarah Woodruff, Toby Dunn

20. Hip Hip ARRAY…Expanding and Modifying Records With an ARRAY, Jeanina Worden, Lauren Shinaberry

21. Colonoscopy for the SAS® Programmer, William Murphy

______________________________________________

Merging Datasets by Key Variables (UNC Blog)

1. Advanced MATCH-MERGING: Techniques, Tricks, and Traps, Malachy J. Foley

2. Danger: Merge Ahead! Warning: BY Variable with Multiple Lengths!, Bob Virgile

3. Countdown of the Top 10 Ways to Merge Data, David Franklin

4. Anatomy of a Merge Gone Wrong, James Lew, Joshua Horstman

5. Who are You? Use of Soundex and Merge to Cross-Walk one Dataset to Another, Misty Johnson

6. A Many to Many Merge, Without SQL?, David Franklin 

7. Many-to-Many Merges in the DATA Step, Ed Heaton

8. DATA Step Merging Techniques: From Basic to Innovative, Art Carpenter [Hash]

9. S-M-U (Set, Merge, and Update) Revisited, Andrew T. Kuligowski

10. How Do You Use Look-up Tables? Philip R Holland

11. The Power of the BY Statement, Paul Choate, Toby Dunn

12. Using IN:( ) to Code Character Comparisons with Criteria Having Different Lengths, Paul Grant [In Operator, Proc SQL]

13. Safely Merging Many Datasets, Taylor Young [Common non-by variables]

14. Set, Match, Merge … Don’t You Love SAS®, Peter Eberhardt, Ying Liu [HOW]

15. Merging Multiple Observation Data Sets Together, Tim Kelly

16. Please don't Merge without By !!, Monal Kohli

17. Multiple Set Statements in a Data Step: A Powerful Technique for Combining and Aggregating Complex Data, Renu Gehring

18. So You Think You Can Combine Data Sets?, Christopher Bost [Differences in Type, Length, Label, Format]

______________________________________________

Data _NULL_Reporting, See also Proc Report

1. Data Step Reporting, Marge Scerbo

2. A Data Null Reporting Method, Frank Senk

3. How to Go From SAS® Data Sets to DATA NULL or WordPerfect Tables, Anne Horney

4. PROC REPORT or DATA _NULL_: Which to Use and How to Make it Easier, David W. Carr, Lanie Quintana, Akari Yamagami

______________________________________________

The Little SAS book (pdf) (examples)

Download SAS programs from Handbook or SAS® Data Step Programming 

Powered by Wild Apricot Membership Software