Code snippets
Counting rows and unique persons
Counting number of rows in a dataset
Standard SQL / SAS Proc SQL | SAS Data step | R (Base R) | Stata |
select count(*) from dset; | Use Proc SQL | nrow(dset) | count |
Counting number of unique persons
Standard SQL / SAS Proc SQL | SAS Data step | R (Base R) | Stata |
select count(distinct id) from dset; | Use Proc SQL | length(unique(dset$id)) # or length(unique(dset[["id"]])) | by id, sort: gen nvals = _n == 1
count if nvals |
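As a quick check of the SQL snippets above, the following sketch runs both counting queries against an in-memory SQLite database using Python's built-in sqlite3 module. The table name dset and column id follow the snippets; the example rows are made up for illustration.

```python
import sqlite3

# In-memory database with a small example table (names follow the snippets above).
con = sqlite3.connect(":memory:")
con.execute("create table dset (id integer, var text)")
con.executemany("insert into dset values (?, ?)",
                [(1, "a"), (1, "b"), (2, "a"), (3, "c")])

# Counting number of rows in a dataset
n_rows = con.execute("select count(*) from dset").fetchone()[0]

# Counting number of unique persons
n_ids = con.execute("select count(distinct id) from dset").fetchone()[0]

print(n_rows, n_ids)  # 4 rows, 3 unique ids
```

The same logic maps directly onto the other tools: nrow() versus length(unique()) in R, and count versus counting first-in-group rows in Stata.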
Basic Joins
Type of Join | Standard SQL / SAS Proc SQL | SAS Data step | R (Base R) | Stata |
Inner Join | create table mergeddata as select d1.id, d2.var from dset1 d1 inner join dset2 d2 on d1.id = d2.id; /* or */ create table mergeddata as select d1.id, d2.var from dset1 d1, dset2 d2 where d1.id = d2.id; | data mergeddata; merge dset1 (in = d1) dset2 (in = d2); by id; if d1 and d2; run; | mergeddata <- merge( dset1, dset2, by="id" ) | merge 1:1 id using "c:\data\dset2.dta", keep(match) |
Left Join | create table mergeddata as select d1.id, d2.var from dset1 d1 left join dset2 d2 on d1.id = d2.id; | data mergeddata; merge dset1 (in = d1) dset2 (in = d2); by id; if d1; run; | mergeddata <- merge( dset1, dset2, by="id", all.x=TRUE ) | merge 1:1 id using "c:\data\dset2.dta", keep(master match) |
Full Join | create table mergeddata as select d1.id, d2.var from dset1 d1 full outer join dset2 d2 on d1.id = d2.id; | data mergeddata; merge dset1 (in = d1) dset2 (in = d2); by id; if d1 or d2; run; | mergeddata <- merge( dset1, dset2, by="id", all=TRUE ) | merge 1:1 id using "c:\data\dset2.dta", keep(master match using) |
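The join semantics in the table above can be sketched with Python's built-in sqlite3 module. Table and column names (dset1, dset2, id, var) follow the snippets; the rows are invented for illustration. Because FULL OUTER JOIN is only available in recent SQLite releases, the full join is emulated here as a left join unioned with the unmatched right-hand rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table dset1 (id integer);
create table dset2 (id integer, var text);
insert into dset1 values (1), (2), (3);
insert into dset2 values (2, 'x'), (3, 'y'), (4, 'z');
""")

# Inner join: only ids present in both tables.
inner = con.execute(
    "select d1.id, d2.var from dset1 d1 inner join dset2 d2 on d1.id = d2.id"
).fetchall()

# Left join: all ids from dset1, with NULL where dset2 has no match.
left = con.execute(
    "select d1.id, d2.var from dset1 d1 left join dset2 d2 on d1.id = d2.id"
).fetchall()

# Full join, emulated for older SQLite versions that lack FULL OUTER JOIN.
full = con.execute("""
select d1.id, d2.var from dset1 d1 left join dset2 d2 on d1.id = d2.id
union
select d2.id, d2.var from dset2 d2 where d2.id not in (select id from dset1)
""").fetchall()

print(sorted(inner))  # [(2, 'x'), (3, 'y')]
print(sorted(left))   # [(1, None), (2, 'x'), (3, 'y')]
print(sorted(full))   # [(1, None), (2, 'x'), (3, 'y'), (4, 'z')]
```

The three result sets correspond to keep(match), keep(master match), and keep(master match using) in Stata, and to the default, all.x=TRUE, and all=TRUE arguments of merge() in base R.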
Aggregating
A common task when dealing with data stored at multiple levels, for instance multiple admissions per person, is aggregating. In data terms, this refers to collapsing many values associated with a common identifier to a single value.
This typically involves an aggregation function that reduces many values to a single value. Common aggregation functions include numeric calculations such as a count, sum, maximum, minimum or average (typically a mean). In some programs, calculations that return a TRUE or FALSE value can also be used.
Consider the process of making a patient-level flag for deceased patients.
The logic to do this would be something like:
1. flag all rows in the episode table that are associated with a sepn_mode of 'died'
id | sepn_mode | flag |
1 | transfer | 0 |
1 | home | 0 |
2 | home | 0 |
2 | died | 1 |
3 | home | 0 |
4 | res.care | 0 |
2. aggregate these flag values within each id group, returning a died flag for ids where any of the original values were flagged (a maximum aggregation function).
id | died |
1 | 0 |
2 | 1 |
3 | 0 |
4 | 0 |
SQL / SAS Proc SQL |
create table deathflag as
select id, max(sepn_mode = 'died') as died
from episode
group by id;

/* or if you are using Oracle SQL or similar systems that don't have a boolean (i.e. TRUE/FALSE) variable type: */

create table deathflag as
select id, max(case when sepn_mode = 'died' then 1 else 0 end) as died
from episode
group by id; |
R - Base R |
deathflag <- aggregate( cbind(died = sepn_mode == 'died') ~ id, data=episode, FUN=any ) |
R - data.table package* |
deathflag <- episode[, .(died = any(sepn_mode == 'died')), by=id] |
SAS Data Step / Procedures |
data episode;
  set episode;
  died = (sepn_mode = 'died');
run;

proc summary data=episode nway;
  var died;
  class id;
  output out=deathflag (drop = _:) max=died;
run; |
Stata |
gen died = sepn_mode == "died"
collapse (max) died, by(id) |
* - the data.table package is excellent for dealing with very large datasets in R, but is not strictly necessary.
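The two-step flag-then-aggregate logic can be checked end-to-end with Python's built-in sqlite3 module, using the episode data from the worked example above and the CASE form of the query (which works in systems without a boolean type):

```python
import sqlite3

# Episode-level rows from the worked example above.
episodes = [(1, "transfer"), (1, "home"), (2, "home"),
            (2, "died"), (3, "home"), (4, "res.care")]

con = sqlite3.connect(":memory:")
con.execute("create table episode (id integer, sepn_mode text)")
con.executemany("insert into episode values (?, ?)", episodes)

# Step 1 (flag rows where sepn_mode = 'died') and step 2 (take the maximum
# flag within each id group) combined into one grouped query.
deathflag = con.execute("""
select id, max(case when sepn_mode = 'died' then 1 else 0 end) as died
from episode
group by id
order by id
""").fetchall()

print(deathflag)  # [(1, 0), (2, 1), (3, 0), (4, 0)]
```

The result matches the patient-level table in step 2: only id 2, which has a 'died' episode, ends up flagged.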