Code snippets
Counting rows and unique persons
Counting number of rows in a dataset
Standard SQL / SAS Proc SQL | SAS Data step | R (Base R) | Stata |
select count(*) from dset; | Use Proc SQL | nrow(dset) | count |
Counting number of unique persons
Standard SQL / SAS Proc SQL | SAS Data step | R (Base R) | Stata |
select count(distinct id) from dset; | Use Proc SQL | length(unique(dset$id)) # or length(unique(dset[["id"]])) | by id, sort: gen nvals = _n == 1
count if nvals |
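As a quick check of the SQL snippets above, the following sketch runs both counting queries against an in-memory SQLite database using Python's built-in sqlite3 module. The table name dset and column id follow the snippets; the example rows are made up for illustration.

```python
import sqlite3

# In-memory database with a small example table (names follow the snippets above).
con = sqlite3.connect(":memory:")
con.execute("create table dset (id integer, var text)")
con.executemany("insert into dset values (?, ?)",
                [(1, "a"), (1, "b"), (2, "a"), (3, "c")])

# Counting number of rows in a dataset
n_rows = con.execute("select count(*) from dset").fetchone()[0]

# Counting number of unique persons
n_ids = con.execute("select count(distinct id) from dset").fetchone()[0]

print(n_rows, n_ids)  # 4 rows, 3 unique ids
```

The same logic maps directly onto the other tools: nrow() versus length(unique()) in R, and count versus counting first-in-group rows in Stata.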
Basic Joins
Type of Join | Standard SQL / SAS Proc SQL | SAS Data step | R (Base R) | Stata |
Inner Join | create table mergeddata as select d1.id, d2.var from dset1 d1 inner join dset2 d2 on d1.id = d2.id; /* or */ create table mergeddata as select d1.id, d2.var from dset1 d1, dset2 d2 where d1.id = d2.id; | data mergeddata; merge dset1 (in = d1) dset2 (in = d2); by id; if d1 and d2; run; | mergeddata <- merge( dset1, dset2, by="id" ) | merge 1:1 id using "c:\data\dset2.dta", keep(match) |
Left Join | create table mergeddata as select d1.id, d2.var from dset1 d1 left join dset2 d2 on d1.id = d2.id; | data mergeddata; merge dset1 (in = d1) dset2 (in = d2); by id; if d1; run; | mergeddata <- merge( dset1, dset2, by="id", all.x=TRUE ) | merge 1:1 id using "c:\data\dset2.dta", keep(master match) |
Full Join | create table mergeddata as select d1.id, d2.var from dset1 d1 full outer join dset2 d2 on d1.id = d2.id; | data mergeddata; merge dset1 (in = d1) dset2 (in = d2); by id; if d1 or d2; run; | mergeddata <- merge( dset1, dset2, by="id", all=TRUE ) | merge 1:1 id using "c:\data\dset2.dta", keep(master match using) |
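The join semantics in the table above can be sketched with Python's built-in sqlite3 module. Table and column names (dset1, dset2, id, var) follow the snippets; the rows are invented for illustration. Because FULL OUTER JOIN is only available in recent SQLite releases, the full join is emulated here as a left join unioned with the unmatched right-hand rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table dset1 (id integer);
create table dset2 (id integer, var text);
insert into dset1 values (1), (2), (3);
insert into dset2 values (2, 'x'), (3, 'y'), (4, 'z');
""")

# Inner join: only ids present in both tables.
inner = con.execute(
    "select d1.id, d2.var from dset1 d1 inner join dset2 d2 on d1.id = d2.id"
).fetchall()

# Left join: all ids from dset1, with NULL where dset2 has no match.
left = con.execute(
    "select d1.id, d2.var from dset1 d1 left join dset2 d2 on d1.id = d2.id"
).fetchall()

# Full join, emulated for older SQLite versions that lack FULL OUTER JOIN.
full = con.execute("""
select d1.id, d2.var from dset1 d1 left join dset2 d2 on d1.id = d2.id
union
select d2.id, d2.var from dset2 d2 where d2.id not in (select id from dset1)
""").fetchall()

print(sorted(inner))  # [(2, 'x'), (3, 'y')]
print(sorted(left))   # [(1, None), (2, 'x'), (3, 'y')]
print(sorted(full))   # [(1, None), (2, 'x'), (3, 'y'), (4, 'z')]
```

The three result sets correspond to keep(match), keep(master match), and keep(master match using) in Stata, and to the default, all.x=TRUE, and all=TRUE arguments of merge() in base R.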
Aggregating
A common task when dealing with data stored at multiple levels, for instance multiple admissions per person, is aggregating. In data terms, this refers to collapsing many values associated with a common identifier to a single value.
This typically involves an aggregation function that reduces many values to a single value. Common aggregation functions include numeric calculations such as a count, sum, maximum, minimum or average (typically a mean). In some programs, calculations that return a TRUE or FALSE value can also be used.
Consider the process of making a patient-level flag for deceased patients.
The logic to do this would be something like:
1. flag all rows in the episode table that are associated with a sepn_mode of 'died'
id | sepn_mode | flag |
1 | transfer | 0 |
1 | home | 0 |
2 | home | 0 |
2 | died | 1 |
3 | home | 0 |
4 | res.care | 0 |
2. aggregate these flag values within each id group, returning a died flag for ids where any of the original values were flagged (a maximum aggregation function).
id | died |
1 | 0 |
2 | 1 |
3 | 0 |
4 | 0 |
SQL / SAS Proc SQL |
create table deathflag as
select id, max(sepn_mode = 'died') as died
from episode
group by id;

/* or if you are using Oracle SQL or similar systems that don't have a boolean (i.e. TRUE/FALSE) variable type: */

create table deathflag as
select id, max(case when sepn_mode = 'died' then 1 else 0 end) as died
from episode
group by id; |
R - Base R |
deathflag <- aggregate( cbind(died = sepn_mode == 'died') ~ id, data=episode, FUN=any ) |
R - data.table package* |
deathflag <- episode[, .(died = any(sepn_mode == 'died')), by=id] |
SAS Data Step / Procedures |
data episode;
  set episode;
  died = (sepn_mode = 'died');
run;

proc summary data=episode nway;
  var died;
  class id;
  output out=deathflag (drop = _:) max=died;
run; |
Stata |
gen died = sepn_mode == "died"
collapse (max) died, by(id) |
* - the data.table package is excellent for dealing with very large datasets in R, but is not strictly necessary.
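The two-step flag-then-aggregate logic can be checked end-to-end with Python's built-in sqlite3 module, using the episode data from the worked example above and the CASE form of the query (which works in systems without a boolean type):

```python
import sqlite3

# Episode-level rows from the worked example above.
episodes = [(1, "transfer"), (1, "home"), (2, "home"),
            (2, "died"), (3, "home"), (4, "res.care")]

con = sqlite3.connect(":memory:")
con.execute("create table episode (id integer, sepn_mode text)")
con.executemany("insert into episode values (?, ?)", episodes)

# Step 1 (flag rows where sepn_mode = 'died') and step 2 (take the maximum
# flag within each id group) combined into one grouped query.
deathflag = con.execute("""
select id, max(case when sepn_mode = 'died' then 1 else 0 end) as died
from episode
group by id
order by id
""").fetchall()

print(deathflag)  # [(1, 0), (2, 1), (3, 0), (4, 0)]
```

The result matches the patient-level table in step 2: only id 2, which has a 'died' episode, ends up flagged.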