Sampling: First and Last Records

Saturday, August 30, 2008

Updated: Wednesday, August 30, 2023
The original post in Excel/XLMiner has been ported over to Julia, Python, and R; migrated from http://glue.umd.edu/~mmallari/ (sunsetted); and refreshed with newer references. Image Credit: https://unsplash.com/photos/a-pencil-sitting-on-top-of-a-piece-of-paper-rYwsJLyhvC4

Imagine you’ve just received a CSV file containing student mathematics performance data. The school district claims they’ve been consistent with their data collection, but you’ve been around long enough to know that “consistent” is relative. Before you invest hours building models or generating reports, you need answers to critical questions:

Did the data load properly?
Are the column names intact?
Did the data collection methods change over time?

This is where a quick sampling of first and/or last n rows shines. By examining both extremes of your dataset, you can quickly spot issues that would otherwise derail your analysis hours later.

Why Both Ends Matter

Looking at just the first few rows tells you how the data collection began. Looking at the last few rows shows you how it ended. In time-ordered data, this is especially valuable. Perhaps the school changed its testing format midway through. Maybe it added new demographic fields in year three. Or perhaps there’s a data quality issue where recent entries are incomplete. Checking both ends takes seconds but can save you from embarrassing mistakes, such as training a model on inconsistent data formats or missing a critical data migration that happened between row 5,000 and row 5,001.
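To make the "both ends" idea concrete, here is a minimal Python sketch that counts missing values in the first and last n rows side by side. The helper name compare_ends, the column names, and the six records are all hypothetical and exist only for illustration; they are not part of the student dataset.

```python
import pandas

def compare_ends(frame, n=5):
    """Count missing values per column in the first n and last n rows."""
    head, tail = frame.head(n), frame.tail(n)
    return pandas.DataFrame({
        "head_missing": head.isna().sum(),
        "tail_missing": tail.isna().sum(),
    })

# Hypothetical records: "home_language" was only collected in later years.
records = pandas.DataFrame({
    "year": [2019, 2019, 2020, 2020, 2021, 2021],
    "score": [71, 64, 80, 58, 90, 77],
    "home_language": [None, None, None, "pt", "en", "pt"],
})

summary = compare_ends(records, n=2)
print(summary)
```

A column whose missing count differs sharply between the two ends is a hint that the collection process changed somewhere in between.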

Ingesting the Data

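Let’s load our student mathematics performance data, a CSV file named student-math.csv with columns for student information and performance. The file uses semicolons rather than commas as its delimiter, so each loader below is told to split on ";".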

Julia

using DataFrames
using CSV

# Load the data
student_math_jl = CSV.File("../../dataset/uci-ml-repo/student-performance/student-math.csv"; delim=";") |> DataFrames.DataFrame;
println("Dataset loaded: $(nrow(student_math_jl)) rows, $(ncol(student_math_jl)) columns")

Dataset loaded: 395 rows, 33 columns

Python

import pandas

# Load the data
student_math_py = pandas.read_csv("../../dataset/uci-ml-repo/student-performance/student-math.csv", sep=";")
print(f"Dataset loaded: {student_math_py.shape[0]} rows, {student_math_py.shape[1]} columns")

Dataset loaded: 395 rows, 33 columns

R

# Load the data
student_math_r <- read.csv("../../dataset/uci-ml-repo/student-performance/student-math.csv", sep=";", stringsAsFactors=TRUE)
cat(sprintf("Dataset loaded: %d rows, %d columns\n", nrow(student_math_r), ncol(student_math_r)))

Dataset loaded: 395 rows, 33 columns
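The row and column counts printed above are the first signal that the load worked. As a rough illustration of what a failed load can look like, the Python sketch below reads the same semicolon-delimited text twice, once with the wrong separator. The three-column text is a made-up stand-in, not the real file.

```python
import io
import pandas

# Tiny in-memory stand-in for a semicolon-delimited file (hypothetical rows).
raw_text = "school;sex;age\nGP;F;18\nMS;M;19\n"

# Wrong delimiter: no commas are found, so each line lands in a single column.
wrong = pandas.read_csv(io.StringIO(raw_text), sep=",")

# Right delimiter: the three fields are split into separate columns.
right = pandas.read_csv(io.StringIO(raw_text), sep=";")

print(wrong.shape)          # (2, 1)
print(right.shape)          # (2, 3)
print(list(wrong.columns))  # ['school;sex;age']
```

A column count of 1 where 33 were expected is the kind of problem the first few rows expose immediately.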

Exploring the First n Rows

When exploring a new dataset, one of the first things to do is look at the first few records. This gives you a sense of the data structure, the variable types, and any issues that may need cleaning or transformation. To view the first 12 records of the dataset, use the following in each language:

Julia

# View first 12 rows
first(student_math_jl, 12)

12×33 DataFrame
 Row │ school   sex      age    address  famsize  Pstatus  Medu   Fedu   Mjob      Fjob      reason      guardian  traveltime  studytime  failures  schoolsup  famsup   paid     activities  nursery  higher   internet  romantic  famrel  freetime  goout  Dalc   Walc   health  absences  G1     G2     G3
     │ String3  String1  Int64  String1  String3  String1  Int64  Int64  String15  String15  String15    String7   Int64       Int64      Int64     String3    String3  String3  String3     String3  String3  String3   String3   Int64   Int64     Int64  Int64  Int64  Int64   Int64     Int64  Int64  Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ GP       F           18  U        GT3      A            4      4  at_home   teacher   course      mother             2          2         0  yes        no       no       no          yes      yes      no        no             4         3      4      1      1       3         6      5      6      6
   2 │ GP       F           17  U        GT3      T            1      1  at_home   other     course      father             1          2         0  no         yes      no       no          no       yes      yes       no             5         3      3      1      1       3         4      5      5      6
   3 │ GP       F           15  U        LE3      T            1      1  at_home   other     other       mother             1          2         3  yes        no       yes      no          yes      yes      yes       no             4         3      2      2      3       3        10      7      8     10
   4 │ GP       F           15  U        GT3      T            4      2  health    services  home        mother             1          3         0  no         yes      yes      yes         yes      yes      yes       yes            3         2      2      1      1       5         2     15     14     15
   5 │ GP       F           16  U        GT3      T            3      3  other     other     home        father             1          2         0  no         yes      yes      no          yes      yes      no        no             4         3      2      1      2       5         4      6     10     10
   6 │ GP       M           16  U        LE3      T            4      3  services  other     reputation  mother             1          2         0  no         yes      yes      yes         yes      yes      yes       no             5         4      2      1      2       5        10     15     15     15
   7 │ GP       M           16  U        LE3      T            2      2  other     other     home        mother             1          2         0  no         no       no       no          yes      yes      yes       no             4         4      4      1      1       3         0     12     12     11
   8 │ GP       F           17  U        GT3      A            4      4  other     teacher   home        mother             2          2         0  yes        yes      no       no          yes      yes      no        no             4         1      4      1      1       1         6      6      5      6
   9 │ GP       M           15  U        LE3      A            3      2  services  other     home        mother             1          2         0  no         yes      yes      no          yes      yes      yes       no             4         2      2      1      1       1         0     16     18     19
  10 │ GP       M           15  U        GT3      T            3      4  other     other     home        mother             1          2         0  no         yes      yes      yes         yes      yes      yes       no             5         5      1      1      1       5         0     14     15     15
  11 │ GP       F           15  U        GT3      T            4      4  teacher   health    reputation  mother             1          2         0  no         yes      yes      no          yes      yes      yes       no             3         3      3      1      2       2         0     10      8      9
  12 │ GP       F           15  U        GT3      T            2      1  services  other     reputation  father             3          3         0  no         yes      no       yes         yes      yes      yes       no             5         2      2      1      1       4         4     10     12     12

# Check column types
describe(student_math_jl)

33×7 DataFrame
 Row │ variable  mean     min      median  max         nmissing  eltype
     │ Symbol    Union…   Any      Union…  Any         Int64     DataType
─────┼────────────────────────────────────────────────────────────────────
   1 │ school             GP               MS                 0  String3
   2 │ sex                F                M                  0  String1
   3 │ age       16.6962  15       17.0    22                 0  Int64
   4 │ address            R                U                  0  String1
   5 │ famsize            GT3              LE3                0  String3
   6 │ Pstatus            A                T                  0  String1
   7 │ Medu      2.74937  0        3.0     4                  0  Int64
   8 │ Fedu      2.52152  0        2.0     4                  0  Int64
  ⋮  │    ⋮         ⋮        ⋮       ⋮         ⋮          ⋮         ⋮
  27 │ Dalc      1.48101  1        1.0     5                  0  Int64
  28 │ Walc      2.29114  1        2.0     5                  0  Int64
  29 │ health    3.55443  1        4.0     5                  0  Int64
  30 │ absences  5.70886  0        4.0     75                 0  Int64
  31 │ G1        10.9089  3        11.0    19                 0  Int64
  32 │ G2        10.7139  0        11.0    19                 0  Int64
  33 │ G3        10.4152  0        11.0    20                 0  Int64
                                                           18 rows omitted

Python

# View first 12 rows
student_math_py.head(n=12)

   school sex  age address famsize Pstatus  Medu  Fedu      Mjob      Fjob      reason guardian  traveltime  studytime  failures schoolsup famsup paid activities nursery higher internet romantic  famrel  freetime  goout  Dalc  Walc  health  absences  G1  G2  G3
0      GP   F   18       U     GT3       A     4     4   at_home   teacher      course   mother           2          2         0       yes     no   no         no     yes    yes       no       no       4         3      4     1     1       3         6   5   6   6
1      GP   F   17       U     GT3       T     1     1   at_home     other      course   father           1          2         0        no    yes   no         no      no    yes      yes       no       5         3      3     1     1       3         4   5   5   6
2      GP   F   15       U     LE3       T     1     1   at_home     other       other   mother           1          2         3       yes     no  yes         no     yes    yes      yes       no       4         3      2     2     3       3        10   7   8  10
3      GP   F   15       U     GT3       T     4     2    health  services        home   mother           1          3         0        no    yes  yes        yes     yes    yes      yes      yes       3         2      2     1     1       5         2  15  14  15
4      GP   F   16       U     GT3       T     3     3     other     other        home   father           1          2         0        no    yes  yes         no     yes    yes       no       no       4         3      2     1     2       5         4   6  10  10
5      GP   M   16       U     LE3       T     4     3  services     other  reputation   mother           1          2         0        no    yes  yes        yes     yes    yes      yes       no       5         4      2     1     2       5        10  15  15  15
6      GP   M   16       U     LE3       T     2     2     other     other        home   mother           1          2         0        no     no   no         no     yes    yes      yes       no       4         4      4     1     1       3         0  12  12  11
7      GP   F   17       U     GT3       A     4     4     other   teacher        home   mother           2          2         0       yes    yes   no         no     yes    yes       no       no       4         1      4     1     1       1         6   6   5   6
8      GP   M   15       U     LE3       A     3     2  services     other        home   mother           1          2         0        no    yes  yes         no     yes    yes      yes       no       4         2      2     1     1       1         0  16  18  19
9      GP   M   15       U     GT3       T     3     4     other     other        home   mother           1          2         0        no    yes  yes        yes     yes    yes      yes       no       5         5      1     1     1       5         0  14  15  15
10     GP   F   15       U     GT3       T     4     4   teacher    health  reputation   mother           1          2         0        no    yes  yes         no     yes    yes      yes       no       3         3      3     1     2       2         0  10   8   9
11     GP   F   15       U     GT3       T     2     1  services     other  reputation   father           3          3         0        no    yes   no        yes     yes    yes      yes       no       5         2      2     1     1       4         4  10  12  12

# Check data types
student_math_py.dtypes

school        object
sex           object
age            int64
address       object
famsize       object
Pstatus       object
Medu           int64
Fedu           int64
Mjob          object
Fjob          object
reason        object
guardian      object
traveltime     int64
studytime      int64
failures       int64
schoolsup     object
famsup        object
paid          object
activities    object
nursery       object
higher        object
internet      object
romantic      object
famrel         int64
freetime       int64
goout          int64
Dalc           int64
Walc           int64
health         int64
absences       int64
G1             int64
G2             int64
G3             int64
dtype: object
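Note that dtypes reports column types only. A minimal sketch of a companion check for missing values follows, using a hypothetical three-row frame in place of student_math_py so the example is self-contained:

```python
import pandas

# Hypothetical three-row frame standing in for student_math_py.
frame = pandas.DataFrame({
    "age": [18, 17, None],
    "school": ["GP", None, "MS"],
    "G3": [6, 6, 10],
})

# Count missing entries in each column.
missing_per_column = frame.isna().sum()
print(missing_per_column)

# Keep only the columns that have at least one gap.
columns_with_gaps = list(missing_per_column[missing_per_column > 0].index)
print("columns with gaps:", columns_with_gaps)
```

The same isna().sum() call should work unchanged on the full frame.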

R

# View first 12 rows
head(x=student_math_r, n=12)

   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob     reason guardian traveltime studytime failures schoolsup famsup paid activities nursery higher internet romantic famrel freetime goout Dalc Walc health absences G1 G2 G3
1      GP   F  18       U     GT3       A    4    4  at_home  teacher     course   mother          2         2        0       yes     no   no         no     yes    yes       no       no      4        3     4    1    1      3        6  5  6  6
2      GP   F  17       U     GT3       T    1    1  at_home    other     course   father          1         2        0        no    yes   no         no      no    yes      yes       no      5        3     3    1    1      3        4  5  5  6
3      GP   F  15       U     LE3       T    1    1  at_home    other      other   mother          1         2        3       yes     no  yes         no     yes    yes      yes       no      4        3     2    2    3      3       10  7  8 10
4      GP   F  15       U     GT3       T    4    2   health services       home   mother          1         3        0        no    yes  yes        yes     yes    yes      yes      yes      3        2     2    1    1      5        2 15 14 15
5      GP   F  16       U     GT3       T    3    3    other    other       home   father          1         2        0        no    yes  yes         no     yes    yes       no       no      4        3     2    1    2      5        4  6 10 10
6      GP   M  16       U     LE3       T    4    3 services    other reputation   mother          1         2        0        no    yes  yes        yes     yes    yes      yes       no      5        4     2    1    2      5       10 15 15 15
7      GP   M  16       U     LE3       T    2    2    other    other       home   mother          1         2        0        no     no   no         no     yes    yes      yes       no      4        4     4    1    1      3        0 12 12 11
8      GP   F  17       U     GT3       A    4    4    other  teacher       home   mother          2         2        0       yes    yes   no         no     yes    yes       no       no      4        1     4    1    1      1        6  6  5  6
9      GP   M  15       U     LE3       A    3    2 services    other       home   mother          1         2        0        no    yes  yes         no     yes    yes      yes       no      4        2     2    1    1      1        0 16 18 19
10     GP   M  15       U     GT3       T    3    4    other    other       home   mother          1         2        0        no    yes  yes        yes     yes    yes      yes       no      5        5     1    1    1      5        0 14 15 15
11     GP   F  15       U     GT3       T    4    4  teacher   health reputation   mother          1         2        0        no    yes  yes         no     yes    yes      yes       no      3        3     3    1    2      2        0 10  8  9
12     GP   F  15       U     GT3       T    2    1 services    other reputation   father          3         3        0        no    yes   no        yes     yes    yes      yes       no      5        2     2    1    1      4        4 10 12 12

# Check structure
str(student_math_r)

'data.frame':	395 obs. of  33 variables:
 $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
 $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
 $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
 $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
 $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
 $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
 $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
 $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
 $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
 $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
 $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
 $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
 $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
 $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
 $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
 $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
 $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
 $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
 $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
 $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
 $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
 $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
 $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
 $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
 $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
 $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
 $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
 $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
 $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
 $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
 $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
 $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
 $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...

Exploring the Last n Rows

Similarly, the last few records show how the data ends, which is particularly useful for time-series data or datasets that have been appended to over time. To view the last 12 records of the dataset, use the following in each language:

Julia

# View last 12 rows
last(student_math_jl, 12)

12×33 DataFrame
 Row │ school   sex      age    address  famsize  Pstatus  Medu   Fedu   Mjob      Fjob      reason      guardian  traveltime  studytime  failures  schoolsup  famsup   paid     activities  nursery  higher   internet  romantic  famrel  freetime  goout  Dalc   Walc   health  absences  G1     G2     G3
     │ String3  String1  Int64  String1  String3  String1  Int64  Int64  String15  String15  String15    String7   Int64       Int64      Int64     String3    String3  String3  String3     String3  String3  String3   String3   Int64   Int64     Int64  Int64  Int64  Int64   Int64     Int64  Int64  Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ MS       M           19  R        GT3      T            1      1  other     services  other       mother             2          1         1  no         no       no       no          yes      yes      no        no             4         3      2      1      3       5         0      6      5      0
   2 │ MS       M           18  R        GT3      T            4      2  other     other     home        father             2          1         1  no         no       yes      no          yes      yes      no        no             5         4      3      4      3       3        14      6      5      5
   3 │ MS       F           18  R        GT3      T            2      2  at_home   other     other       mother             2          3         0  no         no       yes      no          yes      yes      no        no             5         3      3      1      3       4         2     10      9     10
   4 │ MS       F           18  R        GT3      T            4      4  teacher   at_home   reputation  mother             3          1         0  no         yes      yes      yes         yes      yes      yes       yes            4         4      3      2      2       5         7      6      5      6
   5 │ MS       F           19  R        GT3      T            2      3  services  other     course      mother             1          3         1  no         no       no       yes         no       yes      yes       no             5         4      2      1      2       5         0      7      5      0
   6 │ MS       F           18  U        LE3      T            3      1  teacher   services  course      mother             1          2         0  no         yes      yes      no          yes      yes      yes       no             4         3      4      1      1       1         0      7      9      8
   7 │ MS       F           18  U        GT3      T            1      1  other     other     course      mother             2          2         1  no         no       no       yes         yes      yes      no        no             1         1      1      1      1       5         0      6      5      0
   8 │ MS       M           20  U        LE3      A            2      2  services  services  course      other              1          2         2  no         yes      yes      no          yes      yes      no        no             5         5      4      4      5       4        11      9      9      9
   9 │ MS       M           17  U        LE3      T            3      1  services  services  course      mother             2          1         0  no         no       no       no          no       yes      yes       no             2         4      5      3      4       2         3     14     16     16
  10 │ MS       M           21  R        GT3      T            1      1  other     other     course      other              1          1         3  no         no       no       no          no       yes      no        no             5         5      3      3      3       3         3     10      8      7
  11 │ MS       M           18  R        LE3      T            3      2  services  other     course      mother             3          1         0  no         no       no       no          no       yes      yes       no             4         4      1      3      4       5         0     11     12     10
  12 │ MS       M           19  U        LE3      T            1      1  other     at_home   course      father             1          1         0  no         no       no       no          yes      yes      yes       no             3         2      3      3      3       5         5      8      9      9

Python

# View last 12 rows
student_math_py.tail(n=12)

    school sex  age address famsize Pstatus  Medu  Fedu      Mjob      Fjob      reason guardian  traveltime  studytime  failures schoolsup famsup paid activities nursery higher internet romantic  famrel  freetime  goout  Dalc  Walc  health  absences  G1  G2  G3
383     MS   M   19       R     GT3       T     1     1     other  services       other   mother           2          1         1        no     no   no         no     yes    yes       no       no       4         3      2     1     3       5         0   6   5   0
384     MS   M   18       R     GT3       T     4     2     other     other        home   father           2          1         1        no     no  yes         no     yes    yes       no       no       5         4      3     4     3       3        14   6   5   5
385     MS   F   18       R     GT3       T     2     2   at_home     other       other   mother           2          3         0        no     no  yes         no     yes    yes       no       no       5         3      3     1     3       4         2  10   9  10
386     MS   F   18       R     GT3       T     4     4   teacher   at_home  reputation   mother           3          1         0        no    yes  yes        yes     yes    yes      yes      yes       4         4      3     2     2       5         7   6   5   6
387     MS   F   19       R     GT3       T     2     3  services     other      course   mother           1          3         1        no     no   no        yes      no    yes      yes       no       5         4      2     1     2       5         0   7   5   0
388     MS   F   18       U     LE3       T     3     1   teacher  services      course   mother           1          2         0        no    yes  yes         no     yes    yes      yes       no       4         3      4     1     1       1         0   7   9   8
389     MS   F   18       U     GT3       T     1     1     other     other      course   mother           2          2         1        no     no   no        yes     yes    yes       no       no       1         1      1     1     1       5         0   6   5   0
390     MS   M   20       U     LE3       A     2     2  services  services      course    other           1          2         2        no    yes  yes         no     yes    yes       no       no       5         5      4     4     5       4        11   9   9   9
391     MS   M   17       U     LE3       T     3     1  services  services      course   mother           2          1         0        no     no   no         no      no    yes      yes       no       2         4      5     3     4       2         3  14  16  16
392     MS   M   21       R     GT3       T     1     1     other     other      course    other           1          1         3        no     no   no         no      no    yes       no       no       5         5      3     3     3       3         3  10   8   7
393     MS   M   18       R     LE3       T     3     2  services     other      course   mother           3          1         0        no     no   no         no      no    yes      yes       no       4         4      1     3     4       5         0  11  12  10
394     MS   M   19       U     LE3       T     1     1     other   at_home      course   father           1          1         0        no     no   no         no     yes    yes      yes       no       3         2      3     3     3       5         5   8   9   9

R

# View last 12 rows
tail(x=student_math_r, n=12)

    school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob     reason guardian traveltime studytime failures schoolsup famsup paid activities nursery higher internet romantic famrel freetime goout Dalc Walc health absences G1 G2 G3
384     MS   M  19       R     GT3       T    1    1    other services      other   mother          2         1        1        no     no   no         no     yes    yes       no       no      4        3     2    1    3      5        0  6  5  0
385     MS   M  18       R     GT3       T    4    2    other    other       home   father          2         1        1        no     no  yes         no     yes    yes       no       no      5        4     3    4    3      3       14  6  5  5
386     MS   F  18       R     GT3       T    2    2  at_home    other      other   mother          2         3        0        no     no  yes         no     yes    yes       no       no      5        3     3    1    3      4        2 10  9 10
387     MS   F  18       R     GT3       T    4    4  teacher  at_home reputation   mother          3         1        0        no    yes  yes        yes     yes    yes      yes      yes      4        4     3    2    2      5        7  6  5  6
388     MS   F  19       R     GT3       T    2    3 services    other     course   mother          1         3        1        no     no   no        yes      no    yes      yes       no      5        4     2    1    2      5        0  7  5  0
389     MS   F  18       U     LE3       T    3    1  teacher services     course   mother          1         2        0        no    yes  yes         no     yes    yes      yes       no      4        3     4    1    1      1        0  7  9  8
390     MS   F  18       U     GT3       T    1    1    other    other     course   mother          2         2        1        no     no   no        yes     yes    yes       no       no      1        1     1    1    1      5        0  6  5  0
391     MS   M  20       U     LE3       A    2    2 services services     course    other          1         2        2        no    yes  yes         no     yes    yes       no       no      5        5     4    4    5      4       11  9  9  9
392     MS   M  17       U     LE3       T    3    1 services services     course   mother          2         1        0        no     no   no         no      no    yes      yes       no      2        4     5    3    4      2        3 14 16 16
393     MS   M  21       R     GT3       T    1    1    other    other     course    other          1         1        3        no     no   no         no      no    yes       no       no      5        5     3    3    3      3        3 10  8  7
394     MS   M  18       R     LE3       T    3    2 services    other     course   mother          3         1        0        no     no   no         no      no    yes      yes       no      4        4     1    3    4      5        0 11 12 10
395     MS   M  19       U     LE3       T    1    1    other  at_home     course   father          1         1        0        no     no   no         no     yes    yes      yes       no      3        2     3    3    3      5        5  8  9  9
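Two details in the outputs above are worth noting. pandas labels rows starting at 0, so the last row of the 395-row frame is 394, while R labels it 395. Also, the first 12 rows all come from school GP and the last 12 from school MS, which suggests the file is grouped by school. To see both ends in one view, the sketch below stacks them into a single frame. The helper name peek_ends and the ten-row demo frame are assumptions made for illustration.

```python
import pandas

def peek_ends(frame, n=3):
    """Return the first n and last n rows stacked into one frame."""
    if len(frame) <= 2 * n:
        # The two ends would overlap, so return everything once.
        return frame.copy()
    return pandas.concat([frame.head(n), frame.tail(n)])

# Hypothetical ten-row frame; the default index runs from 0 to 9.
demo = pandas.DataFrame({"row_value": range(100, 110)})

ends = peek_ends(demo, n=2)
print(ends)
print(list(ends.index))  # [0, 1, 8, 9]
```

Because the original index labels are preserved, the jump from 1 to 8 shows where the middle of the data was skipped.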

The Five-Second Insight

In those few moments examining the head and tail, you might discover:

The first year used letter grades (A, B, C) while recent years use numeric scores (0-100)
Early entries are missing demographic data that later became mandatory
The testing date format changed from MM/DD/YYYY to YYYY-MM-DD
Recent rows have a new “intervention_program” column that doesn’t exist in older data

Each of these discoveries prevents hours of debugging later. That’s the beauty of this particular sampling method: maximum insight with minimum effort.
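As one way to turn the first item on that list into a check, the Python sketch below measures how much of a column reads as numeric at each end. The grade column and its values are invented to mimic the letter-to-numeric switch described above, and numeric_share is a hypothetical helper name.

```python
import pandas

def numeric_share(values):
    """Fraction of entries that can be read as numbers."""
    parsed = pandas.to_numeric(values, errors="coerce")
    return parsed.notna().mean()

# Hypothetical column: letter grades early on, numeric scores later.
grades = pandas.DataFrame(
    {"grade": ["A", "B", "C", "B", "88", "92", "75", "61"]}
)

head_share = numeric_share(grades["grade"].head(4))
tail_share = numeric_share(grades["grade"].tail(4))
print(head_share, tail_share)  # 0.0 1.0
```

A large gap between the two shares suggests the encoding changed somewhere in the middle and is worth tracking down before modeling.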

The Bottom Line

First and last n rows sampling isn’t about statistical rigor; it’s about practical wisdom. Before you stratify, cluster, or systematically sample your way to sophisticated strategies, spend 30 seconds checking both ends of your dataset. Your future self will thank you when you’re not explaining to your team why your “comprehensive analysis” crashed because row 47,293 had a text value in a numeric column.

In data science, sometimes the simplest check is the most powerful one.

Appendix A: Environment, Language & Package Versions, and Coding Style

If you are interested in reproducing this work, here are the versions of Julia, Python, and R that I used (as well as the respective packages for each). Additionally, my coding style here is deliberately verbose, so that it is easy to trace where functions, methods, and variables originate, and to make this a learning experience for everyone, including me.

Julia

using InteractiveUtils
InteractiveUtils.versioninfo()

Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin22.4.0)
  CPU: 8 × Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores
Environment:
  DYLD_FALLBACK_LIBRARY_PATH = /Library/Frameworks/R.framework/Resources/lib:/Library/Java/JavaVirtualMachines/jdk-21.jdk/Contents/Home/lib/server

using Pkg
Pkg.add(name="CSV", version="0.10.11")
Pkg.add(name="DataFrames", version="1.6.1")

using DataFrames
using CSV

Python

import sys
import platform
import os
import cpuinfo  # third-party; provided by the py-cpuinfo package
print(
    "Python", sys.version,
    "\nOS:", platform.system(), platform.platform(),
    "\nCPU:", os.cpu_count(), "x", cpuinfo.get_cpu_info()["brand_raw"]
)

Python 3.11.4 (v3.11.4:d2340ef257, Jun  6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)] 
OS: Darwin macOS-10.16-x86_64-i386-64bit 
CPU: 8 x Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz

!pip install pandas==2.0.3
import pandas

R

cat(
    R.version$version.string, "-", R.version$nickname,
    "\nOS:", Sys.info()["sysname"], R.version$platform,
    "\nCPU:", benchmarkme::get_cpu()$no_of_cores, "x", benchmarkme::get_cpu()$model_name
)

R version 4.2.3 (2023-03-15) - Shortstop Beagle 
OS: Darwin x86_64-apple-darwin17.0 
CPU: 8 x Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz

Further Readings