Optimizing parallel performance of under- and over-resampling with TensorFlow
- When training with an imbalanced (very large) dataset of multi-class, multi-labeled images
- Learning Visual Features from Large Weakly Supervised Data
- Armand Joulin, Laurens van der Maaten, Allan Jabri, Nicolas Vasilache
- https://arxiv.org/abs/1511.02251
- Exploring the Limits of Weakly Supervised Pretraining
- Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, Laurens van der Maaten
- https://arxiv.org/abs/1805.00932
- Learning Visual Features from Large Weakly Supervised Data
- Rejection sampling alone is not enough
- some samples are so rare that they normally do not appear in a single batch, or even in many rounds of batches
- rejection also consumes too many resources
- With the TensorFlow Dataset API
- over- and under-sampling with TensorFlow (from Stack Overflow)
- TensorFlow data input pipeline performance guide
- Optimizing performance : https://www.tensorflow.org/guide/performance/datasets#optimizing_performance
- Parallelize Data Transformation : https://www.tensorflow.org/guide/performance/datasets#parallelize_data_transformation
- tested with TensorFlow 1.13 on 8 Tesla P40 GPUs, 48 Intel CPU cores, and 251 GB of physical memory
- In principle, the only bottleneck of the data pipeline ought to be the GPUs. Let me assume that GPU time cannot be reduced.
Offline Preparation Flow
- Collect sample distribution A
- crawled and/or tagged data, collected without much consideration of the balance between class labels
- Design a resampled (target) distribution B
- designed with the class balance of labels in mind
- Make a transformation (vector) T from A to B
- Each entry is the (re)sampling ratio applied to samples in A to produce samples in B
- If a value p in T is less than 1, 100 * (1 - p) % of the samples are going to be dropped.
- p is a survival probability, i.e. (1 - p) is a drop probability.
- If a value p is greater than 1, samples are replicated p times.
- p is a replication ratio.
- Let's call p the ratio factor
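A minimal sketch of this offline step (NumPy; the per-label counts below are made-up numbers, and in the multi-label case the counts would be gathered per label rather than per sample):

```python
import numpy as np

# Hypothetical per-label sample counts for the observed distribution A
# and the desired (balanced) target distribution B.
counts_A = np.array([90000, 9000, 900, 100], dtype=np.float64)
counts_B = np.array([30000, 30000, 20000, 20000], dtype=np.float64)

# Ratio factor per label: p < 1 means 100 * (1 - p) % of that label's samples
# are dropped; p > 1 means each sample is replicated roughly p times.
T = counts_B / counts_A
print(T)  # e.g. [  0.333   3.333   22.222   200. ]
```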
Online Training Flow
- Loading tfrecords
- shuffle data records in memory (large shuffle)
- parsing record labels
- assign a uniform random probability (a "bet") to each record
- under-resampling with map and flat_map
- drop records whose bet is above their label's threshold (= ratio factor) given by T
- over-resampling with map and flat_map
- replicate a record N times, where N is the integer part of its ratio factor p (= N.xxx) given by T, whenever p is greater than 1
- shuffle data samples in memory (small shuffle)
- parsing record images
- decode JPEGs into tensors
- make a batch from the parsed records
- prefetch before and after the (parsing) map-and-batch stage
- train with batches
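A sketch of the label-parsing and bet-assignment steps (1.x-era API; the single int64 "label" feature, the helper names, and num_parallel_calls are assumptions, and a multi-label record would need its own feature spec):

```python
import tensorflow as tf

def parse_label(serialized):
    # Hypothetical feature spec: a single int64 label per record.
    features = tf.parse_single_example(
        serialized, {"label": tf.FixedLenFeature([], tf.int64)})
    # Keep the raw record around so the image can be decoded later, after resampling.
    return serialized, features["label"]

def attach_bet(serialized, label):
    # One uniform random "bet" per record, compared against the ratio factor later.
    return serialized, label, tf.random_uniform([], 0.0, 1.0)

dataset = dataset.map(parse_label, num_parallel_calls=num_parallel_calls)
dataset = dataset.map(attach_bet, num_parallel_calls=num_parallel_calls)
```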
Disk I/O, Memory, CPU, Bus I/O, and GPU Parallelism
- Because of undersampling, resampling wastes a lot of disk I/O.
- Lots of samples are dropped right after they are loaded from disk.
- So with undersampling, disk I/O is a much heavier burden than with ordinary sampling.
- And because of the high latency of disk I/O, it takes too long to wait for the next data records to load after each computation.
- a run of dropped records can continue for an arbitrarily long stretch
- So prefetch is necessary between file reading and record parsing (see the sketch below).
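A minimal sketch of that stage, assuming sharded tfrecord files (the glob pattern, cycle length, and buffer size are illustrative):

```python
import tensorflow as tf

# Read several tfrecord shards in parallel, and keep a prefetch buffer of raw
# records so that parsing does not stall on disk I/O after a run of drops.
files = tf.data.Dataset.list_files("train-*.tfrecord")
dataset = files.apply(tf.data.experimental.parallel_interleave(
    tf.data.TFRecordDataset, cycle_length=8))
dataset = dataset.prefetch(buffer_size=10000)
```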
- The records in data files might be unshuffled.
- So shuffle the data right after it is loaded into memory
- Record parsing is the CPU's job.
- With multiple CPUs, map with num_parallel_calls is useful for concurrent parsing.
- Undersampling before oversampling
- undersampling throws away some of the records.
- it reduces the size of the data sequence.
- it is a waste of CPU resources to process data that is about to be dropped.
- tf.data.Dataset.filter is not useful here, because it does not provide parallelism.
- map with num_parallel_calls is useful, but map transforms the contents of each record; it cannot add or remove records from the dataset itself.
- to handle duplication or elimination of records, flat_map is necessary.
- like this (a fuller sketch of such map functions follows below):
dataset = dataset.map(undersample_filter_fn, num_parallel_calls=num_parallel_calls)
dataset = dataset.flat_map(lambda x: tf.data.Dataset.from_tensor_slices(x))
flat_map with from_tensor_slices just merges the survived (and empty) copies back into a flat record stream; undersample_filter_fn returns each record with a leading dimension holding its replicated copies (zero copies for a dropped record)
# parallel calls of map('A'), map('B'), and map('C')
map('A') = 'AAAAA'  # A is replicated 5 times
map('B') = ''       # B is dropped
map('C') = 'CC'     # C is replicated twice
# merging all map results
flat_map('AAAAA,,CC') = 'AAAAACC'
https://www.tensorflow.org/images/datasets_parallel_map.png
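A fuller sketch of the two map functions, under the simplifying assumptions that each record carries a single integer label (the multi-label case needs a reduction over its labels), that ratio_factors is a constant tensor built from T, and that the dataset elements are the (serialized, label, bet) tuples sketched earlier; all names are illustrative:

```python
import tensorflow as tf

ratio_factors = tf.constant(T, dtype=tf.float32)  # hypothetical [num_labels] vector

def undersample_fn(serialized, label, bet):
    # Keep the record only if its bet falls under the label's survival probability;
    # a kept record gets a leading dimension of 1, a dropped record one of 0.
    p = tf.minimum(tf.gather(ratio_factors, label), 1.0)
    keep = tf.cast(bet < p, tf.int32)
    return (tf.tile(tf.expand_dims(serialized, 0), [keep]),
            tf.tile(tf.expand_dims(label, 0), [keep]))

def oversample_fn(serialized, label):
    # Replicate the record floor(p) times (at least once); the fractional part of p
    # could be handled with one more random bet, omitted here for brevity.
    n = tf.maximum(tf.cast(tf.floor(tf.gather(ratio_factors, label)), tf.int32), 1)
    return (tf.tile(tf.expand_dims(serialized, 0), [n]),
            tf.tile(tf.expand_dims(label, 0), [n]))

dataset = dataset.map(undersample_fn, num_parallel_calls=num_parallel_calls)
dataset = dataset.flat_map(lambda s, l: tf.data.Dataset.from_tensor_slices((s, l)))
dataset = dataset.map(oversample_fn, num_parallel_calls=num_parallel_calls)
dataset = dataset.flat_map(lambda s, l: tf.data.Dataset.from_tensor_slices((s, l)))
```

flat_map itself runs sequentially, so the heavy per-record work stays inside the parallel map calls and flat_map is left as a cheap merge.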
- decoding compressed images such as JPEG is a totally CPU-bound job
- and can (and should) be parallelized with map and num_parallel_calls
- a (shuffle) buffer of decoded image tensors occupies a very large amount of memory
- transmitting image tensors causes heavy bus I/O because of memory copying
- So after decoding image files such as JPEG, any operation that needs a memory copy should be minimized.
- To keep the GPUs from idling, disk I/O, bus I/O, and CPU resources should never be exhausted.
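A sketch of the image-parsing stage under these constraints (the "image" feature name, the resize target, batch size, and num_parallel_calls are assumptions):

```python
import tensorflow as tf

def parse_image(serialized, label):
    # Hypothetical feature spec: the JPEG bytes are stored under "image".
    features = tf.parse_single_example(
        serialized, {"image": tf.FixedLenFeature([], tf.string)})
    image = tf.image.decode_jpeg(features["image"], channels=3)
    image = tf.image.resize_images(image, [224, 224])
    return image, label

# Decode JPEGs concurrently on the CPUs; prefetch and batch immediately afterwards
# so the large decoded tensors are copied as few times as possible before the GPUs.
dataset = dataset.map(parse_image, num_parallel_calls=num_parallel_calls)
dataset = dataset.prefetch(2 * batch_size)
dataset = dataset.batch(batch_size)
```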
# Summary
- All ops in order
- file load with parallel interleave
- prefetch
- large shuffle
- parallel map for parse_record(label)
- parallel map for undersample
- flat_map
- prefetch
- parallel map for oversample
- flat_map
- small shuffle
- prefetch
- parallel map for parse_record(image)
- prefetch
- batch
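Putting it together, a minimal end-to-end sketch of this order (reusing the hypothetical functions from the sketches above; every buffer size, cycle length, parallelism degree, and batch size below is illustrative, not tuned):

```python
import tensorflow as tf

files = tf.data.Dataset.list_files("train-*.tfrecord")           # hypothetical shard pattern
dataset = files.apply(tf.data.experimental.parallel_interleave(   # file load with parallel interleave
    tf.data.TFRecordDataset, cycle_length=8))
dataset = dataset.prefetch(10000)                                 # prefetch
dataset = dataset.shuffle(100000)                                 # large shuffle
dataset = dataset.map(parse_label, num_parallel_calls=48)         # parallel map for parse_record(label)
dataset = dataset.map(attach_bet, num_parallel_calls=48)          # uniform bet per record
dataset = dataset.map(undersample_fn, num_parallel_calls=48)      # parallel map for undersample
dataset = dataset.flat_map(                                       # flat_map
    lambda s, l: tf.data.Dataset.from_tensor_slices((s, l)))
dataset = dataset.prefetch(10000)                                 # prefetch
dataset = dataset.map(oversample_fn, num_parallel_calls=48)       # parallel map for oversample
dataset = dataset.flat_map(                                       # flat_map
    lambda s, l: tf.data.Dataset.from_tensor_slices((s, l)))
dataset = dataset.shuffle(2000)                                   # small shuffle
dataset = dataset.prefetch(2000)                                  # prefetch
dataset = dataset.map(parse_image, num_parallel_calls=48)         # parallel map for parse_record(image)
dataset = dataset.prefetch(512)                                   # prefetch
dataset = dataset.batch(256)                                      # batch
```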