Optimizing parallel performance of under- and over-resampling with TensorFlow
- When training on an imbalanced (very large) dataset of multi-class, multi-label images
    
- Learning Visual Features from Large Weakly Supervised Data
  - Armand Joulin, Laurens van der Maaten, Allan Jabri, Nicolas Vasilache
  - https://arxiv.org/abs/1511.02251
- Exploring the Limits of Weakly Supervised Pretraining
  - Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, Laurens van der Maaten
  - https://arxiv.org/abs/1805.00932
 
 
- Learning Visual Features from Large Weakly Supervised Data
  - Rejection sampling is not enough
    - Some samples are so rare that they normally do not appear in one batch, or even in many rounds of batches.
    - Rejection consumes too many resources.
 
- With the TensorFlow Dataset API
  - over-and-under sampling with TensorFlow, from Stack Overflow
  - TensorFlow data input pipeline performance guide
    - Optimizing performance: https://www.tensorflow.org/guide/performance/datasets#optimizing_performance
    - Parallelize data transformation: https://www.tensorflow.org/guide/performance/datasets#parallelize_data_transformation
 
- Tested with TensorFlow 1.13 on 8 Tesla P40 GPUs, with 48 Intel CPU cores and 251 GB of physical memory.
 
- In principle, the only bottleneck of the data pipeline ought to be the GPUs. Let me assume that GPU time cannot be reduced.
 
Offline Preparation Flow
- Collect sample distribution A
  - crawled and/or tagged data, collected without much concern for the balance between class labels
 
- Design the resampled (target) distribution B
  - designed with the class balance of the labels in mind
 
- Make a transformation (vector) T from A to B
  - Each element of T is a (re)sampling probability applied to samples in A to produce samples in B
  - If an element value p in T is less than 1, 100 * (1 - p) % of that label's samples are going to be dropped.
    - p is a survival probability, i.e. (1 - p) is a drop probability.
  - If an element value p is more than 1, that label's samples are replicated p times.
    - p is a replication ratio.
  - Let's call p a ratio factor (see the sketch after this list).
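A minimal sketch of the offline computation, assuming three labels with made-up counts; the names `counts_a` and `dist_b` are hypothetical. Scaling T as B / A keeps the expected total number of samples unchanged:

```python
import numpy as np

# Hypothetical per-label sample counts collected offline (distribution A).
counts_a = np.array([90000, 9000, 1000], dtype=np.float64)
dist_a = counts_a / counts_a.sum()      # A as label probabilities

# Target (resampled) distribution B, e.g. a uniform class balance.
dist_b = np.full(3, 1.0 / 3.0)

# Ratio factor vector T: element-wise B / A.
# p < 1 -> drop 100 * (1 - p) % of that label's samples,
# p > 1 -> replicate that label's samples p times.
T = dist_b / dist_a
print(T)   # approx. [0.37, 3.70, 33.33]
```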
 
Online Training Flow
- Loading tfrecords
 - shuffle data records in memory (large shuffle)
 - parsing record labels
- give each record a uniform random probability (bet)
- under-resampling with map and flat_map
  - drop records whose bet exceeds their label's threshold (= ratio factor p) given T, so that each record survives with probability p
 
- over-resampling with map and flat_map
  - replicate N times the records whose ratio factor p (= N.xxx) given T is greater than 1, with N = floor(p) (see the sketch after this list)
 
 - shuffle data samples in memory (small shuffle)
- parsing record images
  - decode JPEGs into tensors
 
- make batches from parsed records
  - prefetch before and after the (parsing) map-and-batch
 
 - train with batches
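A sketch of how the ratio factor could drive both resampling decisions for a single record (single-label for simplicity). `ratio_factor` is the vector T from the offline step; treating the fractional part of p as an extra-copy probability is an assumption, not part of the original description:

```python
import tensorflow as tf

# Hypothetical ratio factors T, indexed by integer label id.
ratio_factor = tf.constant([0.37, 3.70, 33.33])

def survives(label):
    """Under-resampling: keep the record with probability min(p, 1)."""
    p = tf.gather(ratio_factor, label)
    bet = tf.random_uniform([], 0.0, 1.0)   # the record's uniform 'bet'
    return bet < tf.minimum(p, 1.0)

def num_copies(label):
    """Over-resampling: floor(p) copies, plus one extra copy with
    probability frac(p), so the expected number of copies equals p."""
    p = tf.maximum(tf.gather(ratio_factor, label), 1.0)
    base = tf.floor(p)
    extra = tf.cast(tf.random_uniform([], 0.0, 1.0) < (p - base), p.dtype)
    return tf.cast(base + extra, tf.int64)
```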
 
Disk I/O, Memory, CPUs, Bus I/O, and GPUs parallelism
- Because of undersampling, resampling wastes a lot of disk I/O.
  - Lots of samples are dropped right after they are loaded from disk.
  - So with undersampling, disk I/O is a much heavier burden than with ordinary sampling.
  - And because of the high latency of disk I/O, waiting for the next data records to load after each computation step takes too long.
    - dropping can continue over an arbitrarily long run of records
  - So prefetch is necessary between file reading and record parsing (a sketch follows).
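A sketch of the loading stage; the shard path pattern, cycle length, and buffer sizes are illustrative only:

```python
import tensorflow as tf

# Hypothetical path to sharded TFRecord files.
files = tf.data.Dataset.list_files('/data/train/shard-*.tfrecord')

# Read several shards concurrently so disk I/O keeps up with the drops.
dataset = files.apply(tf.data.experimental.parallel_interleave(
    tf.data.TFRecordDataset, cycle_length=8))

# Decouple file reading from record parsing.
dataset = dataset.prefetch(buffer_size=10000)

# Large in-memory shuffle of the raw (still serialized) records.
dataset = dataset.shuffle(buffer_size=100000)
```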
 
- The records in the data files might be unshuffled.
  - So shuffle the data right after it is loaded into memory.
 
- Record parsing is a CPU job.
  - With multiple CPUs, map with num_parallel_calls is useful for concurrent parsing (a sketch follows).
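A sketch of label parsing with a parallel map; the tf.Example feature names (`label`, `image/encoded`) and the number of parallel calls are assumptions:

```python
import tensorflow as tf

num_parallel_calls = 16   # hypothetical; roughly the number of CPU cores

def parse_record_label(serialized):
    # Parse only what the resampling stages need; keep the JPEG bytes
    # encoded so the expensive decode happens after resampling.
    return tf.parse_single_example(serialized, features={
        'label': tf.FixedLenFeature([], tf.int64),
        'image/encoded': tf.FixedLenFeature([], tf.string),
    })

dataset = dataset.map(parse_record_label, num_parallel_calls=num_parallel_calls)
```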
 
- Undersampling before oversampling
  - undersampling throws away some of the records.
  - it reduces the size of the data sequence.
  - it is a waste of CPU resources to work on data that is about to be dropped.
  - tf.data.Dataset.filter is not useful here, because it does not provide parallelism.
  - map with num_parallel_calls is useful, but map transforms the contents of each record; it cannot add or remove records.
  - to handle the duplication or elimination of records, flat_map is necessary.
  - like this:
        
```python
dataset = dataset.map(undersample_filter_fn, num_parallel_calls=num_parallel_calls)
dataset = dataset.flat_map(lambda x: x)
```

flat_map with the identity lambda function is just for merging the surviving (and empty) records.

```
# parallel calls of map('A'), map('B'), and map('C')
map('A') = 'AAAAA'   # replication of A 5 times
map('B') = ''        # B is dropped
map('C') = 'CC'      # replication of C twice
# merging all the map results
flat_map('AAAAA,,CC') = 'AAAAACC'
```

https://www.tensorflow.org/images/datasets_parallel_map.png
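Note that flat_map expects its function to return a tf.data.Dataset, so the identity lambda needs a small adjustment in practice. Below is a hedged, runnable sketch of one way to wire both stages, reusing the hypothetical `survives` / `num_copies` helpers, the `label` feature name, and `num_parallel_calls` from the earlier sketches:

```python
import tensorflow as tf

def undersample_fn(record):
    # The parallel map decides survival (0 or 1 copies).
    return record, tf.cast(survives(record['label']), tf.int64)

def oversample_fn(record):
    # The parallel map decides how many replicas a record gets.
    return record, num_copies(record['label'])

def expand(record, n):
    # flat_map only merges: emit the record n times (possibly zero).
    return tf.data.Dataset.from_tensors(record).repeat(n)

dataset = dataset.map(undersample_fn, num_parallel_calls=num_parallel_calls)
dataset = dataset.flat_map(expand)          # drop the non-survivors
dataset = dataset.map(oversample_fn, num_parallel_calls=num_parallel_calls)
dataset = dataset.flat_map(expand)          # replicate the rare records
```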

 
- decoding compressed images such as JPEG is a totally CPU-bound job
  - and it can, and should, be parallelized with map and num_parallel_calls, as sketched below.
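A sketch of the image-parsing map, continuing the earlier snippets; the output size and feature names are assumptions:

```python
import tensorflow as tf

def parse_record_image(record):
    # CPU-bound JPEG decode, spread across cores by num_parallel_calls.
    image = tf.image.decode_jpeg(record['image/encoded'], channels=3)
    image = tf.image.resize_images(image, [224, 224])   # hypothetical size
    return {'image': image, 'label': record['label']}

dataset = dataset.map(parse_record_image, num_parallel_calls=num_parallel_calls)
```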
 
- A (shuffle) buffer of decoded image tensors occupies a very large amount of memory.
  - Transmitting image tensors causes heavy bus I/O because of memory copying.
  - So, after decoding image files such as JPEG, any operation that needs a memory copy should be minimized.
 
- To keep the GPUs from going idle, disk I/O, bus I/O, and CPU resources should never be exhausted.
 
# Summary
- All ops in order (see the sketch after this list)
    
- file load with parallel interleave
 - prefetch
 - large shuffle
 - parallel map for parse_record(label)
 - parallel map for undersample
 - flat_map
 - prefetch
 - parallel map for oversample
 - flat_map
 - small shuffle
 - prefetch
 - parallel map for parse_record(image)
 - prefetch
 - batch
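Putting the op order above into one sketch, reusing the hypothetical helpers from the earlier snippets; all buffer sizes, the batch size, and the shard pattern are illustrative, not measured values:

```python
import tensorflow as tf

def build_input_pipeline(file_pattern, batch_size=256, num_parallel_calls=16):
    files = tf.data.Dataset.list_files(file_pattern)
    ds = files.apply(tf.data.experimental.parallel_interleave(
        tf.data.TFRecordDataset, cycle_length=8))             # file load
    ds = ds.prefetch(10000)                                   # prefetch
    ds = ds.shuffle(100000)                                   # large shuffle
    ds = ds.map(parse_record_label, num_parallel_calls=num_parallel_calls)
    ds = ds.map(undersample_fn, num_parallel_calls=num_parallel_calls)
    ds = ds.flat_map(expand)                                  # drop non-survivors
    ds = ds.prefetch(10000)                                   # prefetch
    ds = ds.map(oversample_fn, num_parallel_calls=num_parallel_calls)
    ds = ds.flat_map(expand)                                  # replicate rare records
    ds = ds.shuffle(10000)                                    # small shuffle
    ds = ds.prefetch(1000)                                    # prefetch
    ds = ds.map(parse_record_image, num_parallel_calls=num_parallel_calls)
    ds = ds.prefetch(2)                                       # prefetch decoded images
    return ds.batch(batch_size)                               # batch
```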