Demo

This demo shows how to use dhSegment for page document extraction. It trains a model from scratch (optional) using the READ-BAD dataset [GruningLD+18] and the PageNet annotations [TDW+17] (annotator1 is used). To limit memory usage, the images in the dataset we provide have been downsized to about 1M pixels each.
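The 1M-pixel downsizing preserves each image's aspect ratio while capping its total pixel count. A minimal sketch of how such a target size can be computed (the function and constant names are illustrative, not part of dhSegment):

```python
import math

TARGET_PIXELS = 1_000_000  # ~1M pixels, as in the provided dataset

def downscaled_size(width, height, target=TARGET_PIXELS):
    """Return (new_w, new_h) with at most `target` total pixels,
    preserving aspect ratio. Images already small enough are untouched."""
    if width * height <= target:
        return width, height
    scale = math.sqrt(target / (width * height))
    return max(1, int(width * scale)), max(1, int(height * scale))
```

For example, a 4000x3000 scan (12M pixels) would be scaled by sqrt(1/12) in each dimension, landing just under the 1M-pixel budget.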

How to

  1. If you have not yet done so, clone the repository:

    git clone https://github.com/dhlab-epfl/dhSegment.git
    

  2. Get the annotated dataset here, which already contains the images and labels folders for the training, validation, and test sets. Unzip it into demo/pages:

    cd demo/
    wget https://github.com/dhlab-epfl/dhSegment/releases/download/v0.2/pages.zip
    unzip pages.zip
    cd ..
  3. (Only needed if training from scratch) Download the pretrained ResNet weights:

    cd pretrained_models/
    python download_resnet_pretrained_model.py
    cd ..
    

  4. You can train the model from scratch with python train.py with demo/demo_config.json, but because this takes quite some time, we recommend skipping this step and downloading the provided model instead (download and unzip it into demo/model):

    cd demo/
    wget https://github.com/dhlab-epfl/dhSegment/releases/download/v0.2/model.zip
    unzip model.zip
    cd ..

  5. (Only if training from scratch) You can visualize training progress in TensorBoard by running tensorboard --logdir . in the demo folder.

  6. Run python demo.py

  7. Have a look at the results in demo/processed_images.