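The sliding window is moved across the image on each step by the step size, or stride. A stride of 1 pixel gives the most thorough coverage, but in practice a larger stride (say 4 pixels) is used to reduce the number of windows the classifier has to evaluate.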
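As a rough illustration of the effect of the stride, the sketch below enumerates the top-left corners of every window that fits in an image and counts them. The function name `window_positions`, the 64x32 image size and the 16x16 window are assumptions chosen for the example, not values from these notes.

```python
def window_positions(image_w, image_h, win_w, win_h, stride):
    """Yield the (x, y) top-left corner of every window that fits in the image."""
    for y in range(0, image_h - win_h + 1, stride):
        for x in range(0, image_w - win_w + 1, stride):
            yield x, y


# Count how many patches the classifier would have to evaluate
# on a hypothetical 64x32 image with a 16x16 window.
dense = sum(1 for _ in window_positions(64, 32, 16, 16, stride=1))
coarse = sum(1 for _ in window_positions(64, 32, 16, 16, stride=4))

print(dense, coarse)  # 833 65
```

With these numbers, a stride of 4 cuts the work by more than a factor of ten, at the cost of locating each detection less precisely.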
For text detection, we can slide a small fixed-size window over the image and then paint a black-and-white image from the classifier's results, with white marking the windows classified as positive (text).
Then we expand the white regions of this image so that nearby detections merge, and from the resulting regions we can extract the text rectangles.
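The three steps (paint, expand, extract) can be sketched in plain Python as follows. This is only a minimal sketch under several assumptions: the names `paint_mask`, `expand` and `bounding_boxes` are hypothetical, the classifier is replaced by a toy `is_text` function, "expand" is interpreted as turning on every pixel within a square neighbourhood of a white pixel, and rectangles are taken as the bounding boxes of 4-connected white regions.

```python
from collections import deque


def paint_mask(width, height, win, stride, is_text):
    """Mark every pixel covered by a window the classifier accepts (1 = white)."""
    mask = [[0] * width for _ in range(height)]
    for y in range(0, height - win + 1, stride):
        for x in range(0, width - win + 1, stride):
            if is_text(x, y):
                for yy in range(y, y + win):
                    for xx in range(x, x + win):
                        mask[yy][xx] = 1
    return mask


def expand(mask, radius):
    """Turn on every pixel within `radius` (square neighbourhood) of a white pixel."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[yy][xx] = 1
    return out


def bounding_boxes(mask):
    """Return (x0, y0, x1, y1), inclusive, for each 4-connected white region."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if not mask[sy][sx] or seen[sy][sx]:
                continue
            # Breadth-first flood fill, tracking the extent of the region.
            seen[sy][sx] = True
            queue = deque([(sx, sy)])
            x0 = x1 = sx
            y0 = y1 = sy
            while queue:
                x, y = queue.popleft()
                x0, x1 = min(x0, x), max(x1, x)
                y0, y1 = min(y0, y), max(y1, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if 0 <= nx < w and 0 <= ny < h and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes


# Toy stand-in for the classifier: two positive windows with a gap between them.
positives = {(0, 0), (8, 0)}
raw = paint_mask(16, 8, win=4, stride=4, is_text=lambda x, y: (x, y) in positives)

boxes_before = bounding_boxes(raw)
boxes_after = bounding_boxes(expand(raw, radius=2))
print(boxes_before)  # [(0, 0, 3, 3), (8, 0, 11, 3)]
print(boxes_after)   # [(0, 0, 13, 5)]
```

In this toy run the two separate detections become a single rectangle after expansion, which is the point of the step: neighbouring positive windows that belong to the same piece of text end up in one region.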