Computer vision based on convolutional neural network (CNN) models has become a reliable and powerful tool for detecting potential damage in concrete structures at the pixel level. In this study, an advanced SWIN U-Net architecture was introduced to detect concrete cracks. The model integrated attention-based convolutional neural networks to significantly enhance the speed and accuracy of crack detection. The distinctive features of the SWIN Transformer allowed the model to be applied to images of varying sizes while using computational resources efficiently. The model was trained on a dataset of crack images, each paired with a mask that highlighted the relevant regions within the image. The training data were augmented using Flip, Rotate, Random Contrast, Random Gamma, Random Brightness, Elastic Transformation, Grid Distortion, and …