In recent years, there has been a considerable rise in the applications in which object or image categorization is beneficial for example, analyzing medicinal images, assisting persons to organize their collections of photos, recognizing what is around self-driving vehicles, and many more. These applications necessitate accurately labeled datasets, in their majority involve an extensive diversity in the types of images, from cats or dogs to roads, landscapes, and so forth. The fundamental aim of image categorization is to predict the category or class for the input image by specifying to which it belongs. For human beings, this is not a considerable thing, however, learning computers to perceive represents a hard issue that has become a broad area of research interest, and both computer vision techniques and deep learning algorithms have evolved. Conventional techniques utilize local descriptors for finding likeness between images, however, nowadays; progress in technology has provided the utilization of deep learning algorithms, especially the Convolutional Neural Networks (CNNs) to auto-extract representative image patterns and features for classification The fundamental aim of this paper is to inspect and explain how to utilize the algorithms and technologies of deep learning to accurately classify a dataset of images into their respective categories and keep model structure complication to a minimum. To achieve this aim, must focus precisely and accurately on categorizing the objects or images into their respective categories with excellent results. And, specify the best deep learning-based models in image processing and categorization. The developed CNN-based models have been proposed and a lot of pre-training models such as (VGG19, DenseNet201, ResNet152V2, MobileNetV2, and InceptionV3) have been presented, and all these models are trained on the Caltech-101 and Caltech-256 datasets. Extensive and comparative experiments were conducted on this dataset, and the obtained results demonstrate the effectiveness of the proposed models. The obtained results demonstrate the effectiveness of the proposed models. The accuracy for Caltech-101 and Caltech-256 datasets was (98.06% and 90%) respectively.
Video prediction theories have quickly progressed especially after a great revolution of deep learning methods. The prediction architectures based on pixel generation produced a blurry forecast, but it is preferred in many applications because this model is applied on frames only and does not need other support information like segmentation or flow mapping information making getting a suitable dataset very difficult. In this approach, we presented a novel end-to-end video forecasting framework to predict the dynamic relationship between pixels in time and space. The 3D CNN encoder is used for estimating the dynamic motion, while the decoder part is used to reconstruct the next frame based on adding 3DCNN CONVLSTM2D in skip connection. This novel representation of skip connection plays an important role in reducing the blur predicted and preserved the spatial and dynamic information. This leads to an increase in the accuracy of the whole model. The KITTI and Cityscapes are used in training and Caltech is applied in inference. The proposed framework has achieved a better quality in PSNR=33.14, MES=0.00101, SSIM=0.924, and a small number of parameters (2.3 M).