Research Article

Local Image Descriptor Based Phishing Web Page Recognition as an Open-Set Problem

Year 2019, Special Issue 2019, 444 - 451, 31.10.2019
https://doi.org/10.31590/ejosat.638404

Abstract

With the advent of e-commerce, digital services and social media, scammers have shifted to obtaining illegal benefits by capturing credit card information or exploiting personal cloud accounts, a practice termed phishing. To combat this cyber crime, the last two decades have seen a variety of methodologies, such as HTML content based similarity analysis, URL based classification and, more recently, visual similarity based matching, since phishing web pages visually mimic their legitimate counterparts to deceive innocent users. In this study, we propose a computer vision and machine learning based approach that classifies whether a suspicious web page is phishing and further recognizes the brand it imitates. We investigate two local image descriptors: Scale Invariant Feature Transform (SIFT) and DAISY. Although they share properties such as scale invariance, the two descriptors differ in clear ways: SIFT is rotation invariant and samples at key-points, whereas DAISY applies dense sampling by default. We therefore first examine the feasibility of these two descriptors and the effects of sampling strategy and rotational invariance in this problem domain. To build a discriminative representation of a web page, we follow the bag of visual words (BOVW) approach with vocabulary sizes of 50, 100, 200 and 400. For evaluation, we use a publicly available phishing dataset containing snapshots of web pages sampled from 14 highly phished brands and from ordinary legitimate web pages, which yields a challenging open-set problem. The dataset comprises 1313 training and 1539 testing images in total.
The visual features extracted via SIFT and DAISY were first transformed into BOVW histograms and then fed to three machine learning methods: SVM, Random Forest and XGBoost. In our experiments, the SIFT descriptor with a 400-D visual vocabulary and XGBoost was the best descriptor-learner configuration, reaching up to 89.34% validation accuracy with a 0.76% false positive rate. Moreover, SIFT outperformed DAISY in all settings. These results show that SIFT descriptors combined with a BOVW representation can be used effectively for brand identification of phishing web pages.
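
The bag of visual words step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the descriptors here are synthetic toy vectors standing in for SIFT or DAISY output, the vocabulary is built with a plain k-means loop, and the function names (`build_vocabulary`, `bovw_histogram`, `toy_descriptors`) and the vocabulary size of 3 are assumptions chosen for illustration (the study reports sizes of 50 to 400).

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, vocabulary):
    """Index of the visual word (cluster center) closest to the descriptor."""
    return min(range(len(vocabulary)),
               key=lambda i: squared_distance(descriptor, vocabulary[i]))


def build_vocabulary(descriptors, k, iterations=20, seed=0):
    """Cluster pooled local descriptors into k visual words with plain k-means."""
    rng = random.Random(seed)
    centers = [list(d) for d in rng.sample(descriptors, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for d in descriptors:
            clusters[nearest_word(d, centers)].append(d)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def bovw_histogram(descriptors, vocabulary):
    """Encode one image as a normalized histogram of visual word counts."""
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        hist[nearest_word(d, vocabulary)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def toy_descriptors(rng, center, n):
    """Hypothetical stand-in for local descriptors extracted from a page snapshot."""
    return [[c + rng.gauss(0, 0.05) for c in center] for _ in range(n)]


# Two toy "pages" whose descriptors concentrate around different patterns.
rng = random.Random(42)
page_a = toy_descriptors(rng, [0, 0, 0, 0], 30) + toy_descriptors(rng, [1, 1, 1, 1], 10)
page_b = toy_descriptors(rng, [0, 0, 0, 0], 10) + toy_descriptors(rng, [0, 1, 0, 1], 30)

# The vocabulary is learned from the pooled descriptors of all training pages.
vocabulary = build_vocabulary(page_a + page_b, k=3)
hist_a = bovw_histogram(page_a, vocabulary)
hist_b = bovw_histogram(page_b, vocabulary)
print(hist_a)
print(hist_b)
```

Each page ends up as a fixed-length vector regardless of how many local descriptors it produced, which is what allows a classifier such as SVM, Random Forest or XGBoost to be trained on the histograms.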

References

  • Drake, C. E., Oliver, J. J., & Koontz, E. J. (2014). Anatomy of a phishing email. In CEAS 2014.
  • Varshney, G., Misra, M., & Atrey, P. K. (2016). A survey and classification of web phishing detection schemes. Security and Communication Networks, 8, 6266-6284.
  • APWG. Phishing Attack Trends Report. Retrieved from http://docs.apwg.org/reports/apwg_trends_report_q4_2017.pdf on 02.6.2019.
  • Dalgic, F. C., Bozkir, A. S., & Aydos, M. (2018). Phish-IRIS: A New Approach for Vision Based Brand Prediction of Phishing Web Pages via Compact Visual Descriptors. In 2018 2nd International Symposium on Multidisciplinary Studies and Innovative Technologies (pp. 1-8). IEEE.
  • Rao, R. S., & Pais, A. R. (2018). Detection of phishing web sites using an efficient feature-based machine learning framework. Neural Computing and Applications, 1-23.
  • Lam, I. F., Xiao, W. C., Wang, S. C., & Chen, K. T. (2009). Counteracting Phishing Page Polymorphism: An Image Layout Analysis Approach. LNCS (pp. 270-279).
  • Chen, K. T., Chen, J. Y., Huang, C. R., & Chen, C. S. (2009). Fighting Phishing with Discriminative Keypoint Features. IEEE Internet Computing, 13(3), 56-63.
  • Rao, R. S., & Ali, S. T. (2015). A Computer Vision Technique to Detect Phishing Attacks. In 2015 Fifth International Conference on Communication Systems and Network Technologies (pp. 596-601). IEEE.
  • Bozkir, A. S., & Akcapinar Sezer, E. (2016). Use of HOG Descriptors in Phishing Detection. In 4th International Symposium on Digital Forensic and Security (ISDFS).
  • Zhang, W., Lu, H., Xu, B., & Yang, H. (2013). Web phishing detection based on page spatial layout similarity. Informatica, 37(3).
  • Hara, M., Yamada, A., & Miyake, Y. (2009). Visual similarity-based phishing detection without victim site information. In IEEE Symposium on Computational Intelligence in Cyber Security (pp. 30-36). IEEE.
  • Corona, I., Biggio, B., Contini, M., Piras, L., Corda, R., Mereu, M., Mureddu, G., Ariu, D., & Roli, F. (2017). Delta-Phish: Detecting Phishing Webpages in Compromised Websites. In ESORICS 2017.
  • Li, Y., Yang, Z., Chen, X., Yuan, H., & Liu, W. (2019). A Stacking Model using URL and HTML Features for Phishing Webpage Detection. Future Generation Computer Systems, 94, 27-39.
  • Lowe, D. G. (2004). Distinctive image features from scale invariant keypoints. International Journal of Computer Vision, 60.
  • Sachdeva, V. D. et al. (2017). Performance Evaluation of SIFT and Convolutional Neural Network for Image Retrieval. International Journal of Advanced Computer Science and Applications, 8.
  • Karami, E., Shehata, M., & Smith, A. (2015). Image Identification Using SIFT Algorithm: Performance Analysis against Different Image Deformations. In Newfoundland Electrical and Computer Engineering Conference.
  • Keser, R. K., Ergun, E., & Töreyin, B. U. (2017). Vehicle Logo Recognition with Reduced-Dimension SIFT Vectors Using Autoencoders. In International Workshop on Computational Intelligence for Multimedia Understanding.
  • Tola, E., Lepetit, V., & Fua, P. (2010). DAISY: An Efficient Dense Descriptor Applied to Wide-Baseline Stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273-297.
  • Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

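Recognition of Phishing Web Pages with Local Visual Descriptors as an Open-Set Problem (Turkish title: Bir Açık Küme Problemi Olarak Yerel Görsel Betimleyicilerle Oltalayıcı Web Sayfalarının Tanınması)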


Abstract

With the growth of e-commerce, digital services and social media, cyber attackers seeking illegal gains have adopted a new type of attack, now known as "phishing", which aims to capture credit card details or the credentials of personal cloud accounts. In response, the last twenty years have produced a variety of countermeasures, such as HTML content based similarity analysis, URL based classification and, more recently, visual similarity based matching, since fake pages imitate their genuine versions to mislead innocent users. This study proposes an approach based on computer vision and machine learning to classify whether a suspicious web page is a phishing page and further to recognize its original brand name. To this end, two local image descriptors, Scale Invariant Feature Transform (SIFT) and DAISY, were investigated and used. Besides shared properties such as scale invariance, these descriptors differ in clear ways: in addition to being rotation invariant, SIFT applies key-point based sampling, whereas DAISY uses dense sampling by default. The study therefore first investigates the feasibility of these two local image descriptors, together with the effects of sampling strategy and rotational invariance in the problem domain. In addition, to obtain a discriminative representation of web pages, the bag of visual words (BOVW) approach was adopted, and representations with vocabulary sizes of 50, 100, 200 and 400 were produced. The proposed approach was evaluated on a challenging dataset containing page snapshots of 14 brands that are heavily targeted by phishing attacks and of various legitimate web pages.
From a machine learning perspective, this dataset poses an "open-set problem" and contains 1313 training and 1539 test images in total. The visual features extracted with the SIFT and DAISY descriptors were first converted into BOVW histograms and then used to train three machine learning methods: SVM, Random Forest and XGBoost. In the experiments, SIFT descriptors with a 400-word visual vocabulary, combined with XGBoost, reached 89.34% validation accuracy with a 0.76% FPR and were identified as the best descriptor-learner pair. SIFT also outperformed the DAISY descriptor in all configurations. In conclusion, SIFT descriptors with a BOVW representation are shown to be effective for recognizing which brand a phishing web page imitates.
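
The two reported figures, accuracy and false positive rate (FPR), can be related to each other with a small sketch. The abstract does not state how the FPR was computed, so this sketch assumes that legitimate pages form a single class next to the brand classes and that a false positive is a legitimate page assigned to any brand. The labels (`brand_a`, `brand_b`, `legitimate`) and the predictions are hypothetical toy values, not data from the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def false_positive_rate(y_true, y_pred, negative_label="legitimate"):
    """Fraction of legitimate pages that were assigned to some phishing brand.

    Assumption: every label other than `negative_label` is a phishing brand.
    """
    predictions_for_negatives = [p for t, p in zip(y_true, y_pred)
                                 if t == negative_label]
    if not predictions_for_negatives:
        return 0.0
    flagged = sum(p != negative_label for p in predictions_for_negatives)
    return flagged / len(predictions_for_negatives)


# Hypothetical toy labels: two brand classes plus the open "legitimate" class.
y_true = ["brand_a", "brand_b", "legitimate", "legitimate",
          "legitimate", "legitimate", "brand_a", "legitimate"]
y_pred = ["brand_a", "brand_a", "legitimate", "brand_b",
          "legitimate", "legitimate", "brand_a", "legitimate"]

print(accuracy(y_true, y_pred))             # 6 of 8 labels match
print(false_positive_rate(y_true, y_pred))  # 1 of 5 legitimate pages flagged
```

Under this reading, a low FPR means that few ordinary pages are mistaken for phishing pages, while accuracy also counts confusions between brands.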

There are 19 citations in total.

Details

Primary Language English
Subjects Engineering
Journal Section Articles
Authors

Ahmet Selman Bozkır 0000-0003-4305-7800

Murat Aydos 0000-0002-7570-9204

Publication Date October 31, 2019
Published in Issue Year 2019 Special Issue 2019

Cite

APA Bozkır, A. S., & Aydos, M. (2019). Local Image Descriptor Based Phishing Web Page Recognition as an Open-Set Problem. Avrupa Bilim ve Teknoloji Dergisi, 444-451. https://doi.org/10.31590/ejosat.638404