Data science: a review of the state of the art

Naivy Pujol Méndez, Joelsy Porven Rubier

Abstract


With the explosive growth of structured and unstructured data, organizations are looking for ways to innovate through analytics and data science; the availability of Big Data allows organizations in every industry to take advantage of data analysis. The aim of this article is therefore to review the state of the art in data science. An initial study was carried out to identify the most representative topics and terms in the field, and the analytic-synthetic and historical-logical research methods were used to examine the fundamental and characteristic elements of data science and of data scientists, and to identify the different processes, solutions, and tools and how they have evolved over time. The main conclusions are the following: the wide application of data science means that many different solutions exist, each closely tied to the application area and the characteristics of the problem; driven by Big Data, machine learning is used to solve the problems in most cases; and the most widely used techniques are linear regression, k-Nearest Neighbors (k-NN), k-means, logistic regression, Bayesian networks, support vector machines, and neural networks.

Keywords: Data science; Data scientist; Machine learning


UCE Ciencia. Revista de postgrado - ISSN 2306-3556 - Vicerrectoría de Estudios de Graduados - Universidad Central del Este (UCE) - Address: Av. Francisco Alberto Caamaño Deñó, San Pedro de Macorís, Dominican Republic - Telephone: +1 809-529-3562 Fax: +1 809-529-5146 - Email: uceciencia@uce.edu.do - Website: www.uceciencia.edu.do, developed as an adaptation of Open Journal Systems - This journal is published under a license that permits use of its contents and the creation of derivative works, provided that such uses are non-commercial and the rights of the authors are acknowledged.