Saturday, September 19, 2009

ABAP RegEx met CSV and said 'I'll keep a Tab on you'


I wrote about XML processing in my previous posts:
As much as we like to use XML for data-exchange, there are still a lot of legacy(pre-XML) data files floating between the systems. CSV files are used for ages and there are already some functions available in ABAP for the purpose. But all these standard or custom methods of parsing CSV, require more than 10 lines of code.
[ Image and comic from XKCD ]



Arrival of RegEx [ Regular Expressions ] in ABAP should be able to put an end to the miseries, developers have been facing with string ( match/find/replace ) operations.

If you would like to start with simple code, see some examples here. Also, in my attempt to learn RegEx by examples, I tried to write a CSV parser in ABAP and the code-snippet for converting the CSV string-line is as below:
REPORT  z_ram_test_csv_to_itab_regex.

DATA : g_input TYPE string. "g_input(100) TYPE C.

g_input = '"F1",F2,""Hello",  "How" are "you"","F3"'.

REPLACE ALL OCCURRENCES OF
  REGEX ',(?=(?:[^"]*$)|(?:[^"]*"[^"]*"[^"]*)*$)'
  IN g_input WITH ';'.

REPLACE FIRST OCCURRENCE OF
  REGEX '^"'
  IN g_input WITH ' '.

REPLACE ALL OCCURRENCES OF
  REGEX '";"|";|;"'
  IN g_input WITH ';'.

REPLACE ALL OCCURRENCES OF '""' IN g_input
  WITH '"'.    "Added during testing

REPLACE ALL OCCURRENCES OF REGEX '"$' IN g_input
  WITH ' '.

WRITE :/ g_input. 
"Now you can use SPLIT on ';' to get the values in itab
To keep it simple for the explanation, I replaced comma ( , ) in CSV with ( ; ). However,  Tab ( cl_abap_char_utilities=>HORIZONTAL_TAB ) should be used for productive code. I've not done complete testing of this code so can't be sure if it complies with all the CSV specifications. But in case you managed to test this code then please pass on your feedback as comments. And please do test this before using it for production purpose. Test cases can be adapted from the Wiki examples.

Some developers simply split the string on comma, without realising that there can be a comma within one of the text-fields. If not handled carefully, this could be really dangerous and potentially corrupt the database. Till your custom code is properly tested, or if you have an old SAP release, you should stick with the sap standard functions for parsing data. e.g. Function KCD_CSV_FILE_TO_INTERN_CONVERT or a custom adaptation of the same function.

OK so I don't understand much of this code either ;) . But I am still trying. The main part, the first RegEx, was gratefully received and adapted [ in other words copied and pasted :) ] from this article on 'Mastering regular expressions in PHP'.
Listing 4. Look-ahead example
                
$tab_data = preg_replace( '/
    ,                               # look for a comma
    (?=                             # then look ahead for
        (?:[^"]*$)                  # a string with no quotes and eol
        |                           #  -or-
        (?:[^"]*"[^"]*"[^"]*)*$     # a string with balanced quotes
    )                               # 
    /x', "\t", $csv_data );

Though the first RegEx is the main part of the code, I spent quite some time on the 4th one as well:

REPLACE ALL OCCURRENCES OF REGEX '"$' IN g_input
  WITH ' '.

It's so simple: Find the quote ( " ) at the end of String ( $ ).
But I could not make it work, at first. I also checked  with the Program DEMO_REGEX_TOY and could not find any problem with RegEx. Can you guess the problem? The hint is very clear in the code.

Later I realised that $ will not find the end of line, if input is declared as char [e.g. g_input(100) type C ] and not as a String.

I like writing about technical topics not that I don't have my own thoughts, ideas and philosophies on non-technical subjects. Good thing about technical topics is that you can be right or wrong or just plain absurd but there is little chance of being politically incorrect. Also when it comes to non-technical topics, it's rather difficult to put your ideas, thoughts into words without being misinterpreted. That's the reason why I respect people who managed to do that reasonably well.

I would like to point you towards  a Blog post by Thorsten Franz "Speaking about information gathering, preaching about collaboration". Though I reserve my view on these topics and my decision to express or not, I found that interesting and thought-provoking.

In my view, we should try to pursue the intersection of 'What we love doing' and 'Something that we are really good at'. [ Of course it should be legal and ethical, that goes without saying ]. Though it won’t hurt if you could improve on other aspects, we should try to identify and utilise the potential of our 'Core Competency'. I hear you say...Management Jargon...cut the crap.

As I mentioned 'Core Competency', that reminds me of my 'SOCIALVIBE' widget. I recently tried 'SOCIALVIBE' widget to support a charity. Well, if I can utilise my Blog's potential to generate some money for a charity then there is no harm, right? And that's acceptable even as per Ayn Rand's philosophy. Not that I am seeking approval.

However, I realised there is hardly much potential there. I get around 100 visits/day on this Blog but it could not generate even 1 hour of schooling for the children in past two weeks. Not sure if it's due to lack of potential in my Blog though.

I would suggest that the charity should take care of their core area and leave the revenue generation through online ads on someone who is good at that. Compared to Google Adsense which is based on keywords, 'SOCIALVIBE' widget's ads does not appeal or add value to the target audience. Basically, I will have to remove this 'SOCIALVIBE' widget because it does not work.

There are several areas around Charity that should be run like a Business. A quote from Shai Agassi's blog:
My reaction was that Bill Gates changed the rules of the game, when he decided to move into philanthropy and now all the silicon valley geeks found themselves in a new race. The race was defined by not who can give the most, rather by who can generate the biggest impact per buck spent.
By the way, If you’ve not read before then some of these books, I think, can help in developing a balanced perspective: Atlas Shrugged & Fountainhead by Ayn Rand. Major Barbara by George Bernard Shaw.

Back to the topic of this Blog; while writing this post, I thought of clearing some of my earlier assumptions. So here is that quick adaptations using tab instead of ';'.
REPORT  z_ram_test_csv_to_tab_regex.

DATA : g_input TYPE string.
DATA : l_tab TYPE c value cl_abap_char_utilities=>HORIZONTAL_TAB.
DATA : l_regex type string.

g_input = '"F1",F2,""Hello",  "How" are "you"","F3"'.

REPLACE ALL OCCURRENCES OF
  REGEX ',(?=(?:[^"]*$)|(?:[^"]*"[^"]*"[^"]*)*$)'
  IN g_input WITH l_tab.

REPLACE FIRST OCCURRENCE OF
  REGEX '^"'
  IN g_input WITH ' '.

CONCATENATE '"' l_tab '"|"' l_tab '|' l_tab '"' into l_regex.
REPLACE ALL OCCURRENCES OF
  REGEX l_regex                         "    '"l_tab"|"l_tab|l_tab"'
  IN g_input WITH l_tab.

REPLACE ALL OCCURRENCES OF '""' IN g_input
  WITH '"'.    "Added during testing

REPLACE ALL OCCURRENCES OF REGEX '"$' IN g_input
  WITH ' '.

WRITE :/ g_input.

Now let's test it with some examples from Wiki. Here # represents a tab display.
Input : 1997,Ford,E350
Output : 1997#Ford#E350
Input : 1997,Ford,E350,"Super, luxurious truck"
Output : 1997#Ford#E350#Super, luxurious truck
Input : 1997,Ford,E350,"Super ""luxurious"" truck"
Output : 1997#Ford#E350#Super "luxurious" truck
- This test actually failed with the original code. Added the line below :
REPLACE ALL OCCURRENCES OF '""' IN g_input
WITH '"'.
Input : 1997,Ford,E350," Super luxurious truck "
Output : 1997#Ford#E350# Super luxurious truck
Input : "1997","Ford","E350"
Output : 1997#Ford#E350
Looking back at the code, I think we can minimise the number of lines by merging some of the Replace Regex operations. May be later.

3 comments:

Anonymous said...

Ayn Rand's philosphy is too simplistic for me. At the end of day it involves kicking the weaker to the streets. The best examples to me of such ideas in action are India (cheap labor ueber alles) or the U.S. (keep the "coloreds" down).

Ayn Rand's ojbectivism appeals mainly to young people searching and the already rich. Life is complicated not simple. You allude to that when describing how programming is mostly simple, i.e. wrong or right. I said that to someone a bit differently - "it is too bad we cannot read people's intentions like a file system."

Thanks for sharing some of your code and thoughts.

Ram Manohar Tiwari said...
This post has been removed by the author.
Ram Manohar Tiwari said...

Thanks for your comments.
And I hope you are not my wife :) Because just a few days back we discussed this topic with similar arguments but I eventually managed to explain that her interpretation is incorrect.
Actually, I don't think any philosophy can really be that simple. Even Ayn Rand's philosophy is wrongly interpreted.

***At the end of day it involves kicking the weaker to the streets***

No that's not the philosophy of Ayn Rand. it's about to model yourself to be a hero in your chosen field and respect others who are sharing the burden.

However, it does condemn those who act more like parasites [ The philosophy as per that "It's a moral duty of Intelligent and intellectuals to serve the idiots". ] Also, I don't know what made you think that Ayn Rand's philosophy is in action and that too in a so called socialist India.

***Ayn Rand's ojbectivism appeals mainly to young people searching and the already rich. ***

I've seen this as an argument on many discussions. Who did this survey by the way? Because I think it might just be right.

Remember when in a corporate environment, we answer to a new joiner...."it's a good idea, even I thought of this before..but you know what ..it does not work". Not that it's a bad idea but it's hard to change the way corporate environment works..and then you mention several problems..the problem which were there with you for so long, you are now suffering from 'Stockholm syndrome'.

Thanks for your comment again,
Ram

Post a Comment

comment

Copyright

MyFreeCopyright.com Registered & Protected