Saturday, September 19, 2009

ABAP RegEx met CSV and said 'I'll keep a Tab on you'


I wrote about XML processing in my previous posts:
As much as we like to use XML for data-exchange, there are still a lot of legacy(pre-XML) data files floating between the systems. CSV files are used for ages and there are already some functions available in ABAP for the purpose. But all these standard or custom methods of parsing CSV, require more than 10 lines of code.
[ Image and comic from XKCD ]



Arrival of RegEx [ Regular Expressions ] in ABAP should be able to put an end to the miseries, developers have been facing with string ( match/find/replace ) operations.

If you would like to start with simple code, see some examples here. Also, in my attempt to learn RegEx by examples, I tried to write a CSV parser in ABAP and the code-snippet for converting the CSV string-line is as below:






REPORT  z_ram_test_csv_to_itab_regex.

DATA : g_input TYPE string. "g_input(100) TYPE C.

g_input = '"F1",F2,""Hello",  "How" are "you"","F3"'.

REPLACE ALL OCCURRENCES OF
  REGEX ',(?=(?:[^"]*$)|(?:[^"]*"[^"]*"[^"]*)*$)'
  IN g_input WITH ';'.

REPLACE FIRST OCCURRENCE OF
  REGEX '^"'
  IN g_input WITH ' '.

REPLACE ALL OCCURRENCES OF
  REGEX '";"|";|;"'
  IN g_input WITH ';'.

REPLACE ALL OCCURRENCES OF '""' IN g_input
  WITH '"'.    "Added during testing

REPLACE ALL OCCURRENCES OF REGEX '"$' IN g_input
  WITH ' '.

WRITE :/ g_input. 
"Now you can use SPLIT on ';' to get the values in itab
To keep it simple for the explanation, I replaced comma ( , ) in CSV with ( ; ). However,  Tab ( cl_abap_char_utilities=>HORIZONTAL_TAB ) should be used for productive code. I've not done complete testing of this code so can't be sure if it complies with all the CSV specifications. But in case you managed to test this code then please pass on your feedback as comments. And please do test this before using it for production purpose. Test cases can be adapted from the Wiki examples.

Some developers simply split the string on comma, without realising that there can be a comma within one of the text-fields. If not handled carefully, this could be really dangerous and potentially corrupt the database. Till your custom code is properly tested, or if you have an old SAP release, you should stick with the sap standard functions for parsing data. e.g. Function KCD_CSV_FILE_TO_INTERN_CONVERT or a custom adaptation of the same function.

OK so I don't understand much of this code either ;) . But I am still trying. The main part, the first RegEx, was gratefully received and adapted [ in other words copied and pasted :) ] from this article on 'Mastering regular expressions in PHP'.
Listing 4. Look-ahead example
                
$tab_data = preg_replace( '/
    ,                               # look for a comma
    (?=                             # then look ahead for
        (?:[^"]*$)                  # a string with no quotes and eol
        |                           #  -or-
        (?:[^"]*"[^"]*"[^"]*)*$     # a string with balanced quotes
    )                               # 
    /x', "\t", $csv_data );

Though the first RegEx is the main part of the code, I spent quite some time on the 4th one as well:

REPLACE ALL OCCURRENCES OF REGEX '"$' IN g_input
  WITH ' '.

It's so simple: Find the quote ( " ) at the end of String ( $ ).
But I could not make it work, at first. I also checked  with the Program DEMO_REGEX_TOY and could not find any problem with RegEx. Can you guess the problem? The hint is very clear in the code.

Later I realised that $ will not find the end of line, if input is declared as char [e.g. g_input(100) type C ] and not as a String.

I like writing about technical topics not that I don't have my own thoughts, ideas and philosophies on non-technical subjects. Good thing about technical topics is that you can be right or wrong or just plain absurd but there is little chance of being politically incorrect. Also when it comes to non-technical topics, it's rather difficult to put your ideas, thoughts into words without being misinterpreted. That's the reason why I respect people who managed to do that reasonably well.

I would like to point you towards  a Blog post by Thorsten Franz "Speaking about information gathering, preaching about collaboration". Though I reserve my view on these topics and my decision to express or not, I found that interesting and thought-provoking.

In my view, we should try to pursue the intersection of 'What we love doing' and 'Something that we are really good at'. [ Of course it should be legal and ethical, that goes without saying ]. Though it won’t hurt if you could improve on other aspects, we should try to identify and utilise the potential of our 'Core Competency'. I hear you say...Management Jargon...cut the crap.

As I mentioned 'Core Competency', that reminds me of my 'SOCIALVIBE' widget. I recently tried 'SOCIALVIBE' widget to support a charity. Well, if I can utilise my Blog's potential to generate some money for a charity then there is no harm, right? And that's acceptable even as per Ayn Rand's philosophy. Not that I am seeking approval.

However, I realised there is hardly much potential there. I get around 100 visits/day on this Blog but it could not generate even 1 hour of schooling for the children in past two weeks. Not sure if it's due to lack of potential in my Blog though.

I would suggest that the charity should take care of their core area and leave the revenue generation through online ads on someone who is good at that. Compared to Google Adsense which is based on keywords, 'SOCIALVIBE' widget's ads does not appeal or add value to the target audience. Basically, I will have to remove this 'SOCIALVIBE' widget because it does not work.

There are several areas around Charity that should be run like a Business. A quote from Shai Agassi's blog:
My reaction was that Bill Gates changed the rules of the game, when he decided to move into philanthropy and now all the silicon valley geeks found themselves in a new race. The race was defined by not who can give the most, rather by who can generate the biggest impact per buck spent.
By the way, If you’ve not read before then some of these books, I think, can help in developing a balanced perspective: Atlas Shrugged & Fountainhead by Ayn Rand. Major Barbara by George Bernard Shaw.

Back to the topic of this Blog; while writing this post, I thought of clearing some of my earlier assumptions. So here is that quick adaptations using tab instead of ';'.
REPORT  z_ram_test_csv_to_tab_regex.

DATA : g_input TYPE string.
DATA : l_tab TYPE c value cl_abap_char_utilities=>HORIZONTAL_TAB.
DATA : l_regex type string.

g_input = '"F1",F2,""Hello",  "How" are "you"","F3"'.

REPLACE ALL OCCURRENCES OF
  REGEX ',(?=(?:[^"]*$)|(?:[^"]*"[^"]*"[^"]*)*$)'
  IN g_input WITH l_tab.

REPLACE FIRST OCCURRENCE OF
  REGEX '^"'
  IN g_input WITH ' '.

CONCATENATE '"' l_tab '"|"' l_tab '|' l_tab '"' into l_regex.
REPLACE ALL OCCURRENCES OF
  REGEX l_regex                         "    '"l_tab"|"l_tab|l_tab"'
  IN g_input WITH l_tab.

REPLACE ALL OCCURRENCES OF '""' IN g_input
  WITH '"'.    "Added during testing

REPLACE ALL OCCURRENCES OF REGEX '"$' IN g_input
  WITH ' '.

WRITE :/ g_input.


Now let's test it with some examples from Wiki. Here # represents a tab display.
Input : 1997,Ford,E350
Output : 1997#Ford#E350
Input : 1997,Ford,E350,"Super, luxurious truck"
Output : 1997#Ford#E350#Super, luxurious truck
Input : 1997,Ford,E350,"Super ""luxurious"" truck"
Output : 1997#Ford#E350#Super "luxurious" truck
- This test actually failed with the original code. Added the line below :
REPLACE ALL OCCURRENCES OF '""' IN g_input
WITH '"'.
Input : 1997,Ford,E350," Super luxurious truck "
Output : 1997#Ford#E350# Super luxurious truck
Input : "1997","Ford","E350"
Output : 1997#Ford#E350
Looking back at the code, I think we can minimise the number of lines by merging some of the Replace Regex operations. May be later.

14 comments:

  1. Ayn Rand's philosphy is too simplistic for me. At the end of day it involves kicking the weaker to the streets. The best examples to me of such ideas in action are India (cheap labor ueber alles) or the U.S. (keep the "coloreds" down).

    Ayn Rand's ojbectivism appeals mainly to young people searching and the already rich. Life is complicated not simple. You allude to that when describing how programming is mostly simple, i.e. wrong or right. I said that to someone a bit differently - "it is too bad we cannot read people's intentions like a file system."

    Thanks for sharing some of your code and thoughts.

    ReplyDelete
  2. This comment has been removed by the author.

    ReplyDelete
  3. Thanks for your comments.
    And I hope you are not my wife :) Because just a few days back we discussed this topic with similar arguments but I eventually managed to explain that her interpretation is incorrect.
    Actually, I don't think any philosophy can really be that simple. Even Ayn Rand's philosophy is wrongly interpreted.

    ***At the end of day it involves kicking the weaker to the streets***

    No that's not the philosophy of Ayn Rand. it's about to model yourself to be a hero in your chosen field and respect others who are sharing the burden.

    However, it does condemn those who act more like parasites [ The philosophy as per that "It's a moral duty of Intelligent and intellectuals to serve the idiots". ] Also, I don't know what made you think that Ayn Rand's philosophy is in action and that too in a so called socialist India.

    ***Ayn Rand's ojbectivism appeals mainly to young people searching and the already rich. ***

    I've seen this as an argument on many discussions. Who did this survey by the way? Because I think it might just be right.

    Remember when in a corporate environment, we answer to a new joiner...."it's a good idea, even I thought of this before..but you know what ..it does not work". Not that it's a bad idea but it's hard to change the way corporate environment works..and then you mention several problems..the problem which were there with you for so long, you are now suffering from 'Stockholm syndrome'.

    Thanks for your comment again,
    Ram

    ReplyDelete
  4. goog job! I find your examples helpful. Best regards.

    ReplyDelete
  5. Great work! The code helped me a lot to convert CSV into internal table.

    I was trying the code for the following text and it's not giving me the correct results. When I use split at cl_abap_char_utilities=>HORIZONTAL_TAB, I get 11 records in internal table but I expected to get 10 records.

    6_1681,QUAL-Q,SP12-BC Building Controller,"SP12-BC Building, Controller#",,,,20131122,I,20131122

    ReplyDelete
    Replies
    1. To be precise, here's how internal table looks like.

      Row Value
      ------ ------------------------------------------
      1 6_1681
      2 QUAL-Q
      3 SP12-BC Building Controller
      4 SP12-BC Building
      5 Controller#
      6
      7
      8
      9 20131122
      10 I
      11 20131122

      Delete
    2. hmm..we normally don't expect a tab character within the data-fields :)
      Looks like your input string itself has a tab "SP12-BC Building, Controller#" ( # seems to be the tab char, displayed as # in debugger or SAP display screens ).
      I think, you should remove all the horizontal tabs in input data at the beginning of code - you probably don't want the tab values to be updated in the SAP table fields.

      Cheers,
      Ram



      Delete
    3. ..so I think you can add following as 2nd line of the program code - just after getting g_input string, replace tabs with space or remove. e.g.
      REPLACE FIRST OCCURRENCE OF
      cl_abap_char_utilities=>HORIZONTAL_TAB
      IN g_input WITH space.

      Delete
    4. Ram,
      When I tried to use 'REPLACE ALL OCCURRENCES OF cl_abap_char_utilities=>horizontal_tab IN p_text WITH space.', it does not find any tab char( sy-subrc = 4).

      Thanks,
      Z

      Delete
    5. here's how text looks like If I open in Notepad.It looks like a line feed or carriage return. I tried to use replace with CL_ABAP_CHAR_UTILITIES=>CR_LF and
      CL_ABAP_CHAR_UTILITIES=>NEWLINE but no luck.

      6_1681,QUAL-Q,SP12-BC Building Controller,"SP12-BC Building, Controller
      ",,,,20131122,I,20131122

      Delete
    6. ..or check the HEX character in ABAP debugger for # in input string :)

      Delete
  6. ..worth checking the file in Notepad++ hex display etc to find out the Hex value of that character?

    ReplyDelete
  7. Thanks,Ram!

    It's working now with the following code:

    data: v_crlf TYPE abap_cr_lf,
    v_cr TYPE abap_char1.

    v_crlf = cl_abap_char_utilities=>cr_lf.
    v_cr = v_crlf+0(1).

    REPLACE ALL OCCURRENCES OF v_cr IN p_text WITH space.

    ReplyDelete
    Replies
    1. ...interesting. if that's the case then "SPLIT into itab" has a problem at your version (ABAP level /Unicode etc).
      I am unable to replicate it with a prepared string like one below and then spiting it at tab . well, never mind :)

      * Prepare a string having a CR just before a tab
      CONCATENATE 'A' cl_abap_char_utilities=>HORIZONTAL_TAB 'B'
      v_cr cl_abap_char_utilities=>HORIZONTAL_TAB 'C' into g_input.

      Delete

Info : Comments are reviewed before they are published. SPAMs are rejected.

Copyright